Technology
Reassessing Code Authorship Attribution in the Era of Language Models
Key Points
Announce Type: replace Abstract: The study of Code Stylometry, and in particular Code Authorship Attribution (CAA), aims to analyze coding styles to identify the authors of code samples. CAA has been illustrated to be an important component of automating software engineering (SE) tasks such as bug triaging, fault localization, and test prioritization. In addition, CAA is also important in cybersecurity and software forensics for addressing copyright disputes and detecting plagiarism.
arXiv:2506.17120v2 Announce Type: replace
Abstract: The study of Code Stylometry, and in particular Code Authorship Attribution (CAA), aims to analyze coding styles to identify the authors of code samples. CAA has been illustrated to be an important component of automating software engineering (SE) tasks such as bug triaging, fault localization, and test prioritization. In addition, CAA is also important in cybersecurity and software forensics for addressing copyright disputes and detecting plagiarism. Past techniques for CAA tend to leverage hand-crafted code-related features typically carry limitations that prevent proper authorship characterization and lead to sensitivities to adversarial attacks.
Recently, transformer-based Language Models (LMs) have shown remarkable efficacy across a range of SE tasks, and in authorship attribution for natural language in the NLP domain. However, their effectiveness in CAA is not well understood. As such, we conduct the first extensive empirical study applying two larger state-of-the-art code LMs, and five smaller code LMs to the task of CAA on six diverse datasets that encompass 12k code snippets written by 463 developers. Furthermore, we perform an in-depth quantitative and qualitative analysis of our studied models' performance on CAA using established interpretability techniques. Our results illustrate important aspects of the behavior of LMs in understanding stylometric code patterns.