Reassessing Code Authorship Attribution in the Era of Language Models

arXiv CS Monday 01 June 2026, 04:00 UTC By Atish Kumar Dipongkor, Ziyu Yao, Kevin Moran 1 min read

Key Points

Announce Type: replace Abstract: The study of Code Stylometry, and in particular Code Authorship Attribution (CAA), aims to analyze coding styles to identify the authors of code samples. CAA has been illustrated to be an important component of automating software engineering (SE) tasks such as bug triaging, fault localization, and test prioritization. In addition, CAA is also important in cybersecurity and software forensics for addressing copyright disputes and detecting plagiarism.

arXiv:2506.17120v2 Announce Type: replace Abstract: The study of Code Stylometry, and in particular Code Authorship Attribution (CAA), aims to analyze coding styles to identify the authors of code samples. CAA has been illustrated to be an important component of automating software engineering (SE) tasks such as bug triaging, fault localization, and test prioritization. In addition, CAA is also important in cybersecurity and software forensics for addressing copyright disputes and detecting plagiarism. Past techniques for CAA tend to leverage hand-crafted code-related features typically carry limitations that prevent proper authorship characterization and lead to sensitivities to adversarial attacks. Recently, transformer-based Language Models (LMs) have shown remarkable efficacy across a range of SE tasks, and in authorship attribution for natural language in the NLP domain. However, their effectiveness in CAA is not well understood. As such, we conduct the first extensive empirical study applying two larger state-of-the-art code LMs, and five smaller code LMs to the task of CAA on six diverse datasets that encompass 12k code snippets written by 463 developers. Furthermore, we perform an in-depth quantitative and qualitative analysis of our studied models' performance on CAA using established interpretability techniques. Our results illustrate important aspects of the behavior of LMs in understanding stylometric code patterns.

the Era of Language Models arXiv:2506.17120v2 (EVENT) Code Stylometry (ORG) CAA (ORG) SE (PERSON) NLP (ORG)

Originally published by arXiv CS Read original →

Reassessing Code Authorship Attribution in the Era of Language Models

Related Stories

Google will save your Lens photos, Search Live recordings, and Translate audio for AI training

ASML to Cut Fewer Jobs Than Planned After Union Negotiations

Engadget Podcast: WWDC 2026 thoughts from Apple Park

German court holds Google liable for false AI Overview answers