The Invisible Fingerprint in Code

Digital Traces in Code
Every program carries characteristic patterns of its developers, from the choice of variable names to the preferred programming paradigm. Some developers rely on iterative solutions using loops, while others prefer recursive approaches. These individual coding structures reflect the author’s personal style. Until now, however, such features could only be analyzed when the source code was available or when the potential author was already known during the AI’s training phase. This limitation is a particular problem for analyzing malicious software, which is typically available only as compiled machine code.
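To make the idea of a stylistic fingerprint concrete, here is a small illustrative sketch in Python. Both functions and their names are invented for this example; they compute the same result, but one hypothetical author reaches for a loop and the other for recursion.

```python
# Two hypothetical authors solve the same task: summing the decimal digits
# of a non-negative integer. The results match; the style does not.


def digit_sum_iterative(n: int) -> int:
    """Author A: loop-based style with an explicit accumulator."""
    total = 0
    while n > 0:
        total += n % 10
        n //= 10
    return total


def digit_sum_recursive(n: int) -> int:
    """Author B: recursive style without mutable state."""
    if n < 10:
        return n
    return n % 10 + digit_sum_recursive(n // 10)


if __name__ == "__main__":
    # Same output for both, even though the code is structured differently.
    print(digit_sum_iterative(4096), digit_sum_recursive(4096))
```

Differences like these are easy to spot in source code. The harder question is how much of them survives compilation.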
In collaboration with the Agentur für Innovation in der Cybersicherheit GmbH (Innovation for Cybersecurity), our team at the University of Lübeck in Germany has developed a new approach called OCEAN (Open-World Contrastive Authorship Identification, https://arxiv.org/pdf/2412.05049) as part of the SOVEREIGN research project (https://sovereign-project.de/). OCEAN recognizes an author’s coding style even in highly optimized machine code.
Operating in an ‘open world’ scenario, OCEAN makes informed judgments even if the suspected author was not included in the training data.
However, the approach has only been tested on regular, unobfuscated code. Given that some malicious software uses targeted obfuscation techniques to hide its origins, we plan to investigate countermeasures in future research.
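The open-world idea can be sketched as a pairwise check rather than a lookup in a fixed list of known authors. The following minimal Python example assumes that some model has already turned each program into a numeric embedding vector; the vectors, the function names and the threshold of 0.8 are illustrative assumptions, not values from OCEAN.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_author(emb_a: Sequence[float], emb_b: Sequence[float],
                threshold: float = 0.8) -> bool:
    """Decide whether two programs share an author (threshold is assumed)."""
    return cosine_similarity(emb_a, emb_b) >= threshold


if __name__ == "__main__":
    # Hypothetical embeddings; neither author needs to appear in training data.
    sample_a = [0.90, 0.10, 0.20]
    sample_b = [0.85, 0.15, 0.25]   # stylistically close to sample_a
    sample_c = [0.10, 0.90, 0.30]   # stylistically different
    print(same_author(sample_a, sample_b))
    print(same_author(sample_a, sample_c))
```

Because the decision depends only on the two samples being compared, it does not require the suspected author to be one of a predefined set of classes.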
Securing the Software Supply Chain
Within the SOVEREIGN project, which is dedicated to the protection of critical infrastructures, OCEAN plays a central role in securing the software supply chain. Attacks that manipulate updates to introduce backdoors are an increasingly serious threat. Such manipulations often go unnoticed because the injected code integrates seamlessly with existing software. This is where OCEAN’s technology comes in: by comparing the coding style of an update with the software’s previous state, it can detect stylistically inconsistent code segments that may signal malicious interference.
In a case study, we simulated a supply-chain attack by integrating known malware into a software update and measured a significant deviation in authorship. While this does not automatically mean that a supply chain has been compromised, it should act as a warning signal, prompting either further automated analysis of the code block or a request that the affected companies check for changes in their development teams.
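The comparison between an update and the software’s previous state could look roughly like the following Python sketch. It assumes that each code segment has already been mapped to a style embedding; the segment names, the vectors, the centroid-based reference and the threshold of 0.5 are all hypothetical choices made for illustration and do not describe the actual pipeline.

```python
import math
from typing import Dict, List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the baseline embeddings into one reference style vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def flag_inconsistent_segments(baseline: Sequence[Sequence[float]],
                               update: Dict[str, Sequence[float]],
                               threshold: float = 0.5) -> List[str]:
    """Return the update segments whose style deviates from the baseline."""
    reference = centroid(baseline)
    return [name for name, emb in update.items()
            if cosine(reference, emb) < threshold]


if __name__ == "__main__":
    # Hypothetical embeddings of the previous release.
    previous_release = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]]
    # Hypothetical embeddings of segments in the new update.
    new_update = {
        "parse_config": [0.85, 0.15],
        "check_license": [0.95, 0.05],
        "send_telemetry": [0.05, 0.95],   # stylistic outlier
    }
    print(flag_inconsistent_segments(previous_release, new_update))
```

A flagged segment is a prompt for closer inspection, not proof of compromise: a new team member or a vendored library would also shift the style.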
The Technology Behind OCEAN
At its core, OCEAN relies on contrastive learning, a machine learning technique that compares programs in pairs to determine if they were created by the same author. This method is similar to the one used in smartphone facial recognition systems, where contrastive learning helps to verify if two images belong to the same person.
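As a rough illustration of pairwise contrastive training, the sketch below implements the classic margin-based contrastive loss in plain Python. The article does not specify which loss OCEAN uses, so this formulation, the margin of 1.0 and the example embeddings are assumptions chosen only to show the principle: same-author pairs are pulled together, different-author pairs are pushed at least a margin apart.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a: Sequence[float], emb_b: Sequence[float],
                     same_author: bool, margin: float = 1.0) -> float:
    """Margin-based pairwise loss (an assumed, generic formulation).

    Same-author pairs are penalized by their squared distance.
    Different-author pairs are penalized only while they are closer
    than the margin.
    """
    distance = euclidean_distance(emb_a, emb_b)
    if same_author:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    near = [0.06, 0.08]   # distance of about 0.1 from the anchor
    far = [3.0, 4.0]      # distance 5.0 from the anchor

    print(contrastive_loss(anchor, near, same_author=True))    # small penalty
    print(contrastive_loss(anchor, near, same_author=False))   # large penalty
    print(contrastive_loss(anchor, far, same_author=False))    # no penalty
```

Minimizing such a loss over many pairs shapes the embedding space so that distance reflects authorship, which is what later allows two unseen programs to be compared directly.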
The system’s backbone is a neural network specially optimized for processing program code. A significant advantage is that OCEAN does not depend solely on the original source code, but also works reliably with machine code. Tests conducted on real open-source programs demonstrated that the system can achieve an accuracy of 86%, even under high levels of compiler optimization. Compiler optimization automatically restructures the machine code to make it run faster, which can alter or erase distinctive stylistic traits of the original code and thereby makes the author’s style harder to identify.
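A loose analogy for how compilation can erase stylistic choices can be observed in CPython itself. The sketch below is not native machine code and has nothing to do with OCEAN’s implementation; it only relies on the constant folding that the CPython bytecode compiler performs, and the two function names are invented for this example. One hypothetical author spells out a constant as a product, the other writes the literal, and the compiled instructions end up looking the same.

```python
import dis
from typing import Callable, List


def seconds_per_day_spelled_out() -> int:
    # Author A documents the calculation in the expression itself.
    return 60 * 60 * 24


def seconds_per_day_literal() -> int:
    # Author B writes the precomputed value directly.
    return 86400


def opnames(func: Callable) -> List[str]:
    """List the bytecode instruction names CPython generated for func."""
    return [instruction.opname for instruction in dis.get_instructions(func)]


if __name__ == "__main__":
    # After constant folding, no multiplication remains in either function.
    print(opnames(seconds_per_day_spelled_out))
    print(opnames(seconds_per_day_literal))
```

An optimizing compiler for native code applies far more aggressive transformations than this, such as inlining and loop restructuring, which is why retaining authorship signals at high optimization levels is notable.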
Outlook
OCEAN expands the horizon for digital forensics by highlighting programming style as a unique marker for cyber defense. Traditional methods have often been limited to source code analysis, but this new approach demonstrates that binary files can also contain valuable information about their developer, even in a realistic ‘open world’ scenario.
The ability to identify the creator of malware based on coding style opens entirely new avenues in IT security. Cybercriminals inevitably leave behind individual stylistic traces when developing malicious software. With OCEAN, it becomes possible to link seemingly independent attacks and pinpoint potential culprits. For investigative authorities, this method could soon serve as a crucial forensic tool in court proceedings.
Beyond tracing cyber attacks, this technology opens up new possibilities for answering fundamental questions about code provenance and attribution. In the future, OCEAN might even help distinguish automatically generated code from human-written code, an aspect that is particularly relevant when considering accountability in AI-assisted programming.
At the same time, this technological advancement raises critical concerns. The capability to link code to individual developers through their programming style could be misused by corporations or government agencies to surveil developers. Protecting privacy and ensuring digital anonymity therefore remain as important as enhancing IT security.
The subtle nuances in coding styles have already proven to be reliably detectable, bringing both new opportunities and challenges. The future of digital forensics will depend not only on technological progress but also on a responsible approach to harnessing these new capabilities.
About the Author
Felix Mächtle is a researcher at the University of Lübeck specializing in IT security and machine learning. He is pursuing his PhD at the Institute for IT Security, focusing on machine learning, code analysis and cybersecurity. Felix is also a member of the research network AI Grid, an initiative that promotes exchange between young talents and leading AI experts. In the AI Grid micro-focus group for IT security and AI, he works with other researchers on sharing innovative approaches to protecting against digital threats.