ADELPHI, Md. (Jan. 15, 2016) -- Literary critics may know a writer by style. In much the same way, a machine learning algorithm can identify who wrote a chunk of computer code from the nuances of its author’s style.
Writing style extends beyond prose: even in computer languages, work can be attributed to its author in minutes with near-perfect accuracy, at least in a lab.
That is what a team of university students tested during their time at the U.S. Army Research Laboratory, or ARL, said Richard Harang, ARL network security researcher and technical lead, who described the result as “a tool kit that may one day help analysts to identify malware authors more quickly.”
The code stylometry study, presented by Aylin Caliskan-Islam at the 32nd Chaos Communication Congress, looked at samples from 1,600 coders and could determine the author of a particular code excerpt with 94 percent accuracy. In a “top five suspects” match, the precision was near perfect.
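To make the “top five suspects” idea concrete, here is a minimal sketch in Python. It is not the study's feature set or classifier. It assumes a handful of hand-picked layout and naming ratios and a simple nearest-profile ranking, and the names style_features, build_profiles, top_suspects and the sample authors are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import defaultdict

# Assumption: a tiny, hand-picked keyword list; a real system would use far more features.
KEYWORDS = ("for", "while", "if", "else", "return")


def style_features(source: str) -> list[float]:
    """Map a code snippet to a small vector of layout and naming ratios."""
    lines = source.splitlines() or [""]
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    idents = [t for t in tokens if t not in KEYWORDS]
    n_lines, n_tokens, n_idents = len(lines), max(len(tokens), 1), max(len(idents), 1)
    feats = [
        sum(l.startswith("\t") for l in lines) / n_lines,               # tab-indented lines
        sum(l.startswith("  ") for l in lines) / n_lines,               # space-indented lines
        sum(l.strip() == "{" for l in lines) / n_lines,                 # brace on its own line
        sum("_" in t for t in idents) / n_idents,                       # snake_case names
        sum(any(c.isupper() for c in t[1:]) for t in idents) / n_idents,  # camelCase names
    ]
    feats += [tokens.count(k) / n_tokens for k in KEYWORDS]             # keyword frequencies
    return feats


def build_profiles(samples: list[tuple[str, str]]) -> dict[str, list[float]]:
    """Average the feature vectors of each author's known samples."""
    grouped = defaultdict(list)
    for author, source in samples:
        grouped[author].append(style_features(source))
    return {a: [sum(col) / len(col) for col in zip(*vecs)] for a, vecs in grouped.items()}


def top_suspects(profiles: dict[str, list[float]], source: str, k: int = 5) -> list[str]:
    """Return up to k authors whose profiles lie closest to the snippet's features."""
    vec = style_features(source)
    return sorted(profiles, key=lambda a: math.dist(vec, profiles[a]))[:k]


# Hypothetical known samples: "alice" uses tabs and snake_case, "bob" uses spaces and camelCase.
SAMPLES = [
    ("alice", "int sum_all(int n)\n{\n\tint total_sum = 0;\n\tfor (int i = 0; i < n; i++)\n"
              "\t{\n\t\ttotal_sum += i;\n\t}\n\treturn total_sum;\n}\n"),
    ("bob", "int sumAll(int n) {\n    int totalSum = 0;\n    for (int i = 0; i < n; i++) {\n"
            "        totalSum += i;\n    }\n    return totalSum;\n}\n"),
]
profiles = build_profiles(SAMPLES)

unknown = "int max_of(int a, int b)\n{\n\tif (a > b)\n\t{\n\t\treturn a;\n\t}\n\treturn b;\n}\n"
print(top_suspects(profiles, unknown))
```

Returning a ranked list rather than a single name mirrors the “top five suspects” framing: an analyst gets a short list to investigate instead of one verdict. Like the study, the sketch only works when known samples from the candidate authors already exist.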
The research also examined authorship attribution of executable binaries from the standpoint of machine learning. It used a novel set of features, including ones obtained by decompiling the executable binary to source code, and showed that many features of source code can be extracted from the decompiled binary, according to the team’s recent paper, “When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries.”
The team is trying to address the problem of identifying the authors of malicious code and software. It includes Caliskan-Islam, a Princeton University post-doctoral candidate who started working on the project as a graduate student; Fabian Yamaguchi from the University of Göttingen; and Edwin Dauber from Drexel University.
The next step in this fundamental research will be to extend the current result to more flexible working conditions.
“Attribution is a real challenge [as opposed to detection], as it is done manually by experts who have to reconcile forensics following an attack,” Harang said. “Currently, human analysis is the common tool. It works, but it can be slow and take a lot of resources. We are developing a toolkit to make it a lot faster and cheaper to support analysts in identifying bad actors.”
One limitation is that success depends on having existing samples from potential authors. Future challenges include accounting for the tricks malware authors use to heavily obfuscate, or mask, their software, and extending the experiments to code written by multiple authors.
The goal for ARL is to develop basic and applied science and tools to defend Army networks, said Jerry Clarke, chief of ARL’s Network Security Branch.
This is fast-moving research and the study is making strides, Harang said.
According to the findings, the researchers demonstrated that authorship attribution also works on real-world code found “in the wild” by applying it to single-author GitHub repositories.
“This basic research shows that identifying authors of computer programs based on coding style is possible and worth pursuing,” Harang said. “This is collaborative research that builds upon a lot of good work before us.”
Professor Rachel Greenstadt at Drexel has been very active in this research, which has also drawn on contributions from Professor Arvind Narayanan at Princeton and Professor Konrad Rieck from the University of Göttingen.
“We have a novel technique that moves the ball forward. But there is work to be done.”
The U.S. Army Research Laboratory is part of the U.S. Army Research, Development and Engineering Command, which has the mission to ensure decisive overmatch for unified land operations to empower the Army, the joint warfighter and our nation. RDECOM is a major subordinate command of the U.S. Army Materiel Command.