Program Identifiability: How easily can you spot your code?
Picking the right identifiability metric
We wanted a measure that shows how easily application code can be identified after it has been compiled. Some source code elements, including many identifiers and all strings, remain in bytecode after the source code has been compiled, while other source code elements such as statements and comments are usually removed during compilation.
CodeMatch lumps strings and comments together because these elements are functionally similar, so we were constrained to treat them as one type of element. For determining identifiability it would be better to consider them separately, since strings survive compilation and comments usually do not, but this is a limitation of the tool that we have to live with for now.
The measure of identifiability requires that we determine the percentage of source code elements that remain in an application’s bytecode. We also wanted to decompile bytecode back into source code and again determine the percentage of the source code elements from the original source code that can be found in the resulting decompiled source code.
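The percentage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how CodeSuite computes it: the function name `retention_percentage` and the sample identifiers are hypothetical, and the sketch counts distinct elements rather than occurrences.

```python
def retention_percentage(source_elements, compiled_elements):
    """Percentage of distinct source elements that also appear in the compiled
    (or decompiled) code. Assumption: elements are compared as exact strings."""
    source = set(source_elements)
    if not source:
        return 0.0  # no elements to find, so nothing is retained
    retained = source & set(compiled_elements)
    return 100.0 * len(retained) / len(source)


# Hypothetical identifiers: two of the four source names survive compilation.
source_ids = ["SudokuBoard", "cellCount", "mTimer", "tmp"]
bytecode_ids = ["SudokuBoard", "cellCount", "a", "b"]

print(retention_percentage(source_ids, bytecode_ids))  # 50.0
```

The same function could be applied twice, once to the elements found in the bytecode and once to the elements found in the decompiled source, to obtain the two percentages.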
Source code element metrics. To begin the identifiability measurement calculation, we need to know how many total elements exist in a program’s code and how many of those elements are uncommon. Uncommon elements are more helpful in determining the identifiability of the application because common elements, by definition, are found in many programs.
Obtaining these metrics for an application involved running two CodeSuite functions. A CodeMatch comparison of the application’s source code to itself gives a list of all source code elements in the application. There are three types of elements that we consider: comments and strings (str), identifiers (id), and statements (stmt).
The total number of source code elements of each type in a particular application is represented as SE(str), SE(id), and SE(stmt). SourceDetective is a CodeSuite function that takes all of the elements found by CodeMatch, searches for each one on the Internet, and records the number of times it is found (its “hits”). The Internet search hit count h is then used to qualify the counts, that is, to separate common elements from uncommon ones. Table 2 below shows the totals returned from CodeMatch and SourceDetective for the Android game OpenSudoku. The numbers are taken from spreadsheets generated by CodeSuite.

In this case, elements with fewer than 25 hits were considered uncommon and good potential indicators of copying, and elements with 0 hits were considered unique: an element that cannot be found elsewhere through an Internet search is unique to that application. Future researchers may want to test a different threshold than 25 hits for labeling a source code element as uncommon, but this number worked well in these tests and in our experience.
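As a rough sketch of how the hit count h qualifies the element counts, the following Python tallies total, uncommon, and unique elements per type. The function name, the data layout, and the sample hit counts are all hypothetical and are not CodeSuite output. The sketch also assumes that unique elements (0 hits) count as uncommon as well, since 0 is below the threshold.

```python
UNCOMMON_THRESHOLD = 25  # threshold used in the article; adjustable


def summarize(elements, threshold=UNCOMMON_THRESHOLD):
    """Tally elements per type from (element_type, hits) pairs.

    element_type is one of "str", "id", "stmt"; hits is the search hit count h.
    """
    summary = {}
    for element_type, hits in elements:
        counts = summary.setdefault(
            element_type, {"total": 0, "uncommon": 0, "unique": 0}
        )
        counts["total"] += 1          # contributes to SE(type)
        if hits < threshold:
            counts["uncommon"] += 1   # fewer than `threshold` hits
        if hits == 0:
            counts["unique"] += 1     # not found anywhere else
    return summary


# Hypothetical sample data, not taken from the OpenSudoku results.
sample = [("id", 0), ("id", 12), ("id", 4000),
          ("str", 24), ("str", 25), ("stmt", 0)]
result = summarize(sample)
print(result["id"])  # {'total': 3, 'uncommon': 2, 'unique': 1}
```

Passing a different `threshold` makes it easy to see how sensitive the uncommon counts are to the choice of 25 hits.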

