If someone stole your Android application, decompiled it, modified it, and sold it as their own, would you know? Now that most software is bought, sold, and downloaded online, this is becoming a growing problem. Smartphone apps and Facebook apps in particular can be very profitable, giving lots of reason to steal them and release them under a different title.
Applications written in traditional programming languages like C or C++ are compiled into ones and zeroes, but more modern programming languages like Java are compiled into bytecode, an intermediate form that contains more information about the program itself and is easier to decompile back to source code or reverse-engineer to understand how it works.
Also, the open nature of today’s platforms means that reverse engineering of applications is relatively easy, and many developers are concerned as applications similar to their own show up in the Android Market or Apple App Store. These developers want to know if their applications have been pirated. Fortunately, the same characteristics that make a smartphone app easy to reverse engineer and copy also provide opportunities for developers to compare downloaded applications to their own.
At Zeidman Consulting we have developed a process for comparing a developer’s application with a downloaded application, and we have defined an identifiability metric to quantify the degree to which an application can be identified by its bytecode or object code. Note that identifiability can be a positive or negative characteristic. A program that is easily identifiable after compilation may be easier to detect when it’s been pirated, even if it’s later modified. A program that’s difficult to identify after compilation may hide more of its trade secrets from reverse engineering and may be more difficult to duplicate.
Comparing and measuring . We decided to test our process on some popular Android games. Android applications consist of Java bytecode delivered in an Android Package file (APK), which is a compressed archive file. The bytecode for the application is found in the classes.dex file of the application APK. We decided to try two approaches for comparing the source code of the original app to the bytecode of the downloaded app: 1) compare the bytecode form of the downloaded app and 2) decompile the bytecode into source code and compare the decompiled source code form of the downloaded app (Table 1, below ).
Click on image to enlarge.
Table 1. Software source code elements.
Picking the right identifiability metric
We wanted a measure that shows how easily application code can be identified after it has been compiled. Some source code elements, including many identifiers and all strings, remain in bytecode after the source code has been compiled, while other source code elements such as statements and comments are usually removed during compilation.
An unfortunate limitation of CodeMatch is that it lumps strings and comments together because functionally these elements are similar, so we were constrained to consider them as one type of element. For determining identifiability it would be better to consider these two source code elements separately, but this is a limitation of the tool that we have to live with for now.
The measure of identifiability requires that we determine the percentage of source code elements that remain in an application’s bytecode. We also wanted to decompile bytecode back into source code and again determine the percentage of the source code elements from the original source code that can be found in the resulting decompiled source code.
Source code element metrics. To begin the identifiability measurement calculation, we need to know how many total elements exist in a program’s code and how many of those elements are uncommon. Uncommon elements are more helpful at determining the identifiability of the application because common elements, by definition, are found in many programs.
Obtaining these metrics for an application involved running two CodeSuite functions. A CodeMatch comparison of the application’s source code to itself gives a list of all source code elements in the application. There are three types of elements that we consider: comments and strings (str), identifiers (id), and statements (stmt).
The total number of source code elements of each type in a particular application is represented as SE(str), SE(id), and SE(stmt). SourceDetective is a CodeSuite function that takes all of the elements found by CodeMatch and searches for them on the Internet (“hits”), recording the number of times each is found. The Internet search hit count h is used to qualify the counts. In Table 2 below, these totals returned from CodeMatch and SourceDetective for the Android game OpenSudoku are shown. The numbers are taken from spreadsheets generated by CodeSuite.
In this case, elements with less than 25 hits were considered uncommon and good potential indicators of copying, and elements with 0 hits were considered unique. Future researchers may want to test a different threshold than 25 hits for labeling a source code element as uncommon, but this number worked well in these tests and in our experience. Obviously an element that cannot be found elsewhere through an Internet search (i.e., has 0 hits) is unique to that application.
Baseline for comparing bytecode.
Next another CodeSuite tool, BitMatch, was run to compare the application’s source code with its own bytecode file (classes.dex ). Then SourceDetective was run to generate the report of hits. This information is needed to create a baseline to quantify our likelihood of identifying copied code because we cannot expect better results comparing a downloaded application’s bytecode to the original application’s source code than when comparing the original application’s bytecode to its own source code.
Because CodeSuite provides these metrics by element type, it is valuable to define the identifiability metric by type as well as defining the total identifiability. The number of source code elements that are also found in the application’s bytecode are represented as BE(str), BE(id), and BE(stmt).
The bytecode identifiability IB is the number of elements that can be found in the application’s bytecode as a percentage of the total number of those elements in the application’s source code. The Internet search hit count h is then used to qualify both the element (BEh) and total (SEh) counts so that identifiability can be determined for elements with h or fewer hits. Source code elements with high hit counts do not help uniquely identify an application while those with low hit counts do.
The formulas for calculating the identifiability of an application’s bytecode are:
IBh(str) = BEh(str)/SEh(str)
IBh(id) = BEh(id)/SEh(id)
IBh(stmt) = BEh(stmt)/SEh(stmt)
Table 3 below shows the results for the analysis of the bytecode for the Android game OpenSudoku. As expected, because the bytecode does not include statements or comments from the source, the identifiability for statements was 0 and the comment/string identifiability comes only from strings.
However, the identifiability for identifiers was high, which means that if code was copied, the comparison of bytecode with source code is very likely to find the copying. In addition, the coverage of unique identifiers (~90%) means that the compiling and packaging process did not eliminate many unique identifiers.
Baseline for comparing decompiled source with original source.
We can measure the identifiability of the decompiled bytecode using the same methodology by comparing the application’s source code to the source code from its decompiled bytecode. To get source code from the bytecode dex file we used the JD-GUI decompiler. The decompiler works with either a Java archive (JAR) file or bytecode class file, so the free dex2jar utility was used to first generate a JAR file from the classes.dex file.
CodeMatch was used to compare the application’s original source code with its decompiled source code. Then SourceDetective was run and the decompiled code identifiability metrics were calculated. The number of source code elements that are also found in the application’s decompiled bytecode are represented as DE(str), DE(id), and DE(stmt) .
The identifiability ID is the number of elements that can be found in the application’s decompiled bytecode as a percentage of the total number of those elements in the application’s original source code. The Internet search hit count h is then used to qualify both the element (DEh) and total (SEh) counts so that identifiability can be determined for elements with h or fewer hits. Source code elements with high hit counts do not help uniquely identify an application while those with low hit counts do.
The formulas for calculating the identifiability of an application’s decompiled bytecode are:
IDh(str) = DEh(str)/SEh(str)
IDh(id) = DEh(id)/SEh(id)
IDh(stmt) = DEh(stmt)/SEh(stmt)
IDh = (DEh(str)+DEh(id)+DEh(stmt))/(SEh(str)+SEh(id)+SEh(stmt))
The surprising result seen in Table 4 below is that decompiling actually improved the total identifiability. Comparing an application’s source code with the source code that is decompiled from a suspect application’s bytecode appears to be a slightly better way to detect copying than to directly compare an application’s source code to a suspect application’s bytecode.
Clickon image to enlarge.
This is because bytecode decompilation produces source code statements that can then be compared, increasing the identifiability. And in general, being able to view source code will give you a better understanding of the context of any matching source code elements. The case study below can better illustrate how decompiling helps.
(Sidebar : deconstructing the decompilers )
Based on the results in the article it’s clear the decompiler did not fully recreate the source code, so we wondered how good are the compiler and the decompiled code?
Table 5 below shows the results of compiling the decompiled source code generated by the JD-GUI decompiler and the popular JAD decompiler for three different applications. These results illustrate that the decompile process often does not produce source code that can be compiled or executed.
Table 5: Evaluating decompilers by attempting to compile and run decompiled code
In this article I defined a process and an identifiability metric for a software application that can give a developer an idea of how easy or difficult it is to identify bytecode as being derived from an application’s source code. A high identifiability means it will be easier to detect that another application was derived from or copied from the application.
A low identifiability means that an application's trade secrets are better hidden. One surprising result was that it was slightly more effective to use decompiled bytecode rather than bytecode in the comparison. It seems that decompiling puts information back into the code that is in the bytecode but difficult to identify.
When one application is suspected to be a copy of another application, CodeSuite can be used to compare source code to decompiled bytecode to help determine whether copying occurred.
The identifiability of an application can be calculated to determine how well or how poorly to expect CodeSuite to perform when comparing the two applications. If you want to be able to determine when your code is stolen, a high identifiability is best. If you want to be able to hide your software trade secrets and protect your software copyrights, a low identifiability is best.
Bob Zeidman is a Senior Member of the IEEE, the president of Zeidman Consulting, and the president of Software Analysis and Forensic Engineering. Among his publications are technical papers on hardware and software design methods as well as four textbooks – The Software IP Detective’s Handbook, Designing with FPGAs and CPLDs, Verilog Designer's Library, and Introduction to Verilog. He has taught courses at engineering conferences throughout the world. Bob holds fifteen patents. He earned a master's degree in electrical engineering at Stanford University and bachelor's degrees in physics and electrical engineering at Cornell University.
This article is based on research done by Larry Melling at Zeidman Consulting. The author would like to thank Larry as well as Jim Zamiska and Nik Baer, also of Zeidman Consulting, for their detailed reviews and their in-depth discussions about the metrics and methodologies developed in this paper.
 Altova Site, “DiffDog – XML-aware diff merge tool for file, folder, directory, and database differencing,” http://www.altova.com/diffdog/diff-merge-tool.html, retrieved September 16, 2011.
 Android Developer Site, “Application Resources,” http://developer.android.com/guide/topics/resources/index.html, retrieved September 16, 2011.
 Android Developer Site, “The AndroidManifest.xml file,” http://developer.android.com/guide/topics/manifest/manifest-intro.html, retrieved September 16, 2011.
 Ciancarini, Paolo and Favini, Gian Piero (2009). “Plagiarism detection in game-playing software,” Orlando, FL: ICFDG. April 26-30, 2009.
 Google Code Site, Andoku, http://code.google.com/r/ardorleo-p-andoku, retrieved April 17, 2012.
 Google Code Site, Android-Apktool, “A tool for reengineering Android apk files,” http://code.google.com/p/android-apktool, retrieved September 16, 2011.
 Google Code Site, dex2jar, “A tool for converting Android's .dex format to Java's .class format,” http://code.google.com/p/dex2jar, retrieved September 16, 2011.
 Google Code Site, OpenSudoku-Android, “Sudoku for android,” http://code.google.com/p/opensudoku-android, retrieved September 16, 2011.
 Hornshaw, Phil (2011). “Game developers struggle with piracy, malware in Android Market,” Appolicious Advisor, March 18, 2011.
 Java Decompiler, “Yet another fast Java decompiler,” http://java.decompiler.free.fr, retrieved September 15, 2011.
 Kalinovsky, Alex (2004). Covert Java: Techniques for Decompiling, Patching, and Reverse Engineering, Sams Publishing. May 13, 2004.
 Melling, Larry and Zeidman, Bob (2012), “Comparing Android Applications to Find Copying,” Journal of Digital Forensics, Security and Law, Vol. 7 (2).
 Paller, Gabor (2009). “Understanding the Dalvik bytecode with the Dedexer tool,” December 2, 2009.
 Schulman, Andrew (2005). “Finding Binary Clones with Opstrings & Function Digests: Part I,” Dr. Dobbs Journal,
 Schulman, Andrew (2005). Finding Binary Clones with Opstrings & Function Digests: Part II,” Dr. Dobbs Journal,
 Schulman, Andrew (2005). Finding Binary Clones with Opstrings & Function Digests: Part III,” Dr. Dobbs Journal,
 Software Analysis & Forensic Engineering Corporation, “CodeSuite User’s Guide,” retrieved September 15, 2011.
 Software Analysis & Forensic Engineering Corporation, “Our Process,” retrieved September 16, 2011.
 Varaneckas, Toma. “JAD Java Decompiler Download Mirror,” http://www.varaneckas.com/jad, retrieved September 16, 2011.
 Zeidman, Bob (2011). The Software IP Detective’s Handbook, Westford, MA: Prentice Hall, 2011.
 Zeidman, R. (2006). ‘Software Source Code Correlation’. 5th IEEE/ACIS International Conference on Computer and Information Science and 1st IEEE/ACIS International Workshop on Component-Based Software Engineering, Software Architecture and Reuse (ICIS-COMSAR'06). August 10-12/2006. Honolulu, HI.
 Zeidman, R. (2008). ‘Multidimensional Correlation of Software Source Code’, Third International Workshop on Systematic Approaches to Digital Forensic Engineering. 5/22/2008. Oakland, CA.