The art of reverse engineering ANSI-C legacy code

November 02, 2012

USA Firmware-November 02, 2012

ANSI-C legacy code can look ominous with hundreds--perhaps thousands--of functions and data types. Using some simple techniques, you can tackle what looks like an insurmountable task.

Take a deep breath, relax. It's not as bad as it might look. Yes, I know you would have done it differently. I mean look at all those, dare I say it, global variables? And did this guy ever think about commenting his work? How about a data flow diagram or two? Come on, throw a colleague a bone for crying out loud!

We've all seen it and will continue to see it. Perhaps the original design team's intentions were good but the schedule got in the way of using an up-front approach. Requirements were poorly defined. The boss was breathing down their necks to get moving! Of course we can't forget the occasional mastermind who decided to toy with you by making the code as wickedly obfuscated as possible--bwahaha!

Clearly many tools are out there to debug, de-obfuscate, prettify, visualize, and comment code. When reverse engineering any legacy code, those tools will make your learning curve shorter. Consider all your coding tools as tactical elements in support of the strategy I'm about to suggest.

So let's give this a shot.

Step 1: Start with the data and not the functions.
Step 2: Discover and document the public interface.
Step 3: Create data flow diagrams.
Step 4: Create functional flow diagrams.
Step 5: Rinse and recycle.

In this blog, I will cover Steps 1 and 2. Steps 3 through 5 are "left as an exercise to the student." Unless of course, the students wish it to be the exercise of the blogger.

Step 1: Start with the data and not the functions. Start with the most important structures, constants, macros and variables. The typical mistake people make is to start at main and just pound through the code. Sounds fun until you start tripping over all the data and don't really understand the flow.

My recommendation is to move the data into one location for quick review. You could do this with a simple copy/paste exercise, a tool like Excel or Visio, or just pencil and paper. (Wait, make that graph paper! Don't you love graph paper? I can't get enough of the stuff. Clearly I'm dating myself.) Moving right along!
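If you would rather keep the inventory next to the code than in a spreadsheet, one option is a small lookup table. This is only a sketch: the table layout, the `Inv...` helper names, and every file name and note in it are invented for illustration; only `DEVICE_ENUM` and `RANGE_ENUM` come from the example later in this post.

```c
#include <stddef.h>
#include <string.h>

/* Categories named in Step 1: structures/types, constants, macros, variables. */
typedef enum {
    KIND_TYPE,
    KIND_CONSTANT,
    KIND_MACRO,
    KIND_VARIABLE
} DATA_KIND_ENUM;

/* One row of the inventory: what the item is and where it lives. */
typedef struct {
    const char     *pszName;  /* identifier as it appears in the legacy code */
    DATA_KIND_ENUM  eKind;    /* which category it belongs to */
    const char     *pszFile;  /* file that defines it */
    const char     *pszNote;  /* what you have learned about it so far */
} DATA_ITEM_STRUCT;

/* Hypothetical entries: every file name and note below is an assumption. */
static const DATA_ITEM_STRUCT g_astInventory[] = {
    { "DEVICE_ENUM",        KIND_TYPE,     "Measurements.h", "device types that can be measured" },
    { "RANGE_ENUM",         KIND_TYPE,     "Measurements.h", "ranges available for a device" },
    { "MAX_DEVICES",        KIND_MACRO,    "Config.h",       "upper bound on attached devices" },
    { "g_fLastMeasurement", KIND_VARIABLE, "Measurements.c", "global; who writes it?" }
};

#define INVENTORY_COUNT (sizeof g_astInventory / sizeof g_astInventory[0])

/* Count how many inventoried items fall into one category. */
size_t InvCountByKind(DATA_KIND_ENUM eKind)
{
    size_t uCount = 0;
    size_t i;

    for (i = 0; i < INVENTORY_COUNT; i++) {
        if (g_astInventory[i].eKind == eKind) {
            uCount++;
        }
    }
    return uCount;
}

/* Look an item up by name; returns NULL when it has not been inventoried. */
const DATA_ITEM_STRUCT *InvFind(const char *pszName)
{
    size_t i;

    for (i = 0; i < INVENTORY_COUNT; i++) {
        if (strcmp(g_astInventory[i].pszName, pszName) == 0) {
            return &g_astInventory[i];
        }
    }
    return NULL;
}
```

A call such as `InvFind("RANGE_ENUM")` hands back the row with the file and your notes, which is roughly what you would look up on the graph paper.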

Step 2: Discover and document the public interface. I'm assuming we're dealing with ANSI-C in this step. All code, no matter how poorly constructed, has a public interface. Your public interface lies within the functions that header files declare for use by other modules.

Here's one method for uncovering the firmware's public interface. Remember, the linker is your friend. Okay, most of the time; sometimes it's your worst enemy. But in our case, we are navigating a released work. Rename a header file, recompile, and link and… Bazinga! Out will fall part of your public interface: the build errors show which source files depend on that header, and the declarations inside it are the functions those files rely on.
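To see why the header is where the public interface shows up, it helps to look at how C separates public from private. The sketch below squeezes a header and its source file into one listing. The enumerators, the raw counts, the scale factors, and the helper `MdScaleForRange` are all hypothetical; only `MdGetMeasurement` and the two type names come from the example later in this post.

```c
/* ---- Would live in Measurements.h: the public interface ---- */

/* Enumerator names are assumptions made for this sketch. */
typedef enum { DEVICE_VOLTMETER, DEVICE_AMMETER } DEVICE_ENUM;
typedef enum { RANGE_LOW, RANGE_HIGH } RANGE_ENUM;

/* External linkage: other files call this, so it is part of the interface. */
float MdGetMeasurement(DEVICE_ENUM eDevice, RANGE_ENUM eRange);

/* ---- Would live in Measurements.c: the implementation ---- */

#define DEVICE_TOTAL 2

/* 'static' gives internal linkage: nothing outside this file can see these. */
static const int s_aiRawCounts[DEVICE_TOTAL] = { 100, 250 };  /* made-up readings */

static float MdScaleForRange(RANGE_ENUM eRange)
{
    /* Made-up scale factors, chosen only so the sketch has something to do. */
    return (eRange == RANGE_HIGH) ? 2.0f : 0.5f;
}

float MdGetMeasurement(DEVICE_ENUM eDevice, RANGE_ENUM eRange)
{
    /* Guard the table lookup against an out-of-range device value. */
    if ((int)eDevice < 0 || (int)eDevice >= DEVICE_TOTAL) {
        return 0.0f;
    }
    return (float)s_aiRawCounts[eDevice] * MdScaleForRange(eRange);
}
```

In this sketch, renaming Measurements.h would break every file that includes it, and each of those files is a client of `MdGetMeasurement`. `MdScaleForRange` and `s_aiRawCounts` never show up that way because they are `static`, so you can set them aside until you dig into the file itself.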

Yes, you need to document these functions. If there's anything I've found that engineers hate to do, it's creating documentation. Case in point: my son is an electrical engineering student. The back of his favorite T-shirt reads "I am an engineer" spelled wrong four times, followed by the words "I LIKE MATH."

With this in mind, you can always use Doxygen to kick-start this effort. Doxygen is open source and can auto-generate documentation for all your data and functions in both a flat file format and HTML. I still see this more as an aid and less as a solution, since you learn little from auto-generated documentation. Retention will come more readily if you do it yourself.

For each function, you should specify the file it is defined in, the function declaration, a definition of what the function does, and the variables both passed and returned. Here is an example.

Declaration:

 float MdGetMeasurement(DEVICE_ENUM eDevice, RANGE_ENUM eRange);

Parameters:
1. DEVICE_ENUM eDevice: An enumeration of the type of devices for which a measurement is a characteristic.
2. RANGE_ENUM eRange: An enumeration of the available ranges for the device being measured.

Return:
1. A copy of the measured value in floating-point notation.

Definition:
This function performs all the necessary actions to retrieve the measurement of device type eDevice on the range eRange and return said measurement to the caller as a floating-point number of unspecified units.  

File:
Measurements.c
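If you write the record by hand and still want Doxygen to publish it, the same information fits in a comment block above the declaration. Treat this as a sketch rather than the project's real header: the enumerator names and the placeholder function body are assumptions added so the listing compiles; `@brief`, `@param`, and `@return` are standard Doxygen commands.

```c
/* Hypothetical enumerators; only the type names appear in the record above. */
typedef enum { DEVICE_VOLTMETER, DEVICE_AMMETER } DEVICE_ENUM;
typedef enum { RANGE_LOW, RANGE_HIGH } RANGE_ENUM;

/**
 * @brief Retrieve the measurement of a device on a given range.
 *
 * Performs all the necessary actions to retrieve the measurement of device
 * type eDevice on the range eRange. Defined in Measurements.c.
 *
 * @param eDevice  Type of device for which a measurement is a characteristic.
 * @param eRange   One of the available ranges for the device being measured.
 * @return A copy of the measured value as a floating-point number of
 *         unspecified units.
 */
float MdGetMeasurement(DEVICE_ENUM eDevice, RANGE_ENUM eRange);

/* Placeholder body so the sketch compiles; the real logic is not shown here. */
float MdGetMeasurement(DEVICE_ENUM eDevice, RANGE_ENUM eRange)
{
    (void)eDevice;  /* unused in the placeholder */
    (void)eRange;   /* unused in the placeholder */
    return 0.0f;
}
```

Writing the comment yourself keeps the learning benefit, and Doxygen only does the formatting.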

An important diversion to leave you with. As you can see I'm using verbose variable naming throughout my example. This is an extremely important good practice! I would argue that an important part of any design review should be the review of all the names. I would go so far as to say a measurable and significant chunk of time can be lost on a project simply because of poorly named types, constants, variables, structures, and functions. Bet you've been there and done that. See you in the trenches.

Robert Scaccia is a firmware consultant and president of USA Firmware, LLC. He is very active with a large network of firmware consultants throughout the United States and Canada, is chair of the IEEE Cleveland Computer Society, and founded the largest regional, as well as international, firmware group on LinkedIn. Email him at bob.scaccia@usafirmware.com, or visit www.firmwareplanet.com or www.usafirmware.com.
