(Editor’s Note: Li Mei is a character in a non-fiction book by Lisa Simone about the adventures of a fictional team of software developers working on various projects and the lessons they learned.
In the book, Li Mei develops a List of Debugging Secrets while debugging her own mysteries, acting as a sounding board for her teammates as they struggle with their debugging challenges, and listening to everyone’s “Lessons Learned” at the team’s regular status meetings. The action plan she develops includes:
1. Gather the Facts – Learn all you can before diving in.
2. Classify the Symptoms – Characterize the bug based on the facts.
3. Brainstorm Root Causes – Identify what could cause each symptom.
4. Understand the System – Target where to search based on possible root causes.
5. Hypothesize, Test and Verify – Be logical and methodical – nail the bug!
What follows here is the same action plan – but filled in with truisms she experienced – “Resist the Urge!”, “Think with your brain, not your debugger.” – and specific guidelines for writing better code.
It concludes with a list of symptoms and bugs gathered from the individual mysteries the team solved. Use it as a reference for your own work, and as a starting point for your own list of hard-earned Debugging Secrets.)
Step #1: Gather the Facts
* Interview problem reporter.
* Interview anyone who saw the system fail.
* Observe the system behavior – find out what is normal for the system.
* Isolate relevant facts about product, customer, environment, hardware/software, materials used, priority, safety.
* Reproduce the problem if possible.
* Realize bug report descriptions can be misleading.
* Be wary of assumptions when gathering facts – identify inputs and symptoms only. Draw no conclusions yet.
Step #2: Classify the Symptoms
* One-time repeatable events – occurrence has a pattern (e.g., only at startup, different behavior the first time through a function or feature).
* Periodic events – recur regularly or occur every time (e.g., tied to a timer, interrupt, repeated calls to a function/feature, HW or SW heartbeat).
* Sporadic events – seemingly random failures (e.g., boundary condition violations, parameters that change with state, loop counter limits, ranges in logicals, unexpected input/output conditions, unhandled error conditions, faulty logic, hardware, timing, memory corruption, performance issues).
Step #3: Brainstorm (Initial Root Causes, and When You’re “Stuck”)
* Dream what could cause each symptom. The bug’s location in the code can sometimes be determined before looking at the software.
* Identify the set of inputs that causes unacceptable behavior, then find what must be changed to make the behavior acceptable.
* Create a truth table to classify inputs (user, hardware, software, configuration, etc.) and resulting behaviors.
* Patiently watch the system’s behavior.
* Periodically stop and summarize findings.
* Find a sounding board (doesn’t have to be a live person!) – talk through ideas to quickly identify good ideas and discard bad ones.
* Consult the gurus.
* Talk to internal groups (engineering, testing, marketing, etc.) and external groups (vendors, customers, beta testers, etc.).
* Go home. Or somewhere else. Let your brain chew at the problem while you’re doing something else.
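The truth-table idea above can be sketched in C. This is only an illustration under stated assumptions: the three inputs (`IN_COLD_START`, `IN_LOW_BATTERY`, `IN_USB_ATTACHED`) and the helper `inputs_common_to_failures` are hypothetical names, not from the book. Each row records which inputs were active and whether the system misbehaved, and the helper reports which inputs were present in every failing row.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs, one bit each; substitute the real ones. */
#define IN_COLD_START   (1u << 0)
#define IN_LOW_BATTERY  (1u << 1)
#define IN_USB_ATTACHED (1u << 2)
#define ALL_INPUTS      (IN_COLD_START | IN_LOW_BATTERY | IN_USB_ATTACHED)

/* One row of the truth table: which inputs were active, and what happened. */
typedef struct {
    unsigned inputs;
    bool     failed;
} observation_t;

/* Return the inputs that were active in every failing row.
 * These are candidates to investigate, not proof of a root cause.
 * Returns 0 when no row failed. */
unsigned inputs_common_to_failures(const observation_t *rows, size_t count)
{
    unsigned common = ALL_INPUTS;
    bool any_failure = false;

    assert(rows != NULL || count == 0u);
    for (size_t i = 0; i < count; i++) {
        if (rows[i].failed) {
            common &= rows[i].inputs;
            any_failure = true;
        }
    }
    return any_failure ? common : 0u;
}
```

Filling the table from observed facts only, before drawing conclusions, keeps this consistent with Step #1.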
Step #4: Understand the System (hardware, software, mechanics)
* Focus on understanding main() first for overall program flow – don’t get lost in the details.
* Divide the code into logical chunks based on structure and flow.
* Use visual aids (flowcharts, graphs, function call trees) to show functional elements (blocks) and program control logic (connectors). This reveals what the program does, reveals structure and testing, and identifies missing logical and functional elements.
* Play Computer. Doggedly step through the code line-by-line because sequential logic performed by a computer does not always match human assumptions. (Trace assembly language noting the contents of registers and stack pointer.)
* Debugging tools allow simple timing characterizations without stopping the program execution.
* When reverse-engineering code, check off functions (e.g., know where you’ve been, identify unused functions).
* Remember sometimes the comments are wrong.
Step #5: Hypothesize, Test and Verify
* Decide exactly what information you would like to get from the embedded system, and choose the best tool accordingly.
* Consider nontraditional and low tech debugging methods (e.g., auditory, pin wiggling).
* Simple methods like pin wiggling can be powerful and unobtrusive.
* Auditory cues work because the sense of hearing can discern fine differences in tone and rhythm; a tone can also be used as a heartbeat or a code coverage flag.
* Use black-box testing to fully explore behavior without looking inside the device/software.
* Hypothesize what should happen in order to choose which variables to watch.
* Apply stressors to induce rare bugs to occur more frequently (more/larger inputs, increased loading, larger memory allocations, faster timing, etc.).
* Use breakpoints configured as watch points to check when values of variables and memory locations change without stopping code execution until the exact conditions you specify are met.
* Use patterned memory (e.g., DEADBEEF, 0x55) to verify memory operations and to identify memory overruns.
* Test the bug fix with the same methods originally used to induce the bug.
* “Resist the Urge!” to jump blindly into the software listings.
* Think with your brain, not your debugger.
* Just because it compiles doesn’t mean it works.
* Be clever.
* Don’t worry about what other people think. Just worry about getting it fixed.
* Don’t make assumptions about how something was implemented (e.g., in hardware versus in software). Look at the evidence and the documentation.
* Commercial tools can contain bugs; if a tool does something strange, suspect the tool.
* When you feel overwhelmed, take things one step at a time.
* Randomly changing lines of code is not an effective debugging technique!
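The patterned-memory tip in this step can be sketched as follows. This is a minimal illustration, not the book's implementation: the type `guarded_buffer_t`, the helper names, and the buffer size are assumptions. Guard words holding 0xDEADBEEF sit on either side of a payload pre-filled with 0x55; a changed guard word suggests an overrun, and the number of payload bytes still holding 0x55 shows how much of the buffer was ever written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_PATTERN 0xDEADBEEFu
#define FILL_PATTERN  0x55u
#define PAYLOAD_SIZE  16u   /* hypothetical size for illustration */

typedef struct {
    uint32_t guard_before;
    uint8_t  payload[PAYLOAD_SIZE];
    uint32_t guard_after;
} guarded_buffer_t;

/* Write the known patterns before the buffer is put into use. */
void guarded_init(guarded_buffer_t *b)
{
    assert(b != NULL);
    b->guard_before = GUARD_PATTERN;
    memset(b->payload, FILL_PATTERN, PAYLOAD_SIZE);
    b->guard_after = GUARD_PATTERN;
}

/* True while both guard words still hold the pattern. */
bool guarded_intact(const guarded_buffer_t *b)
{
    assert(b != NULL);
    return b->guard_before == GUARD_PATTERN &&
           b->guard_after == GUARD_PATTERN;
}

/* Count bytes at the end of the payload that still hold the fill pattern. */
size_t guarded_bytes_untouched(const guarded_buffer_t *b)
{
    size_t untouched = 0;

    assert(b != NULL);
    while (untouched < PAYLOAD_SIZE &&
           b->payload[PAYLOAD_SIZE - 1u - untouched] == FILL_PATTERN) {
        untouched++;
    }
    return untouched;
}
```

Calling `guarded_intact` periodically, or from a watch point handler, narrows down when the overrun happens. Legitimate data that happens to equal the fill pattern will make the untouched count look larger than it is.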
Making Better Software
* Use programming elements that are appropriate for the function (e.g., switch for unrelated discrete items and if-statements for continuous variables).
* Make code self-documenting with descriptive names, #defines, and comments.
* Consistent tab spacing and white space make code more readable.
* Sometimes hardware-dependent software is unavoidable; if so, document heavily.
* Coding standards are a good source of bug types and causes, and also a good place to search for possible fixes!
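As a small illustration of the first two guidelines above, the sketch below uses a switch for unrelated discrete commands and if-statements for a continuous temperature range, with descriptive names and #defines in place of bare numbers. All names and values here (`command_t`, the timeouts, the temperature limits) are hypothetical and chosen only for the example.

```c
/* Hypothetical discrete commands: unrelated items suit a switch. */
typedef enum { CMD_START, CMD_STOP, CMD_RESET } command_t;

#define START_TIMEOUT_MS  500u
#define STOP_TIMEOUT_MS   100u
#define RESET_TIMEOUT_MS 2000u

unsigned command_timeout_ms(command_t cmd)
{
    switch (cmd) {
    case CMD_START: return START_TIMEOUT_MS;
    case CMD_STOP:  return STOP_TIMEOUT_MS;
    case CMD_RESET: return RESET_TIMEOUT_MS;
    default:        return 0u;   /* unknown command: no timeout */
    }
}

/* Hypothetical limits: a continuous variable suits range checks. */
#define TEMP_LOW_LIMIT_C   0
#define TEMP_HIGH_LIMIT_C 70

typedef enum { TEMP_TOO_COLD, TEMP_OK, TEMP_TOO_HOT } temp_status_t;

temp_status_t classify_temperature(int temp_c)
{
    if (temp_c < TEMP_LOW_LIMIT_C) {
        return TEMP_TOO_COLD;
    }
    if (temp_c > TEMP_HIGH_LIMIT_C) {
        return TEMP_TOO_HOT;
    }
    return TEMP_OK;   /* both limits are inclusive */
}
```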
Mystery-Specific Symptoms and Bugs
* Initialize all variables. Don’t assume the compiler will do it for you.
* Avoid using hard-coded numbers. If a #define or enum is available, use it. If not, create one with a descriptive name only if you are SURE what it does.
* Make sure unsigned variables will not be used to store signed quantities.
* Suspicious indexes into arrays often cause off-by-one or rollover errors.
* Reduce overhead when transmitting data to conserve battery power.
* Sending more than one data sample at a time and/or compressing data in the message payload can radically increase the life of battery-operated devices.
* Remember to turn off debugging code before shipment!
* Test scripts should use existing functions to duplicate normal device operation. If a test-only function must be created, document what it does and why it is there.
* Inlining code can improve processing speed.
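Several of the items above (initialize variables, watch array indexes, keep signed quantities out of unsigned variables) can be seen in one short sketch. The function names and `SAMPLE_COUNT` are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_COUNT 4u   /* hypothetical ring buffer length */

/* Sum samples from last to first. A countdown written with an unsigned
 * index and the test "i >= 0" never terminates, because the index rolls
 * over instead of going negative. Testing "i > 0" and indexing with
 * i - 1 avoids both the rollover and the off-by-one. */
int32_t sum_samples_reverse(const int16_t *samples, size_t count)
{
    int32_t sum = 0;   /* initialized explicitly, not left to the compiler */

    assert(samples != NULL || count == 0u);
    for (size_t i = count; i > 0u; i--) {
        sum += samples[i - 1u];
    }
    return sum;
}

/* Advance a ring buffer index with an explicit wrap, so it can never
 * reach SAMPLE_COUNT. */
size_t next_ring_index(size_t index)
{
    return (index + 1u) % SAMPLE_COUNT;
}

/* Shows what a signed value turns into when stored in an unsigned
 * variable: the conversion wraps modulo 65536. */
uint16_t stored_as_unsigned(int16_t value)
{
    return (uint16_t)value;
}
```

For example, `stored_as_unsigned(-3)` yields 65533, which is the kind of value that later shows up as a "huge" reading in a bug report.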
Hardware and Timing
* Document any underlying hardware assumptions.
* Any software changes to a hardware interface or external device should be verified against the timing requirements for the device.
* Writing to RAM on external devices takes longer than writing to RAM on the onboard device. From software, hardware can look like a variable – don’t treat it like one.
* When controlling motors, use absolute (rather than relative) motor position references for known, repeating activities. Send the motor home regularly.
* Don’t assume the motor does what you tell it. Check it.
* For proportional errors (twice as fast, off by 3, etc.) check if data are periodically missed, two entities expect data at different rates, or a configuration setting is bad.
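The advice that hardware can look like a variable but should not be treated like one, and that commands should be checked rather than assumed, can be sketched as below. This is a hedged example: `STATUS_READY_BIT` and both helper functions are hypothetical, and register addresses are passed in as pointers so the code does not assume any particular device. The `volatile` qualifier tells the compiler that every read and write must actually happen, and the poll loop is bounded so a dead device cannot hang the software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STATUS_READY_BIT 0x01u   /* hypothetical bit position */

/* Poll a status register until the ready bit is set, giving up after
 * max_polls reads. Without volatile, the compiler may read the register
 * once and spin on a stale copy. */
bool wait_until_ready(const volatile uint32_t *status_reg, uint32_t max_polls)
{
    assert(status_reg != NULL);
    for (uint32_t i = 0; i < max_polls; i++) {
        if ((*status_reg & STATUS_READY_BIT) != 0u) {
            return true;
        }
    }
    return false;   /* timed out: report it rather than assume success */
}

/* Write a value, then read it back to confirm the device accepted it.
 * Assumes the register is readable; write-only registers need another
 * form of confirmation. */
bool write_and_verify(volatile uint32_t *reg, uint32_t value)
{
    assert(reg != NULL);
    *reg = value;
    return *reg == value;
}
```

On a real target the poll count would be derived from the device's documented timing requirements rather than picked arbitrarily.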
Interrupt Service Routines (ISRs)
* Make sure only time-critical functions are included in Interrupt Service Routines. Verify the duty cycle.
* Split functions or use flags to remove noncritical code.
* Only turn interrupts off for a very short time; otherwise, latencies increase.
Logic
* When using flags for signaling, ensure the entities check the state first to avoid missing information.
* RTOSes can allow priority inversion if tasks of different priority levels can access the same common system resources (e.g., memory).
* Check logic to ensure counting variables are bounded.
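A minimal sketch of flag-based signaling between an ISR and the main loop follows, combining the points above about keeping ISRs short, not missing information, and bounding counters. The names (`sample_ready_isr`, `take_pending_sample`, `MAX_PENDING`) are hypothetical, and the interrupt disable/enable calls are shown only as comments because they are platform-specific. A counter is used instead of a single boolean so that two events arriving before the main loop runs are not collapsed into one.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_PENDING 4u   /* hypothetical bound on queued events */

static volatile uint8_t  pending_samples = 0u;   /* ISR increments, main loop decrements */
static volatile uint32_t dropped_samples = 0u;   /* events lost because the bound was hit */

/* Hypothetical ISR body: only the time-critical bookkeeping happens here. */
void sample_ready_isr(void)
{
    if (pending_samples < MAX_PENDING) {
        pending_samples++;
    } else {
        dropped_samples++;   /* saturate instead of rolling over */
    }
}

/* Called from the main loop. Returns true if there was an event to
 * process; the slow, noncritical work is then done outside the ISR. */
bool take_pending_sample(void)
{
    bool taken = false;

    /* disable_interrupts();  platform-specific, kept as short as possible */
    if (pending_samples > 0u) {
        pending_samples--;
        taken = true;
    }
    /* enable_interrupts(); */
    return taken;
}

/* Lets test or diagnostic code see whether events were lost. */
uint32_t dropped_sample_count(void)
{
    return dropped_samples;
}
```

A nonzero dropped count is itself a useful symptom: it suggests the main loop is not keeping up with the interrupt rate.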
Memory and Stack
* When something works for a while before breaking, suspect memory problems like boundary condition violations.
* Ensure variables stored in two locations (like RAM and FLASH) are synchronized.
* Perform stack analysis early and often to understand memory usage. When stack size is limited, or when user variables are allowed to share RAM with the stack, stack overflow can cause catastrophic and unpredictable results.
* Suspect stack corruption when problems manifest in deeply nested functions or when a lot of data is passed between functions.
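One common way to do the stack analysis described above is to paint the stack region with a known pattern at startup and later count how much of the pattern survives. The sketch below is an illustration under assumptions, not the book's method: the function names and the 0xA5 pattern are hypothetical, the region is passed in as an ordinary array, and the stack is assumed to grow downward from the end of the region toward index 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN 0xA5u   /* hypothetical paint value */

/* Fill the whole region with the pattern before the stack is used. */
void stack_paint(uint8_t *region, size_t size)
{
    assert(region != NULL || size == 0u);
    for (size_t i = 0; i < size; i++) {
        region[i] = STACK_FILL_PATTERN;
    }
}

/* Return the peak number of bytes used so far. Bytes at the low end of
 * the region that still hold the pattern are counted as never used. */
size_t stack_high_water_mark(const uint8_t *region, size_t size)
{
    size_t untouched = 0;

    assert(region != NULL || size == 0u);
    while (untouched < size && region[untouched] == STACK_FILL_PATTERN) {
        untouched++;
    }
    return size - untouched;
}
```

If the high-water mark ever equals the region size, the stack has probably overflowed into whatever lies beyond it. Stack data that happens to equal the pattern at the boundary makes the result slightly optimistic, so leave some margin.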
Lisa Simone is a technology and embedded systems expert in several fields including medical, consumer, wireless telecommunications, industrial automation and human performance assessment. She has designed embedded devices and led international teams from pure research and product concept through delivery.
She has received federal and state grant funding for rehabilitation and wearable systems, and developed and taught university level engineering design courses. She is also a published author in several areas. She has authored peer-reviewed engineering and medical articles and published a fiction book of mysteries. She is also the author of several blogs.
This list of programming tips is based on material – printed with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2007 – from “If I only changed the software, why is the phone on fire?” by Lisa Simone. For more information about this title and other similar books, please visit www.elsevierdirect.com.