A Feynman approach to debugging - Embedded.com

A Feynman approach to debugging


As a child, Richard Feynman could fix radios by thinking. You can take the same approach to debugging embedded systems.

As a child, Richard Feynman, scientist, author and Nobel laureate, was asked to fix a radio that emitted a terrible cacophony when it was first turned on. As he paced back and forth without touching the radio, trying to figure out what could cause such a loud noise, it dawned on him that the amplifier tubes were warming up at different rates before a radio signal was available. To avoid amplifying noise, he simply rearranged the tubes and the radio started up perfectly with no noise. The radio's owner was amazed that “he fixes radios by thinking!”

Effectively debugging embedded systems requires an analytical thinking approach that's not generally taught after the student learns a new programming language. These skills are often more valuable than straight programming skills, especially when projects contain a significant amount of legacy code or the original author is no longer available for consultation. Acquiring these analytical skills comes from experience: learning to think logically about the symptoms and recognizing a bug that has bitten you previously.

In this article, you'll be challenged to solve a real problem using some simple guidelines.

The scenario
You're an embedded systems debugging contractor, and you receive an e-mail describing your next assignment. You're to investigate a problem at the Industrial Enclosures Company (IEC), and the e-mail contains a brief overview with few concrete details:

“On Tuesday afternoon, a small oven on IEC's manufacturing line malfunctioned. Rather than heating components to a predefined temperature, it got too hot and damaged the components. The manufacturing line was stopped for the remainder of the day.”

The e-mail further relays that the oven started to work correctly this morning, but the management cannot accept any further down-time of the manufacturing line without compromising a delivery deadline to a customer. The production manager will be calling you with more details, and tomorrow morning you're expected to report to IEC.

The action plan
Whether you're familiar with the code or not, the steps to isolate and identify the bug are largely the same. The process is iterative: exploring different sources of information, generating (more) questions, summarizing what you know, and brainstorming possible root causes. Several common sources of information are:

  • The problem report and background information on the system in question
  • Whether the problem report is correct—how was this information collected?
  • The person who reported the problem and other knowledgeable parties
  • Observing the system (running correctly or reproducing the problem if possible)
  • Understanding what is “normal” and what are critical system operational parameters
  • Understanding the algorithm or method used
  • The software listing
  • Debugging the system (running correctly or reproducing the problem if possible)

We'll use these sources as logical steps to guide the debugging process.

Background and report
Later on Wednesday afternoon you receive a call from Sophie, who introduces herself as the production manager of IEC. She gives you a brief overview of the company. “Industrial Enclosures manufactures custom plastic enclosures for different types of industrial equipment. We produce more than 10,000 enclosures of several different types each year. The manufacturing line that failed yesterday is used to assemble enclosures with air-tight compartments.”

Sophie explains that a special material is used to construct these compartments, “The manufacturer of this component material has specified its temperature characteristics—I e-mailed the specifications to you. The operating range is between 117ºF and 165ºF, and the recommended nominal operating temperature is 126ºF. The manufacturer also warns that the structural properties of the material are compromised when it's heated over 189ºF.” Sophie's e-mail contains the manufacturer's specifications, as shown in Table 1.

Table 1: Temperature characteristics for component material

| Temperature | Component limits |
|---|---|
| 117ºF | Minimum temperature |
| 126ºF | Nominal temperature |
| 165ºF | Maximum temperature |
| 189ºF | Material loses structural integrity |

Sophie continues, “We use a small oven specially designed to heat the components to 126ºF so that it makes a proper seal with the underlying structures. However, the oven didn't shut off when it reached 126ºF and it heated the components to the point that they were unusable. The oven has a microcontroller to control the temperature but the engineer who designed the oven and the software is no longer with us. I'm counting on you to identify the cause of this failure.”

Sophie instructs you to meet with BJ when you arrive the next day. Your job is to find out what went wrong and to correct the problem before it happens again.

Before you go further, it's a good idea to understand the normal operation of the failing system and think about ways it could fail. This brainstorming technique can provide useful questions for your interviews. What is the system supposed to do? How might it have been designed? What hardware or software could be involved?

A basic system requires a heating element, a temperature sensor, and an algorithm to turn the heater on and off. If the measured temperature is below the operating point (which appears to be 126ºF), the heater is turned on. When the temperature reaches 126ºF the heater is turned off. If the temperature is above 126ºF, the heater does not turn on. The actual algorithm may be more complicated than this, but these are basic required elements. With these system basics in mind, some obvious points of failure are listed in Table 2.

Table 2: Some common failure points for temperature-control systems

| Element | Possible failure |
|---|---|
| Heating element | Element failed |
| | Element displaced |
| Temperature sensor | Sensor failed |
| | Sensor displaced |
| Temperature control algorithm | Set to wrong temperature control point/calibration |
| | Temperature-sensing hardware failed |
| Heater ON/OFF control | Heating element driver (solid state relay, for example) failed |
| | Heating logic signal incorrect |
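
The basic on/off scheme described before Table 2 can be sketched in a few lines of C. This is only an illustration of the simplest possible design, not the oven's actual code: the function name, the use of degrees Fahrenheit as the unit, and the set-point range check are all assumptions made for the example. The limits come from Table 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Operating range of the component material, from Table 1. */
#define MATERIAL_MIN_DEGF 117
#define MATERIAL_MAX_DEGF 165

/* Hypothetical helper: returns true when the heater should be ON. */
bool heater_should_be_on(int measured_degF, int set_point_degF)
{
    /* Guard against a mis-set control point, one of the failure
     * modes listed in Table 2. */
    assert(set_point_degF >= MATERIAL_MIN_DEGF &&
           set_point_degF <= MATERIAL_MAX_DEGF);

    /* Below the set point: heat. At or above it: stay off. */
    return measured_degF < set_point_degF;
}
```

Even in a sketch this small, each failure point in Table 2 maps to something concrete: a bad sensor corrupts measured_degF, a bad calibration corrupts set_point_degF, and a bad driver ignores the return value.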

Interview witnesses
Sophie told you all she knew about the problem, but without more information it's difficult to determine what really happened. BJ is your next source of information as the person operating the machine. On Thursday morning you introduce yourself to BJ on the manufacturing floor and then get to work by asking for all he knows about the machine and the failure.

BJ explains, “The oven normally takes a couple of minutes to heat the material to the proper temperature, and when the component is hot enough, the oven automatically turns off, and the component moves out of the heating area. I noticed that heating started to take a lot longer, almost five minutes. When I looked inside, the oven was very hot so I hit the emergency shutoff. When I picked up the component with tongs, the tongs made imprints in the material. It's not supposed to do that. It looks like the oven didn't turn off when it was supposed to. I have been running this machine for about nine months so I know what it's supposed to do.” BJ hands you one of the overheated components from Tuesday. It looks slightly deformed when compared to a properly heated component.

After BJ has provided all the information he can think of, don't walk away! Keep asking questions to tease out more information. Is the problem reproducible? Does it happen all the time?

“Well, for about an hour on Tuesday afternoon the oven continued to fail, and we had to shut the line down for the day. Wednesday morning it started working again and worked without failure the entire day.”

Listen to BJ's words and refine your questions, “BJ, you said that the oven 'continued to fail.' Does this mean it overheated the component every single time or just every now and then?” BJ replies that after the first failure, the next two were also overheated. Since the components are not cheap, the line was shut down.

“BJ, did you do anything differently the next morning that might have fixed the problem?” BJ replies that he just started the machine for a trial run and that it worked fine the first time, so they restarted production. It hasn't failed since.

Continue to ask questions about the machine, including the failure points you identified earlier. Even though the machine appears to be running correctly now, don't discount these common sources of failures.

“Is it possible the heating element failed or moved inside the oven?”

“No, the heater is solidly mounted and pretty rugged.”

“What about the temperature sensor? They're generally more fragile. Could it have moved too far away and measured a lower temperature? That could cause the heater to overheat.”

“I replaced the permanently mounted temperature sensor and the oven still overheated.”

“Has the oven ever overheated before when it was first turned on?”

“Well, once when we had a problem calibrating the temperature sensor the oven did overheat that time. I checked the oven temperature this morning with a different temperature probe, and it's within one degree of normal.”

“Is there a temperature set-point knob or any other adjustments on the machine that could have been moved? Any recent maintenance on the machine?”

“No maintenance lately. Here's the temperature set reading; it's set to 126ºF.”

“BJ, I'm not sure what hardware controls the heater but something like a solid-state relay. Could that be bad?”

“Well, it's possible, but it appears to be working now. If the problem happens again we can check it.”

“Was any new software installed on the machine?”

“No.”

“Did someone else operate the machine?”

“No.”

Observe the system
Since the manufacturing line is now working, BJ offers to give you an example of normal operation. As he has described, the next component is inserted into the oven. After a short time the oven shuts off and the component is ejected from the oven. BJ picks up the heated component and inserts it into a larger assembly. As he works, you begin to review all that you have learned about the problem.

At this point, classify the symptoms. Experience can help you predict the root cause of a problem by correctly characterizing its symptoms. While space limits a full exploration into cause and effect, consider these categories and try to classify the symptoms you've documented.

One-time repeatable events
These symptoms occur once, but have a pattern to their occurrence. They might occur only at power-up or the first time through a function or feature. Or, a function may work correctly the first time but fail all subsequent times.

Periodic events
These symptoms occur several times in a somewhat repeatable manner.

Sporadic events
These may happen once in a hundred tries or so randomly that it's hard to relate the occurrence to the software.

How would you classify the oven symptoms? The oven failures are best described as sporadic events because the oven stopped working and then inexplicably resumed normal operation the next morning and hasn't failed since. Sporadic events are harder to find and fix because we design systems to behave in a repeatable manner. When they don't, some of our assumptions are generally incorrect. Some root causes of sporadic events are:

  • Violation of boundary conditions in the software
  • Unexpected input or output conditions (software, hardware, or material)
  • Unhandled error conditions or faulty logic
  • Logic based on time-of-day
  • Memory corruption
  • Performance issues
  • Intermittent electrical or mechanical connections; impending component failure

Analyze the symptoms
With these types of root causes in mind, can you think of anything special about Tuesday's failure that might have caused the problem?

  • Failure occurred in the afternoon—does the system have a real-time clock?
  • Was a different material used for the components?
  • If the software hit some kind of error condition that affected the heating algorithm, did turning the machine off and on again solve the problem?

As a result of this brainstorming, you have a few more questions for BJ. He tells you he power-cycled the machine and it didn't fix the problem. The components are all made of the same material.

We're not quite ready to jump into the code listings yet. Try to resist that urge just a bit longer! This up-front analysis can actually reduce the time you spend randomly digging through software because it enables us to perform a more directed and methodical search.

Remember our initial brainstorming about how the system may have been designed? Think about the system elements again, focusing on less obvious causes of the failure.

  • The A/D converter subsystem failed, causing inaccurate temperature readings
  • The logic to determine if the oven has reached 126ºF is faulty
  • The output control signal that turns the heater off failed
  • Temperature or control variables not initialized or possibly corrupted or overwritten

Target your software search
Searching the software for words like “temperature” and “oven,” you find several references that appear relevant. These code fragments are shown online at ftp://ftp.embedded.com/pub/2004/12simone.

  • periodic_timer()
  • read_actual_temperature_A2D() and read_reference_temperature_A2D()
  • calculate_new_oven_ON_time()
  • oven_ON_time_control_routine()

You also learn that the processor is an 8-bit microcontroller, the program is written in C with some assembly code, it has a simple floating-point math function for small numbers, and it has one interrupt-service routine. We'll assume that system performance and resources aren't overloaded in any way.

Look at each of the routines and try to decipher what they do, assuming that the comments, function names, and variables are not misleading.

The periodic_timer() function is a good place to start since it'll tell us what happens and how often it happens. A routine that controls the oven on-time is called every 10ms, and new temperature values are read every second and used to adjust the oven control.
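
The scheduling described here can be sketched as follows. The function names are the ones found in the search above, but their signatures, the order of the calls, and the stub bodies (which only count how often each routine runs) are assumptions for illustration; the real listing is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

#define TICKS_PER_SECOND 100u   /* 100 ticks x 10 ms = 1 s */

/* Stub counters so the call pattern can be observed. */
static unsigned pwm_calls, a2d_reads, recalcs;

static void oven_ON_time_control_routine(void)   { pwm_calls++; }
static void read_actual_temperature_A2D(void)    { a2d_reads++; }
static void read_reference_temperature_A2D(void) { a2d_reads++; }
static void calculate_new_oven_ON_time(void)     { recalcs++; }

/* Assumed to be called from the timer interrupt every 10 ms. */
void periodic_timer(void)
{
    static uint8_t tick_count = 0;

    /* Invariant: the tick counter never leaves the range 0..99. */
    assert(tick_count < TICKS_PER_SECOND);

    oven_ON_time_control_routine();          /* every 10 ms */

    if (++tick_count >= TICKS_PER_SECOND) {  /* once per second */
        tick_count = 0;
        read_actual_temperature_A2D();
        read_reference_temperature_A2D();
        calculate_new_oven_ON_time();
    }
}
```

After 250 calls (2.5 s of simulated time), the control routine has run 250 times, while the temperatures have been read and the on-time recalculated twice.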

Next, look at oven_ON_time_control_routine() to understand the actual control of the oven heater. A variable with the name of oven_pulse_width_counter is incremented each time this function is called, and it's reset to 0 after it reaches 100. The oven is turned on when the counter is less than ovenPW and is turned off all other times. Do you recognize that this control is pulse width modulation? The heating cycle is always 1s (100 function calls x 10ms/function call), and the longer the oven heater is on during this one second, the hotter the oven can get and the more rapidly it can heat up. The variable ovenPW is the duty cycle of this signal and also represents the percentage of time that the oven is on during each one-second interval. A sample picture of this signal is shown in Figure 1 where ovenPW is equal to 33 and the oven is on 33% of the time.


Figure 1: Pulse width modulation of the oven on signal

Something should bother you about this function. Can oven_pulse_width_counter ever reach a value of 101? If this occurred, would the oven turn off properly? Checking counters this way is dangerous because any other function in the program could set this variable beyond 100. If this occurred, the heater wouldn't turn on again until variable oven_pulse_width_counter had been incremented all the way to 255 and then rolled over to 0. (How would you change the function to ensure that this never happened? A simple solution is to change the logic to check for values above as well as equal to the limit.)
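
To make the concern concrete, here is a minimal sketch of both versions of the counter logic. It is not the oven's actual listing: the function names, the pointer-based interface, and the 8-bit unsigned counter are assumptions chosen to match the description above. The helper counts how many 10ms ticks pass before the heater next turns on when the counter starts from a corrupted value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PWM_PERIOD_TICKS 100u   /* 100 calls x 10 ms = 1 s */

/* Fragile: resets only on exact equality with the limit. */
static bool pwm_tick_fragile(uint8_t *counter, uint8_t oven_pw)
{
    (*counter)++;                       /* wraps 255 -> 0 */
    if (*counter == PWM_PERIOD_TICKS) {
        *counter = 0;
    }
    return *counter < oven_pw;          /* true = heater ON */
}

/* Hardened: also resets when the counter is above the limit. */
static bool pwm_tick_hardened(uint8_t *counter, uint8_t oven_pw)
{
    (*counter)++;
    if (*counter >= PWM_PERIOD_TICKS) {
        *counter = 0;
    }
    return *counter < oven_pw;
}

/* Ticks until the heater next turns on from a given counter value. */
static unsigned ticks_until_heater_on(bool (*tick)(uint8_t *, uint8_t),
                                      uint8_t start, uint8_t oven_pw)
{
    assert(oven_pw > 0 && oven_pw <= PWM_PERIOD_TICKS);
    uint8_t counter = start;
    unsigned ticks = 0;
    do {
        ticks++;
    } while (!tick(&counter, oven_pw) && ticks < 1000u);
    return ticks;
}
```

With the counter corrupted to 101 and ovenPW at 33, the fragile version keeps the heater off for 155 ticks (about 1.55s) while the counter climbs to 255 and rolls over. The hardened version recovers on the very next tick.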

How is the oven on-time calculated? The function calculate_new_oven_ON_time() contains a math calculation and some boundary checking. Combining the equations and substituting the #defines, the ovenPW becomes:

ovenPW = 17 + 2.7 × delta_temperature

First, a delta-temperature value is computed using a reference temperature that might represent a calibration point or a nominal temperature, but we don't know yet. The delta value is then used to calculate ovenPW using a linear equation with a slope of 2.7 and an intercept of 17. After the ovenPW value is computed, it's truncated to within the range of 0 to 100. This boundary checking confirms our suspicion that the ovenPW value is a percentage; the oven on times will range between 0% and 100% of each second.
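
As a rough sketch of this calculation, assuming delta_temperature is the actual A/D reading minus the reference reading and that the result is truncated to a whole percentage, the logic might look like the following. The function name, parameter types, and constant names are hypothetical. The sketch also applies the plain linear equation to negative deltas, which the description above does not confirm.

```c
#include <assert.h>
#include <stdint.h>

#define OVEN_PW_INTERCEPT  17.0f   /* duty cycle at delta = 0 */
#define OVEN_PW_SLOPE       2.7f   /* percent per A/D count   */
#define OVEN_PW_MIN         0
#define OVEN_PW_MAX       100

/* Hypothetical helper: duty cycle in percent from two A/D readings. */
uint8_t calculate_oven_pw(int actual_counts, int reference_counts)
{
    int delta_temperature = actual_counts - reference_counts;
    float raw = OVEN_PW_INTERCEPT
              + OVEN_PW_SLOPE * (float)delta_temperature;

    /* Clamp before narrowing so out-of-range values cannot wrap. */
    if (raw < (float)OVEN_PW_MIN) { raw = (float)OVEN_PW_MIN; }
    if (raw > (float)OVEN_PW_MAX) { raw = (float)OVEN_PW_MAX; }

    uint8_t oven_pw = (uint8_t)raw;     /* drops the fraction */
    assert(oven_pw <= OVEN_PW_MAX);
    return oven_pw;
}
```

Clamping while the value is still a float is a deliberate choice in this sketch: converting an out-of-range value to an 8-bit type first and checking afterwards is exactly the kind of boundary violation that produces sporadic failures.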

The software should raise additional questions that could be useful to your investigation:

  • The ovenPW increases if the actual temperature is greater than the reference temperature; is this logic backwards? Since we don't know how the temperature circuit was designed, we can't answer that question now, but keep it in mind for later.
  • Does the intercept value represent the required ovenPW for the nominal temperature of 126ºF? You might plot the equation to help visualize how the system works, as shown in Figure 2. The intercept occurs where delta_temperature is equal to 0.


Figure 2: Linear equation relating delta temperature and ovenPW

Summarize targeted search
After reviewing your notes over lunch, you conclude that the oven is controlled by a digital (on/off) signal and that the temperature is controlled using pulse-width modulation. The on-time is computed from the actual temperature and a reference temperature, and the nominal on pulse width is 17%, which most likely corresponds to the nominal temperature of 126ºF. You've also identified more questions.

Debug and observe
With your new understanding of the oven-heating algorithm, it's time to monitor the system in operation and test some of the hypotheses. Several techniques can give you visibility into the code as the system is running. Let's assume you have a debugger or monitor and can access the value of software variables in real time.

Choose to monitor variables that will allow you to answer your outstanding questions. Logical choices would be ovenPW, actual_temp_A2D_counts, and reference_temp_A2D_counts. These will allow you to verify proper operation of the A/D subsystem and the calculation of the oven pulse width. You could also include Heater_output_pin to verify that the digital heater output signal is correct.

It's now well after lunchtime so you head back out to the factory floor to run some tests. Sophie has provided a laptop computer with a real-time monitor for the oven microcontroller. After you set up the debugging environment, BJ starts the line and the first component enters the oven. BJ has placed a temperature sensor on the component in the oven so you can record the actual temperature at the surface of the component. As the oven turns on, you begin recording data as shown in Table 3 and plotted in Figure 3.


Figure 3: Measured A/D values and resultant ovenPW

Table 3: Data collection trial 1

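The first four columns are measured values; the last two are calculated values.

| actual_temp_A2D_counts [A/D counts] | reference_temp_A2D_counts [A/D counts] | Heater_output_pin (1=ON, 0=OFF) | Independently measured temperature [ºF] | delta_temperature [A/D counts] | calculated ovenPW [%] |
|---|---|---|---|---|---|
| 85 | 70 | 1 | 81 | 15 | 57 |
| 83 | 70 | 1 | 87 | 13 | 52 |
| 81 | 70 | 1 | 93 | 11 | 46 |
| 79 | 70 | 1 | 99 | 9 | 41 |
| 77 | 70 | 1 | 105 | 7 | 35 |
| 75 | 70 | 1 | 111 | 5 | 30 |
| 73 | 70 | 1 | 117 | 3 | 25 |
| 71 | 70 | 1 | 123 | 1 | 19 |
| 69 | 70 | 0 | 129 | -1 | 17 |
| 69 | 70 | 0 | 130 | -1 | 17 |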

Immediately you learn several things from these data:

  • The variable reference_temp_A2D_counts is always equal to 70. This corresponds to an ovenPW of 17 and a temperature of 126ºF, confirming that this value represents the nominal operating temperature of the material in raw A/D counts. This value is the temperature set point.
  • You can also confirm that the raw A/D values do decrease for increasing temperature values, so that mystery is solved. The ovenPW starts out at a duty cycle of 57% (oven is on for 570ms of each second). As the component heats up, the ovenPW falls to the nominal value of 17% and shuts off.
  • The first value of ovenPW is 57% and this corresponds to 81ºF. BJ agrees that the ambient temperature in the building is about 81ºF, so this boundary condition is consistent.

You can tentatively eliminate several possible root causes, such as A/D converter subsystem failure, wrong heater output signal, and bad up-to-temperature logic. This experiment has confirmed your understanding of the system but you feel no closer to an answer than you did this morning.

Debug by thinking
Review again the possible root causes for sporadic events. We haven't yet explored boundary conditions very well, or unexpected input conditions. Is it really possible for the oven to be turned on 100% of the time?

Working the ovenPW math backwards, you determine that an initial temperature of 33ºF (101 A/D counts) corresponds to an ovenPW of 100%. So it is possible for the heater to turn on fully if the material starts around freezing, and maybe the oven somehow got stuck at 100%. You're skeptical but ask BJ anyway.
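
The backward calculation can be checked with a short sketch. The conversion from A/D counts to temperature is an assumption fitted to the data in Table 3, where the reading drops 2 counts for every 6ºF rise (3ºF per count) and 70 counts corresponds to 126ºF. The function and constant names are hypothetical.

```c
#include <assert.h>

#define REFERENCE_COUNTS   70   /* set point in A/D counts (Table 3)  */
#define SET_POINT_DEGF    126   /* nominal temperature (Table 1)      */
#define DEGF_PER_COUNT      3   /* assumed: linear fit to Table 3     */
#define INTERCEPT_PERCENT  17   /* ovenPW at delta_temperature = 0    */
#define SLOPE_TENTHS       27   /* 2.7 percent per count, in tenths   */

/* Convert raw A/D counts to ºF; counts fall as temperature rises. */
int counts_to_degF(int counts)
{
    assert(counts >= 0);        /* A/D readings are never negative */
    return SET_POINT_DEGF - DEGF_PER_COUNT * (counts - REFERENCE_COUNTS);
}

/* Smallest A/D reading at which 17 + 2.7 x delta reaches 100%. */
int counts_for_full_duty(void)
{
    int needed_tenths = (100 - INTERCEPT_PERCENT) * 10;   /* 830 */
    /* Integer ceiling of 830 / 27, which is 31 counts. */
    int delta = (needed_tenths + SLOPE_TENTHS - 1) / SLOPE_TENTHS;
    return REFERENCE_COUNTS + delta;
}
```

Under these assumptions, counts_for_full_duty() returns 101 and counts_to_degF(101) returns 33, matching the figures above. The same conversion maps the first row of Table 3 (85 counts) to 81ºF.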

“Hey BJ, was the material really cold—below freezing—the afternoon the oven failed? If so, the oven will turn on 100%.”

BJ chuckles, “Not likely on a summer af
