Margin



The video made my jaw drop. Flight 292’s nosegear was cocked sideways, twisted 90 degrees from its normal position. As the pilot began his approach I wanted to turn away but was glued to the screen in fascinated horror. The wheel touched down, smoked, burst into flame and the tire tore away, nothing but metal grinding along the runway.

Astonishingly, the strut held fast.

Those seconds summed up a lot of the nature of engineering. The strut held fast under loads that far exceed anything the plane experiences in normal operation. Engineers designed enough margin into the system to handle almost unimaginable and unanticipated forces.

It also shows the human side of the mechanical world. This failure had apparently been experienced before by other Airbus 320s. Something went wrong with the system used by the air industry to eliminate known defects.

I’m struck by the difference between failures in mechanical systems and those in computer programs. Software is topsy-turvy. Mechanical engineers can beef up a strut to add margin, to handle unexpected loads. EEs specify components heftier than needed and wires that can take more current than anticipated. They handle surges with fuses and weak links.

In software if just one bit out of hundreds of millions is wrong the application completely crashes. Margin is difficult, perhaps impossible, to add. Exception handlers can serve as analogs to fuses, but they’re notoriously hard to test and generally have a bug rate far higher than that of the application.

Worse, we write code with the assumption that everything will work and there won’t be any unexpected inputs. So buffer overflows are rampant. This complacent attitude isn’t exclusive to desktop developers; after a software error destroyed Ariane 5 the review board cited a culture that assumed software can’t fail. If it works in test it will work forever.

A plane, bridge and dare I say levee must have a reliability vanishingly close to 100%. So mechanical engineers design a structure that takes 110% or 150% of expected loads.

Many software apps require just as much reliability. But we can’t add margin, so must build code that’s 99.999% correct or better.

Yet humans aren’t good at perfection. In school a 90% is an “A”. If our code earned an “A,” a million line-of-code program would have 100,000 errors.

Software is inherently fragile. We can, and must, add great exception handlers and use the very best methods to produce correct code. But until we find a way to make code that is more robust than the environment it’s in, the elusive goal of perfection is our only hope.

What do you think? How can we add design margin to code?

Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. Contact him at . His website is .

Good points. There is an analog to margin in software, but remember, adding margin is an engineering judgment. The aircraft designer had to trade strength vs. weight in designing that strut.

I'm designing some array classes for a scientific application. Do I build in bounds checking? There's a cost vs. benefit there. When doing a large matrix multiply, the checking is a large cost. The benefit is only at the edges.

So I use the preprocessor to enable checking during the development stage and disable it in the “release” version.
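A minimal sketch of that kind of switchable checking, written in C rather than as the array classes described above. The `Matrix` type, the `MATRIX_CHECK` macro and the `MATRIX_NO_BOUNDS_CHECK` flag are hypothetical names chosen for illustration, not the commenter's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical matrix type, for illustration only. */
typedef struct {
    size_t rows, cols;
    double *data;               /* row-major, rows * cols elements */
} Matrix;

/* Define MATRIX_NO_BOUNDS_CHECK in a "release" build to compile the
   checks out entirely; leave it undefined during development. */
#ifdef MATRIX_NO_BOUNDS_CHECK
#define MATRIX_CHECK(cond) ((void)0)
#else
#define MATRIX_CHECK(cond) assert(cond)
#endif

/* Read one element, checking the indices only when checks are enabled. */
static double matrix_get(const Matrix *m, size_t r, size_t c)
{
    MATRIX_CHECK(r < m->rows && c < m->cols);
    return m->data[r * m->cols + c];
}

/* Write one element, with the same optional index check. */
static void matrix_set(Matrix *m, size_t r, size_t c, double v)
{
    MATRIX_CHECK(r < m->rows && c < m->cols);
    m->data[r * m->cols + c] = v;
}
```

With checking enabled, an out-of-range index stops the program at the faulty access instead of silently corrupting memory. With the flag defined, the inner loop of a matrix multiply pays nothing for the check. Note that `assert` is itself disabled when `NDEBUG` is defined, so the two switches interact.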

I guess that is akin to using the strong landing strut during flight testing and then replacing it with something weaker when we're “sure” of the load it will have to bear. That might not be the approach the passengers of Flight 292 would endorse!

I'm sure there are a lot of other examples of margin in software development that we are conditioned to ignore in the name of efficiency. Food for thought.

– Jay Trischman

I worked in the nuclear industry for many years developing firmware for safety-critical systems. I learned techniques to write code with margin. For example, we developed coding standards that did not allow logic that depends on single-bit changes. Definitions of TRUE and FALSE were bit patterns. All logic used variables that were compared to bit patterns. Single-bit variables were not allowed. Logic comparisons had to be used consistently. For example,

#define TRUE 0xAA
#define FALSE 0x55

unsigned char hi_trip_condition; /* unsigned, so 0xAA compares equal where plain char is signed */

if (hi_trip_condition == TRUE) { } else { }

All other compares to “hi_trip_condition” had to be compared against TRUE. The following was not allowed in the rest of the code:

if (hi_trip_condition == FALSE) { } else { }

since if there was a bit error and “hi_trip_condition” was neither TRUE nor FALSE, an indeterminate situation could occur. When all the conditional statements used the same compare, the logic would always produce a deterministic result when a bit error occurred in RAM. In essence, this allows for some fault tolerance in the code.

When an engineer was designing the code, he/she would consider which conditional to use, since he/she could bias the logic so that it always produced a certain result when an error occurred.
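One way to picture that biasing is the sketch below. The `Action` enum, the `evaluate` function and the `no_trip_condition` flag are hypothetical names invented for illustration, and the choice of tripping as the safe outcome is an assumption. It uses an unsigned 8-bit type so the 0xAA pattern is representable everywhere.

```c
#include <stdint.h>

#define TRUE  0xAA   /* 1010 1010 */
#define FALSE 0x55   /* 0101 0101: every bit differs from TRUE */

/* Hypothetical outcomes, named for illustration only. */
typedef enum { ACTION_RUN, ACTION_TRIP } Action;

/* Every compare in the program tests against TRUE, never against FALSE.
   The flag is phrased so that the else branch is the safe outcome:
   FALSE and any corrupted pattern both end up tripping. */
static Action evaluate(uint8_t no_trip_condition)
{
    if (no_trip_condition == TRUE) {
        return ACTION_RUN;    /* only the exact TRUE pattern keeps running */
    } else {
        return ACTION_TRIP;   /* FALSE or a damaged value: fail safe */
    }
}
```

Because TRUE and FALSE differ in all eight bits, a single flipped bit can never turn one into the other. A damaged flag always lands in the else branch, so the designer decides in advance which way the logic fails by choosing what the flag means.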

– Sheridan Kooyers

You’ve raised a very interesting topic. However, using today’s methodologies, it’s not really fair to compare computer programming with a mathematical discipline such as engineering. There is, unfortunately, little mathematical basis for these methodologies, and it’s sort of an apples and oranges comparison.

Having said that, methodologies have indeed come a long way in improving program quality, but, as you point out, there’s no way to ensure that we build 10, 20, or 50% of margin into our programs, quantitatively speaking. Most of these techniques are qualitative in nature.

Indeed, much research has been performed in the area of provably correct computing. This is a mathematically based programming discipline. However, I don’t believe that such research has made much headway in the mainstream, and may in fact not be ready for prime time. I’m not up on current research in this area, but from what I recall, it confirms your assertion about interrupts being a prime cause of program failure. Perhaps we need to start with a different hardware approach.

Thanks for making me think about this stuff again; it’s been a while!

P.S. In case you’re not aware of this, 90% is not universally used as an “A”. In Fairfax County, an “A” is 94%. Proving yet again that Virginia is better than Maryland. :-)

– Michael Sobel

I saw your column on design margins in hardware and software and may have some relevant feedback.

I think we are talking about two different types of problems here. The first type is when software fails under load. Due to performance limitations, a program may fail to meet its real-time deadlines, causing system failure. This type of problem can at least be predicted using simulation and modeling techniques such as rate monotonic scheduling and software performance engineering. One can build in safety margins by using faster hardware in this case, much like hardware engineers design extra capacity into their systems.
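As a rough illustration of predicting this first type of failure, the sketch below applies one well-known sufficient test for rate monotonic scheduling, the hyperbolic bound: a set of independent periodic tasks with deadlines equal to their periods is schedulable if the product of (U_i + 1) over all tasks is at most 2, where U_i is execution time divided by period. The `Task` type, the function name and the `speedup` parameter are hypothetical, and modeling faster hardware as a uniform division of execution times is a simplifying assumption.

```c
#include <stddef.h>

/* Hypothetical task description, for illustration only. */
typedef struct {
    double wcet;     /* worst-case execution time on the baseline CPU */
    double period;   /* activation period, same time unit as wcet */
} Task;

/* Sufficient (not necessary) rate monotonic test using the hyperbolic
   bound. Returns 1 if the task set is guaranteed schedulable on a CPU
   that is `speedup` times faster than the baseline, 0 if the test
   cannot guarantee it. */
static int rm_schedulable(const Task *tasks, size_t n, double speedup)
{
    double product = 1.0;
    for (size_t i = 0; i < n; i++) {
        double u = (tasks[i].wcet / speedup) / tasks[i].period;
        product *= u + 1.0;
    }
    return product <= 2.0;
}
```

A task set that fails the test at `speedup` 1.0 but passes at 2.0 shows the hardware margin in numbers: the further the product sits below 2, the more room there is for execution times to grow before deadlines are at risk.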

The second type is when a coding or design error causes a system failure because the program fails to meet its specification. There are a number of techniques in the literature for addressing these quality problems, but only one of them involves adding a design margin. Here are three that come to mind, but there are others.

Software Reliability Engineering, as described by John Musa: a set of operational profiles is developed after a criticality analysis. The identified safety-critical functions are tested and verified much more thoroughly than less critical parts of the system.

Formal verification techniques: the use of formal mathematical techniques to verify that the specifications, design, and code of critical modules are correct. Praxis Critical Systems in the U.K. is very good at this.

Software redundancy: have three separate teams code a critical module independently and include all versions in the deliverable product. The final system gets results from all the modules and compares them. The result given by the majority of the modules is used. This is the design margin – we are betting that at least two out of three are correct and didn’t hit a common bug. This may triple the cost of the critical module. Whether the technique is effective or not is a matter of dispute.
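The comparison step of the redundancy approach can be sketched as a two-out-of-three voter. The names `vote3` and `VoteStatus` are hypothetical, and the sketch assumes each independently written module returns an integer result that can be compared for exact equality.

```c
#include <stdint.h>

/* Hypothetical status codes, for illustration only. */
typedef enum { VOTE_OK, VOTE_NO_MAJORITY } VoteStatus;

/* Two-out-of-three majority vote over the results of three
   independently written versions of the same module. On VOTE_OK the
   agreed value is stored in *result; on VOTE_NO_MAJORITY *result is
   left untouched and the caller must fall back to a safe state. */
static VoteStatus vote3(int32_t a, int32_t b, int32_t c, int32_t *result)
{
    if (a == b || a == c) {
        *result = a;
        return VOTE_OK;
    }
    if (b == c) {
        *result = b;
        return VOTE_OK;
    }
    return VOTE_NO_MAJORITY;
}
```

The voter masks one wrong answer but cannot detect a bug that two versions share, which is the common-bug bet described above. Results that are floating-point or only approximately equal would need a tolerance-based compare instead.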

The problem with software is that a single error in an untested, critical execution path is the Achilles heel of the entire system. Software Reliability Engineering tries to ensure that the critical paths are all tested and verified. The use of formal methods tries to ensure that the critical paths have no errors. Redundancy tries to circumvent the errors with multiple implementations. All of them assume that the specification is correct to begin with, which may not be the case.

It still boils down to using an effective development process that is tailored to the problem being solved.

– Steve Hanka

I firmly believe we can do a much better job building firmware systems that present some margin but, as in the case you pointed out with Flight 292’s nosegear, the designer could add some extra weight and, let’s not forget, he also had some spare time to think about it and calculate!

I remember years ago when I was assigned to change the assembly code of an automotive application which had 6 out of 2048 bytes left in flash code memory. The car maker required us to add 5 small but new features and said that to do so we could take out 4 other features (as if it were that easy).

After sweating for two weeks we made it, and you know what? At the time some people went berserk when I suggested not using more than 90% of flash and RAM in future applications, to create some margin for solving any future problems – the exact same margin you talked about. (Needless to say, I don't work in that place anymore!)

– Anonymous
