A number of vendors have introduced commercial OSs that are
advertised as "high availability" with "99.999% uptime," which is
often referred to as "five nines." From the way the term five nines
is used, vendors imply that their products are highly reliable. But
just how reliable are these products in safety-critical
applications?
It's true that these expressions sound reliable. After
all, they mean that the system is down for no more than about 5
minutes per year. That lets vendors claim that a high-availability
OS is suitable for safety-critical systems such as aircraft and
medical equipment. But high availability is not the same as "always
available" or "highest availability." The term 99.999% uptime
states plainly that the system will be unavailable for 1/100,000th
of the time, and this level of availability may not be adequate for
systems in which failure is unacceptable.
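The "5 minutes per year" figure follows directly from the percentage. As a rough sketch of that arithmetic in Python (the function name is illustrative and assumes a 365-day year):

```python
# Convert an availability percentage into an annual downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes per year a system may be down at this availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for percent in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(percent)
    print(f"{percent}% uptime -> {minutes:.2f} minutes down per year")
```

For 99.999% this works out to roughly 5.26 minutes of permitted downtime per year.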
Many aerospace, medical, automotive, and telecommunications
customers find a 1-in-100,000 downtime unacceptable. If, for
instance, the RTOS that runs the avionics of a fly-by-wire
commercial airliner suddenly becomes unavailable when the airliner
is 100 feet over the runway, the pilot may lose control of the
flaps and the throttle with catastrophic results. There are
about 100,000 commercial airline flights each day. If each flight had
a 1-in-100,000 chance of meeting such an outage at a critical moment,
a 99.999% uptime RTOS would average out to one commercial airliner
crash per day, which is clearly unacceptable for aircraft flight
systems.
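The estimate above treats each flight as having a 1-in-100,000 chance of coinciding with an outage. Under that simplifying assumption, which is a back-of-envelope model and not a real failure analysis, the expected count can be sketched in Python (names are illustrative):

```python
# Back-of-envelope estimate: each flight is assumed to have a
# 1-in-100,000 chance of meeting an RTOS outage at a critical moment.
FLIGHTS_PER_DAY = 100_000     # figure quoted in the text
UNAVAILABLE_ONE_IN = 100_000  # 99.999% uptime: down 1 part in 100,000


def expected_incidents_per_day(events_per_day: int, one_in: int) -> float:
    """Expected number of daily events that coincide with an outage."""
    return events_per_day / one_in


print(expected_incidents_per_day(FLIGHTS_PER_DAY, UNAVAILABLE_ONE_IN))  # 1.0
```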
The average commuter uses the microprocessor-controlled
anti-lock brakes in his or her car about 100 times per day. If the
high-availability RTOS that controls the anti-lock brakes of a car
is unavailable, the car may rear-end the car in front of it, fail
to make the curve in the road, or sail through a red light into
cross traffic. With more than 100 million commuters in the United
States, each hitting the brakes 100 times per day, that is 10 billion
brake applications per day. If 1 in 100,000 of them met an outage, a
high-availability RTOS could result in 100,000 product liability
lawsuits per day.
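The 100,000 figure comes from multiplying out the numbers in this paragraph. A minimal Python sketch, assuming (as the estimate does) that 1 in 100,000 brake applications coincides with an outage:

```python
# Figures quoted in the text; the 1-in-100,000 rate per brake
# application is the paragraph's simplifying assumption.
COMMUTERS = 100_000_000
BRAKE_USES_PER_COMMUTER_PER_DAY = 100
UNAVAILABLE_ONE_IN = 100_000

brake_uses_per_day = COMMUTERS * BRAKE_USES_PER_COMMUTER_PER_DAY
affected_per_day = brake_uses_per_day // UNAVAILABLE_ONE_IN

print(f"{brake_uses_per_day:,} brake applications per day")  # 10,000,000,000
print(f"{affected_per_day:,} coincide with an outage")       # 100,000
```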
The range of such safety-critical applications goes far beyond
the obvious (e.g., aircraft, automobiles, trains, medical equipment,
military equipment, nuclear power plants). If industrial control
equipment malfunctions, it can kill or maim factory workers. If a
telecommunications switch fails, an entire city can lose telephone
service, causing lifesaving 911 calls to be lost. If traffic
signals malfunction, cars will collide in intersections. If power
distribution systems fail, the resulting blackout will cause
traffic signals, telephone switches, and emergency response systems
to fail.
Even non-safety-critical consumer products are expected to work
more than 99.999% of the time. Most televisions, refrigerators,
clocks, VCRs, set top boxes, video games, stoves, microwave ovens,
remote controls, thermostats, hair dryers, electric toothbrushes,
and light switches run for years without even 1 minute of
downtime. Developers of these products wouldn't be happy to find
that adding a high-availability RTOS reduces reliability while
increasing expense. Nor would they be willing to look their
customers (let alone a jury) in the eye and say they knowingly used
an RTOS that is advertised to have a predicted failure rate of 1 in
100,000.
The adequacy of five nines OSs in safety-critical systems may be
facing a showdown with the Federal Aviation Administration (FAA).
Some vendors of five nines RTOSs claim that they are working on
getting FAA DO-178B Level A Flight Certification for documentation,
reliability, and testing. But some of these RTOSs may be too
complex to ever meet this certification standard. No
flight-critical application should accept this amount of
unreliability.
So, is five nines reliable enough? We at Green Hills think not,
and have designed our Integrity OS for a much higher level of
reliability.