
Metastability and Firmware

Metastability is not just a social disease. It could hamper your software's ability to read good data from hardware.

Last month I discussed the general problem of making software that reads asynchronous hardware reliable. Some very simple situations, like a timer that uses an interrupt service routine, can result in rare but quite serious faults. Whenever a physical input to the computer requires more than one I/O read, and keeps changing while we read it, there's a chance the data will be corrupt.

Suppose a robot uses a 10-bit encoder to monitor the angular location of a wrist joint. As the wrist rotates, the encoder sends back a binary code 10 bits wide, representing the joint's current position. An 8-bit processor requires two distinct I/O instructions (two byte-wide reads) to get the data. No matter how fast the computer might be, there's a finite time between the reads during which the encoder data may change.

The wrist is rotating. A get_position routine reads 0xff from the low part of the position data. Then, before the next instruction, the encoder rolls over to 0x100. get_position reads the high part of the data-now 0x1-and returns a position of 0x1ff, clearly in error and perhaps even impossible.
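
Here's a minimal sketch of that race in C, assuming hypothetical memory-mapped port addresses for the encoder's two bytes:

#include <stdint.h>

/* Hypothetical memory-mapped encoder ports; the addresses are illustrative. */
#define ENC_LOW   (*(volatile uint8_t *)0x40)   /* position bits 0-7 */
#define ENC_HIGH  (*(volatile uint8_t *)0x41)   /* position bits 8-9 */

/* Read the 10-bit wrist position with two byte-wide operations.
   If the encoder rolls from 0x0ff to 0x100 between the two reads,
   this returns 0x1ff: a stale low byte paired with a new high byte. */
uint16_t get_position(void)
{
    uint8_t low  = ENC_LOW;          /* the encoder keeps moving...    */
    uint8_t high = ENC_HIGH & 0x03;  /* ...so these may not correspond */
    return ((uint16_t)high << 8) | low;
}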

This is a common problem. Handling input from a two-axis controller? If the hardware continues to move during our reads, the X and Y data will be slightly uncorrelated, perhaps yielding impossible results. A friend of mine once tracked a rare autopilot failure to the way the code read a flux-gate compass, the output of which is a pair of related quadrature signals. Reading them at disparate times while the vessel continued to move yielded impossible heading data.

Input capture register

Hardware folks have dealt with similar problems for decades. Their usual solution is to add an input capture register between the I/O device and the processor. The register is nothing more than a parallel latch as wide as the input data. The 10-bit encoder has a 10-bit register and the encoder's output goes to the register's inputs. A single clock line drives each flip-flop in the latch; when strobed, it locks the data into the register. The output is fed to a pair of processor input ports.

When it's time to read a safe, unchanging value, the code issues a “hold the data now” command that strobes encoder values into the latch. So all 10 bits are stored and can be read by the software at any time, with no fear of things changing between reads.
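
In C, the latch-then-read sequence might look like the following sketch; the strobe mechanism (a write that clocks the latch) and the port addresses are assumptions for illustration:

#include <stdint.h>

/* Hypothetical ports: a write to LATCH_STROBE clocks the capture register. */
#define LATCH_STROBE (*(volatile uint8_t *)0x42)
#define ENC_LOW      (*(volatile uint8_t *)0x40)   /* latched bits 0-7 */
#define ENC_HIGH     (*(volatile uint8_t *)0x41)   /* latched bits 8-9 */

uint16_t get_position(void)
{
    LATCH_STROBE = 1;   /* freeze all 10 encoder bits in the latch */

    /* Both reads see the same snapshot; nothing changes between them. */
    uint8_t low  = ENC_LOW;
    uint8_t high = ENC_HIGH & 0x03;
    return ((uint16_t)high << 8) | low;
}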

Some designers tie the register's clock input to one of the port control lines. The I/O read instruction then automatically strobes data into the latch, assuming one is wise enough to ensure the register latches data on the leading edge of the clock.

The input capture register is a simple way to suspend moving data for the duration of a couple of reads. At first glance it seems perfectly safe. But a bit of analysis shows that it is not reliable for asynchronous inputs. We're using hardware to fix a software problem, so we must be aware of the limitations of physical logic devices.

To simplify things for a minute, let's zoom in on that input capture register and examine just one of its bits. Each bit gets stored in a flip-flop, a bit of logic that might have only three connections: data in, data out, and clock. When the input is a 1, strobing the clock puts a 1 at the output.

But suppose the input changes at about the same time the clock cycles? What happens? The short answer is that no one knows.

Metastable states

Every flip-flop has two critical specifications we violate at our peril. Set-up time is the minimum number of nanoseconds that input data must be stable before the clock arrives. Hold time tells us how long to keep the data present after the clock transitions. These specs vary depending on the logic device. Some might require tens of nanoseconds of set-up and/or hold time; others need an order of magnitude less.

If we tend to our knitting we'll respect these parameters and the flip-flop will always be predictable. But when things are asynchronous (say, the wrist rotates at its own rate and the software does a read whenever it needs data), it's possible we'll violate set-up or hold time.

Suppose the flip-flop requires three nanoseconds of set-up time. Our data changes within that window, flipping state perhaps a single nanosecond before clock transitions. The device will go into a metastable state where the output gets very strange indeed.

When the spec is violated, the device really doesn't know if we presented a 0 or a 1. Its output goes not to a logic state, but to a half-level (in between the digital norms), or it will oscillate, toggling wildly between states. The flip-flop is metastable.

This craziness doesn't last long. After anywhere from a few to 50 nanoseconds, the oscillations damp out or the half-state disappears, leaving the output at a valid 1 or 0. But which one is it? This is a digital system, and we expect ones to be ones, and zeroes to be zeroes.

The output is random. Bummer, that. You cannot predict which level it will assume. That sure makes it hard to design predictable digital systems!

Hardware folks feel that the random output isn't a problem. Since the input changed at almost exactly the same time the clock strobed, either a 0 or a 1 is reasonable. If we had clocked just a hair ahead or behind we'd have gotten a different value anyway. Philosophically, who knows which state we measured? Is this really a big deal? Maybe not to the EEs, but this impacts our software in a big way, as we'll see shortly.

Metastability occurs only when clock and data arrive almost simultaneously; the odds increase as clock rates soar. The type of logic component used is an equally important factor. Slower logic (like 74HCxx) has a much wider metastable window than faster devices (say, 74FCTxx). At reasonable rates, the odds of the two asynchronous signals arriving closely enough in time to cause a metastable situation are low. Measurable? Yes. Important? Certainly. With a 10MHz clock and 10kHz data rate, using typical but not terribly speedy logic, metastable errors occur about once a minute. Though infrequent, no reliable system can stand that failure rate.
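
Where do numbers like that come from? The standard model (covered in the TI report referenced at the end of this article) makes the mean time between failures exponential in the settling time you allow. A quick calculation, using made-up but plausible device constants rather than values from any real datasheet, might look like this:

#include <math.h>
#include <stdio.h>

/* Mean time between metastable failures, per the standard model:
       MTBF = e^(t_r / tau) / (T0 * f_clk * f_data)
   tau and T0 are constants measured for each logic family; t_r is the
   settling time allowed before the output is used. The constants below
   are illustrative guesses, not data for any particular device. */
int main(void)
{
    double tau    = 1.5e-9;   /* metastability decay constant, seconds */
    double T0     = 1.0e-9;   /* metastable window constant, seconds   */
    double f_clk  = 10e6;     /* 10MHz clock                           */
    double f_data = 10e3;     /* 10kHz data                            */
    double t_r    = 50e-9;    /* settling time before we use the data  */

    double mtbf = exp(t_r / tau) / (T0 * f_clk * f_data);
    printf("MTBF: %.2g seconds\n", mtbf);
    return 0;
}

Note how sensitive the result is to t_r: every additional tau of settling time multiplies the MTBF by e.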

The classic metastable fix uses two flip-flops connected in series. Data goes to the first and its output feeds the data input of the second. Both use the same clock input. The second flop's output will be correct after two clocks, since the odds of two metastable events occurring back-to-back are almost nil. With two flip-flops, and at reasonable data rates, errors occur millions or even billions of years apart. Good enough for most systems.

But “correct” means the second stage's output will not be metastable: it's not oscillating, nor is it at an illegal voltage level. There's still an equal chance the value will be in either legal logic state.

Firmware, not hardware

To my knowledge there's no literature about how metastability affects software, yet it poses very real threats to building a reliable system.

Hardware designers smugly cure their metastability problem using the two-stage flops described previously. Their domain is that of a single bit, whose input changed just about the same time the clock transitioned. Thinking in such narrow terms, it's indeed reasonable to accept the inherently random output the flops generate.

But we software folks are reading parallel I/O ports, each perhaps 8 bits wide. That means eight flip-flops are present in the input capture register, all driven by the same clock pulse.

Let's look at what might happen. The encoder changes from 0xff to 0x100. This small difference might represent just a tiny change in angle. We request a read at about the same time the data changes. Our input operation strobes the capture register's clock, creating a violation of set-up or hold time. Every input bit changes, and each of the flip-flops inside the register goes metastable. After a short time the oscillations die out, but now every bit in the register is random. Though the hardware folks might shrug and complain that no one knows what the right value was, since everything changed as clock arrived, the data was, in fact, around 0xff or 0x100. A random result of, say, 0x12 is absurd and totally unacceptable, and may lead to crazy system behavior.

The case where data goes from 0xff to 0x100 is pathological since every bit changes at once. The system faces the same peril whenever lots of bits change. 0x0f to 0x10. 0x1f to 0x20. The upper, unchanging data bits will always latch correctly, but every changing bit is at risk.

Why not use the multiple flip-flop solution? Connect two input capture registers in series, both driven by the same clock. Though this will eliminate the illegal logic states and oscillations, the second stage's output will be random as well.

One option is to ignore metastability and hope for the best. Or use very fast logic with very narrow set-up/hold time windows to reduce the odds of failure. If the code samples the inputs infrequently, it's possible to reduce metastability to one chance in millions or even billions. Building a safety-critical system? Feeling lucky?

It is possible to build a synchronizer circuit that takes a request for a read from the processor, combines it with a data-available bit from the I/O device, and responds with a data-OK signal back to the CPU. But this is non-trivial and prone to errors.
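
The software side of such a handshake might be nothing more than a poll. In this sketch every register name and address is invented for illustration:

#include <stdint.h>

/* Hypothetical synchronizer interface; names and addresses are invented. */
#define SYNC_REQUEST (*(volatile uint8_t *)0x44)  /* write 1 to request data */
#define SYNC_STATUS  (*(volatile uint8_t *)0x45)  /* bit 0 = data-OK         */
#define SYNC_DATA_LO (*(volatile uint8_t *)0x46)
#define SYNC_DATA_HI (*(volatile uint8_t *)0x47)

uint16_t read_synchronized(void)
{
    SYNC_REQUEST = 1;                  /* ask the hardware for a clean sample */
    while ((SYNC_STATUS & 0x01) == 0)
        ;                              /* wait for data-OK; real code would   */
                                       /* add a timeout here                  */
    uint8_t low  = SYNC_DATA_LO;
    uint8_t high = SYNC_DATA_HI & 0x03;
    return ((uint16_t)high << 8) | low;
}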

An alternative is to use a different coding scheme for the I/O device. Buy an encoder with Gray Code output, for example (if you can find one). Gray Code is a counting scheme where only a single bit changes between numbers, as follows:

000
001
011
010
110
111
101
100

Gray code makes sense if, and only if, your code reads the device faster than it's likely to change, and if the changes happen in a fairly predictable fashion, like counting up. Then there's no real chance of more than a single bit changing between reads; if the inputs go metastable, only one bit will be wrong. The result will still be reasonable.
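
Converting between Gray code and binary is cheap in software. These routines show one standard approach for values up to 16 bits:

#include <stdint.h>

/* Encoding: each Gray bit is the XOR of adjacent binary bits. */
uint16_t binary_to_gray(uint16_t binary)
{
    return binary ^ (binary >> 1);
}

/* Decoding: each binary bit is the XOR of all Gray bits at or above
   its position; the cascaded shift-and-XOR steps accumulate that. */
uint16_t gray_to_binary(uint16_t gray)
{
    uint16_t binary = gray;
    binary ^= binary >> 1;
    binary ^= binary >> 2;
    binary ^= binary >> 4;
    binary ^= binary >> 8;
    return binary;
}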

Another solution is to compute a parity or checksum of the input data before the capture register, and latch that into the register as well. Have the code compute the parity of the data it read and compare it to the latched value. If they disagree, do another read.
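
Here's a sketch of that scheme, assuming the hardware latches an odd-parity bit alongside the data; all the port names and addresses are hypothetical:

#include <stdint.h>

/* Hypothetical ports; the hardware computes parity of the raw inputs
   and latches it into the capture register along with the data. */
#define LATCH_STROBE (*(volatile uint8_t *)0x42)
#define CAP_DATA     (*(volatile uint8_t *)0x48)
#define CAP_PARITY   (*(volatile uint8_t *)0x49)  /* bit 0 = odd parity */

static uint8_t parity8(uint8_t v)   /* 1 if v has an odd number of 1s */
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1;
}

uint8_t read_checked(void)
{
    uint8_t data, parity;
    do {
        LATCH_STROBE = 1;              /* take a fresh snapshot         */
        data   = CAP_DATA;
        parity = CAP_PARITY & 1;
    } while (parity8(data) != parity); /* mismatch means the latch went */
                                       /* metastable, so read again     */
    return data;
}

Keep in mind that a single parity bit catches only an odd number of bad bits; a checksum does better when many bits may change at once.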

Though I've discussed adding an input capture register, please don't think that this is the root cause of the problem. Without that register-if you just feed the asynchronous inputs directly into the CPU-it's quite possible to violate the processor's innate set-up/hold times. There's no free lunch. All logic has physical constraints we must honor.

Don't panic!

Some designs will never have a metastability problem. It always stems from violating set-up or hold times, which in turn comes from either poor design or asynchronous inputs.

All of this discussion has revolved around asynchronous inputs, when the clock and data are unrelated in time. Be wary of anything not slaved to the processor's clock. Interrupts are a notorious source of problems. If one is caused by, say, someone pressing a button, be sure that the interrupt itself, and the vector-generating logic, don't violate the processor's set-up and hold times.

But in computer systems, most things do happen synchronously. If you're reading a timer that operates from the CPU's clock, it is inherently synchronous to the code. From a metastability standpoint, it's totally safe.

Bad design, though, can plague any electronic system. Every logic component takes time to propagate data. When a signal traverses many devices, the delays can add up significantly. If the data then goes to a latch, it's quite possible that the delays may cause the input to transition at the same time as the clock. Instant metastability.

Designers are pretty careful to avoid these situations, though. Do be wary of FPGAs and other components where the delays vary depending on how the software routes the device. And when latching data or clocking a counter, it's not hard to create a metastability problem by using the wrong clock edge. Pick the edge that gives the device time to settle before it's read.

What about analog inputs? Connect a 12-bit A/D converter to two 8-bit ports and we'd seem to have a similar problem: the analog data can wiggle all over, changing while we read the two ports. However, an input capture register isn't necessary, because the converter itself generally includes a sample-and-hold block that stores the analog signal while the A/D digitizes. Most A/Ds then store the digital value until we start the next conversion.
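
A typical read sequence might look like the following sketch; the converter's register names, addresses, and busy-bit polarity are all assumptions:

#include <stdint.h>

/* Hypothetical 12-bit A/D interface, invented for illustration. */
#define ADC_START (*(volatile uint8_t *)0x4a)  /* write to start conversion  */
#define ADC_BUSY  (*(volatile uint8_t *)0x4b)  /* bit 0 set while converting */
#define ADC_LOW   (*(volatile uint8_t *)0x4c)  /* result bits 0-7            */
#define ADC_HIGH  (*(volatile uint8_t *)0x4d)  /* result bits 8-11           */

uint16_t read_adc(void)
{
    ADC_START = 1;          /* the sample-and-hold freezes the analog input */
    while (ADC_BUSY & 0x01)
        ;                   /* wait for the conversion to complete          */

    /* The result register holds still until the next conversion starts,
       so two byte-wide reads are safe without an external capture latch. */
    uint8_t low  = ADC_LOW;
    uint8_t high = ADC_HIGH & 0x0f;
    return ((uint16_t)high << 8) | low;
}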

There's a lot of published information about metastability in circuits. One of the best references is a Texas Instruments report (SDYA006) titled “Metastable Response in 5-V Logic Circuits.” The formulas and empirical data it includes will help you quantitatively calculate the risks in your designs.

Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. He founded two companies specializing in embedded systems. Contact him at .
