Back to the Basics - Practical Embedded Coding Tips: Part 3 - Embedded.com

Every flip-flop has two critical specifications we violate at our peril. “Set-up time” is the minimum number of nanoseconds that input data must be stable before the clock arrives. “Hold time” tells us how long to keep the data present after the clock transitions.

These specs vary depending on the logic device. Some might require tens of nanoseconds of set-up and/or hold time; others need an order of magnitude less.

Figure 9.1: Setup and Hold Times

If we tend to our knitting we'll respect these parameters and the flip-flop will always be totally predictable. But when things are asynchronous (say, a robot's wrist rotates at its own rate and the software does a read whenever it needs data) there's a chance that we'll violate set-up or hold time.

Suppose the flip-flop requires 3 nanoseconds of set-up time. Our data changes within that window, flipping state perhaps a single nanosecond before the clock transitions. The device will go into a metastable state where the output gets very strange indeed.

Because we violated the specification, the device really doesn't know whether we presented a zero or a one. Its output goes not to a logic state but either to a half-level (in between the digital norms) or into oscillation, toggling wildly between states. The flip-flop is metastable.

Figure 9.2: A Metastable State

This craziness doesn't last long; typically after a few to 50 nanoseconds the oscillations damp out or the half-state disappears, leaving the output at a valid one or zero. But which one is it? This is a digital system, and we expect ones to be ones, and zeroes zeroes.

The output is random. Bummer, that. You cannot predict which level it will assume. That sure makes it hard to design predictable digital systems!

Hardware folks feel that the random output isn't a problem. Since the input changed at almost exactly the same time the clock strobed, either a zero or a one is reasonable. If we had clocked just a hair ahead or behind we'd have gotten a different value, anyway. Philosophically, who knows which state we measured? Is this really a big deal? Maybe not to the EEs, but this impacts our software in a big way, as we'll see shortly.

Metastability occurs only when clock and data arrive almost simultaneously; the odds increase as clock rates soar. An equally important factor is the type of logic component used: slower logic (like 74HCxx) has a much wider metastable window than faster devices (say, 74FCTxx).

Clearly at reasonable rates the odds of the two asynchronous signals arriving closely enough in time to cause a metastable situation are low. Measurable, yes. Important, certainly. With a 10 MHz clock and a 10 kHz data rate, using typical but not terribly speedy logic, metastable errors occur about once a minute. Though infrequent, no reliable system can stand that failure rate.
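For a rough feel for where a number like "once a minute" comes from, the usual synchronizer failure model can be sketched as below. It is not derived in this article. The symbols are assumptions taken from common data-sheet practice: t_r is the time the output is allowed to resolve before it is used, tau is the device's resolution time constant, and T_0 is a device-specific constant describing the width of the vulnerable window.

```latex
\mathrm{MTBF} = \frac{e^{\,t_r/\tau}}{T_0 \, f_{\mathrm{clock}} \, f_{\mathrm{data}}}

% Worked with the rates quoted above:
% f_clock * f_data = 10^{7}\,\mathrm{s^{-1}} \times 10^{4}\,\mathrm{s^{-1}} = 10^{11}\,\mathrm{s^{-2}}
% MTBF \approx 60\,\mathrm{s} \;\Rightarrow\; T_0\, e^{-t_r/\tau} \approx \frac{1}{60 \times 10^{11}} \approx 1.7 \times 10^{-13}\,\mathrm{s}
```

The back-solved value is only what the quoted failure rate implies for that unnamed logic family; real values of tau and T_0 have to come from the part's characterization data. The useful takeaway is the exponent: every extra tau of resolution time multiplies the MTBF by e.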

The classic metastable fix uses two flip-flops connected in series. Data goes to the first; its output feeds the data input of the second. Both use the same clock input. The second flop's output will be “correct” after two clocks, since the odds of two metastable events occurring back-to-back are almost nil. With two flip-flops, at reasonable data rates errors occur millions or even billions of years apart, good enough for most systems.

However, “correct” means only that the second stage's output will not be metastable: it's not oscillating, nor is it at an illegal voltage level. There's still an equal chance the value will be in either legal logic state.

Firmware, Not Hardware
To my knowledge there's no literature about how metastability affects software, yet it poses very real threats to building a reliable system.

Hardware designers smugly cure their metastability problem using the two-stage flops described. Their domain is that of a single bit, whose input changed just about the same time the clock transitioned. Thinking in such narrow terms, it's indeed reasonable to accept the inherently random output the flops generate.

However, we software folks are reading parallel I/O ports, each perhaps 8 bits wide. That means there are 8 flip-flops in the input capture register, all driven by the same clock pulse.

Let's look at what might happen. The encoder changes from 0xff to 0x100. This small difference might represent just a tiny change in angle. We request a read at just about the same time the data changes; our input operation strobes the capture register's clock, creating a violation of set-up or hold time.

Every input bit changes, so each of the flip-flops inside the register goes metastable. After a short time the oscillations die out, but now every bit in the register is random. Though the hardware folks might shrug and complain that no one knows what the right value was, since everything changed as clock arrived, in fact the data was around 0xff or 0x100. A random result of, say, 0x12 is absurd and totally unacceptable, and may lead to crazy system behavior.

The case where data goes from 0xff to 0x100 is pathological since every bit changes at once. The system faces the same peril whenever lots of bits change: 0x0f to 0x10, or 0x1f to 0x20. The upper, unchanging data bits will always latch correctly, but every changing bit is at risk.
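One way to see which bits are exposed on a given transition is to XOR the old and new values: every 1 in the result marks a flip-flop whose input is changing as the clock arrives. Here is a minimal sketch in C. The function names are hypothetical, chosen only for illustration.

```c
#include <stdint.h>

/* Bits that differ between two successive input values. Each set bit
   marks a flip-flop whose input is changing and may latch randomly. */
uint16_t bits_at_risk(uint16_t old_value, uint16_t new_value)
{
    return (uint16_t)(old_value ^ new_value);
}

/* Count the set bits in a 16-bit mask. */
unsigned count_bits(uint16_t mask)
{
    unsigned n = 0;
    while (mask != 0) {
        mask &= (uint16_t)(mask - 1);   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

For 0xff to 0x100 the mask is 0x1ff, so nine bits are at risk. For 0x0f to 0x10 it is 0x1f, five bits. Any bit outside the mask is unchanged and latches correctly.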

Why not use the multiple flip-flop solution? Connect two input capture registers in series, both driven by the same clock. Though this will eliminate the illegal logic states and oscillations, the second stage's output will be random as well.

One option is to ignore metastability and hope for the best. Or use very fast logic with very narrow set-up/hold time windows to reduce the odds of failure. If the code samples the inputs infrequently it's possible to reduce metastability to one chance in millions or even billions. Building a safety critical system? Feeling lucky?

It is possible to build a synchronizer circuit that takes a request for a read from the processor and combines it with a data-available bit from the I/O device, responding with a data-OK signal back to the CPU. This is nontrivial and prone to errors.

An alternative is to use a different coding scheme for the I/O device. Buy an encoder with Gray code output, for example (if you can find one). Gray code is a counting scheme where only a single bit changes between numbers, as follows:

0 000
1 001
2 011
3 010
4 110
5 111
6 101
7 100

Gray code makes sense if, and only if, your code reads the device faster than it's likely to change, and if the changes happen in a fairly predictable fashion, like counting up. Then there's no real chance of more than a single bit changing between reads; if the inputs go metastable, only one bit will be wrong. The result will still be reasonable.
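The table above matches the standard reflected binary Gray code, so firmware that needs an ordinary binary count can convert in a few lines. This is a sketch under that assumption; the function names are hypothetical, and a particular encoder may use a different Gray variant.

```c
#include <stdint.h>

/* Convert a binary count to reflected Gray code. */
uint16_t binary_to_gray(uint16_t binary)
{
    return (uint16_t)(binary ^ (binary >> 1));
}

/* Convert reflected Gray code back to a binary count by folding in
   each successively shifted copy of the input. */
uint16_t gray_to_binary(uint16_t gray)
{
    uint16_t binary = gray;
    while ((gray >>= 1) != 0) {
        binary ^= gray;
    }
    return binary;
}
```

Applied to the earlier troublesome transition, 0xff and 0x100 encode as 0x080 and 0x180, which differ in a single bit. A read that lands on the transition can therefore only return one of the two neighboring positions.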

Another solution is to compute a parity or checksum of the input data before the capture register. Latch that, as well, into the register. Have the code compute parity and compare it to the value read; if there's an error, do another read.
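The firmware side of the parity scheme might look like the sketch below. Everything about the hardware interface is an assumption made for illustration: a 9-bit latched word with data in bits 7..0 and a hardware-generated even-parity bit in bit 8, a hypothetical read function passed in as a pointer, and an arbitrary retry limit. The `fake_capture_read` stub stands in for the real port so the example can run on a PC.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_READ_ATTEMPTS 4u   /* assumed retry budget */

/* Hypothetical accessor: returns the latched word, data in bits 7..0
   and the even-parity bit in bit 8. */
typedef uint16_t (*capture_read_fn)(void);

/* Returns 1 if the byte holds an odd number of 1 bits, else 0. */
uint8_t parity_of(uint8_t data)
{
    data ^= (uint8_t)(data >> 4);
    data ^= (uint8_t)(data >> 2);
    data ^= (uint8_t)(data >> 1);
    return (uint8_t)(data & 1u);
}

/* Read until the latched parity bit matches the parity computed from
   the data. Returns true and writes *out on success, false if every
   attempt failed. */
bool read_with_parity(capture_read_fn read_capture, uint8_t *out)
{
    for (unsigned attempt = 0; attempt < MAX_READ_ATTEMPTS; attempt++) {
        uint16_t word = read_capture();
        uint8_t data = (uint8_t)(word & 0xFFu);
        uint8_t stored_parity = (uint8_t)((word >> 8) & 1u);
        if (parity_of(data) == stored_parity) {
            *out = data;
            return true;
        }
    }
    return false;
}

/* Stand-in for the hardware, for demonstration only: the first read
   returns 0x5B with a parity bit that does not match, and later reads
   return 0x5A with a matching parity bit. */
static unsigned fake_reads = 0;

uint16_t fake_capture_read(void)
{
    fake_reads++;
    return (uint16_t)((fake_reads == 1) ? 0x005Bu : 0x005Au);
}
```

Calling `read_with_parity(fake_capture_read, &value)` rejects the first word and returns 0x5A from the second. Note that a single parity bit only catches an odd number of wrong bits, so when many bits can go random at once a wider checksum is the safer choice.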

Although I've discussed adding an input capture register, please don't think that this is the root cause of the problem. Without that register, if you just feed the asynchronous inputs directly into the CPU, it's quite possible to violate the processor's innate set-up/hold times.

There's no free lunch; all logic has physical constraints we must honor. Some designs will never have a metastability problem. When one does occur, it always stems from violating set-up or hold times, which in turn comes from either poor design or asynchronous inputs.

All of the discussion so far has revolved around asynchronous inputs, when the clock and data are unrelated in time. Be wary of anything not slaved to the processor's clock. Interrupts are a notorious source of problems.

If an interrupt is caused by, say, someone pressing a button, be sure that the interrupt itself, and the vector-generating logic, don't violate the processor's set-up and hold times.

However, in computer systems most things do happen synchronously. If you're reading a timer that operates from the CPU's clock, it is inherently synchronous to the code. From a metastability standpoint it's totally safe.

Bad design, though, can plague any electronic system. Every logic component takes time to propagate data; when a signal traverses many devices the delays can add up significantly. If the data then goes to a latch it's quite possible that the delays may cause the input to transition at the same time as the clock. Instant metastability.

Designers are pretty careful to avoid these situations, though. Do be wary of FPGAs and other components where the delays vary depending on how the software routes the device. In addition, when latching data or clocking a counter it's not hard to create a metastability problem by using the wrong clock edge. Pick the edge that gives the device time to settle before it's read.

What about analog inputs? Connect a 12-bit A/D converter to two 8-bit ports and we'd seem to have a similar problem: the analog data can wiggle all over, changing during the time we read the two ports.

However, there's no need for an input capture register because the converter itself generally includes a “sample and hold” block, which stores the analog signal while the A/D digitizes. Most A/Ds then store the digital value till we start the next conversion.

Other sorts of inputs we use all share this problem. Suppose a robot uses a 10-bit encoder to monitor the angular location of a wrist joint. As the wrist rotates the encoder sends back a binary code, 10 bits wide, representing the joint's current position. An 8-bit processor requires two distinct I/O instructions (two byte-wide reads) to get the data. No matter how fast the computer might be, there's a finite time between the reads during which the encoder data may change.

The wrist is rotating. A “get_position” routine reads 0xff from the low part of the position data. Then, before the next instruction, the encoder rolls over to 0x100. “get_position” reads the high part of the data (now 0x1) and returns a position of 0x1ff, clearly in error and perhaps even impossible.
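The failure is easy to reproduce on a PC with a simulated encoder. In the sketch below every name except the idea of a “get_position” routine is hypothetical: `read_low` and `read_high` stand in for the two byte-wide I/O instructions, and the simulation rolls the encoder from 0x0ff to 0x100 right after the first read. The second routine shows one common mitigation that is not taken from this article: read high, low, then high again, and retry if the high part moved.

```c
#include <stdint.h>

/* Simulated 10-bit encoder, for illustration only. It sits at 0x0ff
   and rolls over to 0x100 immediately after the first byte-wide read,
   mimicking a wrist that moves between two I/O instructions. */
static uint16_t encoder_value = 0x0ffu;
static unsigned reads_so_far = 0;

static void encoder_tick(void)
{
    reads_so_far++;
    if (reads_so_far == 1) {
        encoder_value = 0x100u;   /* the rollover happens here */
    }
}

void encoder_reset(void)
{
    encoder_value = 0x0ffu;
    reads_so_far = 0;
}

uint8_t read_low(void)
{
    uint8_t v = (uint8_t)(encoder_value & 0xffu);
    encoder_tick();
    return v;
}

uint8_t read_high(void)
{
    uint8_t v = (uint8_t)(encoder_value >> 8);
    encoder_tick();
    return v;
}

/* Naive version: low byte, then high byte. */
uint16_t get_position_naive(void)
{
    uint8_t low  = read_low();
    uint8_t high = read_high();
    return (uint16_t)(((uint16_t)high << 8) | low);
}

/* Mitigation sketch: accept the low byte only if the high byte was the
   same before and after it was read. */
uint16_t get_position_checked(void)
{
    uint8_t high, low, high_again;
    do {
        high       = read_high();
        low        = read_low();
        high_again = read_high();
    } while (high != high_again);
    return (uint16_t)(((uint16_t)high << 8) | low);
}
```

After `encoder_reset()`, `get_position_naive()` returns the bogus 0x1ff described above, while `get_position_checked()` notices the high byte changed, retries, and returns 0x100. The checked version assumes the encoder cannot advance a full 256 counts between reads, and real code would want a retry limit so that a fast-moving input cannot hold the loop indefinitely.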

This is a common problem when handling input from a two-axis controller. If the hardware continues to move during our reads, then the X and Y data will be slightly uncorrelated, perhaps yielding impossible results.

One friend tracked a rare autopilot failure to the way the code read a flux-gate compass, whose output is a pair of related quadrature signals. Reading them at disparate times, while the vessel continued to move, yielded impossible heading data.

Next in Part 4: Dealing with interrupt latency
To read Part 1 in this series, go to Reentrancy, atomic variables and recursion.
To read Part 2 in this series, go to Asynchronous Hardware/Firmware

Jakob Engblom (jakob@virtutech.com) is technical marketing manager at Virtutech. He has an MSc in computer science and a PhD in Computer Systems from Uppsala University, and has worked with programming tools and simulation tools for embedded and real-time systems since 1997.
He was a contributor of material to “The Firmware Handbook,” edited by Jack Ganssle, upon which this series of articles was based and printed with permission from Newnes, a division of Elsevier. Copyright 2008. For other publications by Jakob Engblom, see www.engbloms.se/jakob.html.
