Firmware, Not Hardware
To my knowledge there's no literature about how metastability affects
software, yet it poses very real threats to building a reliable system.
Hardware designers smugly cure their metastability problem using the
two-stage flops described. Their domain is that of a single bit, whose
input changed at just about the same time the clock transitioned. Thinking
in such narrow terms, it's indeed reasonable to accept the inherently
random output the flops generate.
However, we software folks are reading parallel I/O ports, each
perhaps 8 bits wide. That means there are 8 flip-flops in the input
capture register, all driven by the same clock pulse.
Let's look at what might happen. The encoder changes from 0xff to
0x100, so all eight bits on our port go from one to zero. This small
difference might represent just a tiny change in angle. We request a
read at just about the same time the data changes; our input operation
strobes the capture register's clock, creating a violation of set-up or
hold time.
Every input bit changes, so each of the flip-flops inside the register
goes metastable. After a short time the oscillations die out, but now
every bit in the register is random. Though the hardware folks might
shrug and complain that no one knows what the right value was, since
everything changed as the clock arrived, in fact the data was around
0xff or 0x100. A random result of, say, 0x12 is absurd and totally
unacceptable, and may lead to crazy system behavior.
The case where data goes from 0xff to 0x100 is pathological since
every bit changes at once. The system faces the same peril whenever
lots of bits change: 0x0f to 0x10, say, or 0x1f to 0x20. The upper,
unchanging data bits will always latch correctly, but every changing
bit is at risk.
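
To see which flip-flops are exposed on a given transition, XOR the old and new values: every one bit in the result marks a changing input. The sketch below is only an illustration. The function names are made up for this example, and it uses a 16-bit type so the 0xff to 0x100 case fits.

```c
#include <stdint.h>

/* Bits that differ between two successive input values. Only these
   flip-flops see a changing input, so only these can go metastable. */
uint16_t bits_at_risk(uint16_t before, uint16_t after)
{
    return (uint16_t)(before ^ after);
}

/* Number of ones in mask, i.e. how many flip-flops are at risk. */
unsigned count_bits(uint16_t mask)
{
    unsigned n = 0;
    while (mask) {
        mask &= (uint16_t)(mask - 1u);  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Could `seen` have come from sampling during the change from `before`
   to `after`? The unchanging bits must match; the rest may be anything. */
int plausible_capture(uint16_t before, uint16_t after, uint16_t seen)
{
    uint16_t stable = (uint16_t)~(before ^ after);
    return (seen & stable) == (before & stable);
}
```

For 0x0f to 0x10, bits_at_risk() returns 0x1f: five bits may latch as anything, so the read could land anywhere from 0x00 to 0x1f.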
Why not use the multiple flip-flop solution? Connect two input
capture registers in series, both driven by the same clock. Though this
will eliminate the illegal logic states and oscillations, the second
stage's output will be random as well.
One option is to ignore metastability and hope for the best. Or use
very fast logic with very narrow set-up/hold time windows to reduce the
odds of failure. If the code samples the inputs infrequently it's
possible to reduce the odds of metastability to one chance in millions
or even billions. Building a safety critical system? Feeling lucky?
It is possible to build a synchronizer circuit that takes a request
for a read from the processor, combines it with a data-available bit
from the I/O device, and responds with a data-OK signal back to the
CPU. This is nontrivial and prone to errors.
An alternative is to use a different coding scheme for the I/O
device. Buy an encoder with Gray code output, for example (if you can
find one). Gray code is a counting scheme where only a single bit
changes between numbers, as follows:
0 000
1 001
2 011
3 010
4 110
5 111
6 101
7 100
Gray code makes sense if, and only if, your code reads the device
faster than it's likely to change, and if the changes happen in a
fairly predictable fashion, like counting up. Then there's no real
chance of more than a single bit changing between reads; if the inputs
go metastable only one bit will be wrong, so the value read is either
the old count or the new one. The result will still be reasonable.
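
Firmware that reads a Gray-coded device usually has to convert the reading back to an ordinary binary count before doing arithmetic on it. Here is a minimal sketch of the standard conversions, assuming an 8-bit reflected Gray code like the one in the table above. The function names are my own, not from any particular library.

```c
#include <stdint.h>

/* Binary count -> reflected Gray code: each output bit is the XOR of
   a binary bit and its more significant neighbor. */
uint8_t binary_to_gray(uint8_t b)
{
    return (uint8_t)(b ^ (b >> 1));
}

/* Gray code -> binary count. Each binary bit is the XOR of all Gray
   bits at or above its position; three shift-and-XOR steps cover
   the full 8 bits. */
uint8_t gray_to_binary(uint8_t g)
{
    g ^= (uint8_t)(g >> 4);
    g ^= (uint8_t)(g >> 2);
    g ^= (uint8_t)(g >> 1);
    return g;
}
```

If the one changing bit latches wrong, gray_to_binary() still returns either the old count or the new one, never something wild like 0x12.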
Another solution is to compute a parity or checksum of the input
data before the capture register. Latch that, as well, into the
register. Have the code compute parity and compare it to the value
read; if there's an error, do another read.
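
The software side of the parity scheme might look something like the sketch below. Everything about the hardware here is an assumption made for illustration: a 9-bit capture register with data in bits 0-7 and a hardware-generated parity bit (1 when the data holds an odd number of ones) in bit 8, read through a caller-supplied function. demo_port() is a stand-in that replays canned samples so the example can run without hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical port access: returns data in bits 0-7 and the latched
   parity bit in bit 8. Real code would read a device register. */
typedef uint16_t (*port_read_fn)(void);

/* Returns 1 if data contains an odd number of ones, else 0. */
uint8_t parity8(uint8_t data)
{
    data ^= (uint8_t)(data >> 4);
    data ^= (uint8_t)(data >> 2);
    data ^= (uint8_t)(data >> 1);
    return (uint8_t)(data & 1u);
}

/* Read until computed parity matches the latched parity bit, giving
   up after max_tries. Returns true and stores the data on success. */
bool read_checked(port_read_fn read_port, unsigned max_tries, uint8_t *out)
{
    for (unsigned i = 0; i < max_tries; i++) {
        uint16_t raw = read_port();
        uint8_t data = (uint8_t)(raw & 0xffu);
        uint8_t latched = (uint8_t)((raw >> 8) & 1u);
        if (parity8(data) == latched) {
            *out = data;
            return true;
        }
    }
    return false;  /* caller decides what to do with a persistent error */
}

/* Stand-in for hardware: alternates a corrupted sample (0x12 with the
   wrong parity bit) and a good one (0x10 with the right parity bit). */
static unsigned demo_index;
uint16_t demo_port(void)
{
    static const uint16_t samples[] = { 0x112u, 0x110u };
    return samples[demo_index++ % 2u];
}
```

Bear in mind that a single parity bit only flags an odd number of wrong bits, and the parity flip-flop can itself go metastable. A wider checksum catches more cases at the cost of more hardware.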
Although I've discussed adding an input capture register, please
don't think that this is the root cause of the problem. Without that
register (if you just feed the asynchronous inputs directly into the
CPU) it's quite possible to violate the processor's innate set-up/hold
times.
There's no free lunch; all logic has physical constraints we must
honor. Some designs will never have a metastability problem. Where the
problem does appear, it always stems from violating set-up or hold
times, which in turn comes from either poor design or asynchronous
inputs.