EMI in the serial channels
Another area that typically gets affected by EMI is serial
communications channels. Even if a noise tolerant physical layer such
as RS485 or LVDS (Low-voltage differential signaling) is used in a
communications link, data can be corrupted by noise.
Software can detect these errors and respond appropriately.
Simple errors contained in a single byte may be detected through a
framing or parity error. Typically a UART provides this built-in
detection.
If such an error is detected, the receiving device should require a
packet retransmission. Depending on the protocol, this may be
accomplished by not acknowledging the packet, or sending a special
error acknowledgement back. Alternatively, a protocol can be designed in
which the data includes error correcting codes (ECC).
This approach provides detection and correction of a limited number of
bit errors.
The disadvantage is the overhead of the additional error correcting
bits, and the inability to flawlessly deal with multiple bit errors. A
more robust (and highly recommended) method of detecting errors in a
communications packet transmission is to include a Cyclic Redundancy Check (CRC) as
part of the packet.
A two byte CRC provides 100% coverage of bit failures occurring
within the same byte and 99.998% coverage of all other bit failures.
The CRC can be used to detect errors, but does not provide any means
for error correction.
A mismatch of the CRC to the value calculated based on the received
data should result in the receiving device again requiring a
retransmission of the packet (Figure 3, below). As long as the EMI corrupts only a small
percentage of the packets, and the system was designed with sufficient
bandwidth to begin with, the overhead of the retransmissions for failed
packets will most likely not lead to unacceptable communications
throughput.
Figure 3. Communications with packet CRCs
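The receive-side check described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes the CRC-16/CCITT-FALSE variant (polynomial 0x1021, initial value 0xFFFF) and a hypothetical packet layout in which the payload is followed by the two CRC bytes, high byte first. The names `crc16_ccitt` and `packet_crc_ok` are illustrative only; a real protocol defines its own polynomial and framing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF.
 * (Assumed variant; the article does not name a specific CRC.) */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Hypothetical layout: payload bytes, then CRC high byte, then CRC low byte.
 * Returns true when the received CRC matches the CRC of the payload. */
bool packet_crc_ok(const uint8_t *packet, size_t len)
{
    if (len < 3) {
        return false; /* too short to hold a payload byte plus the CRC */
    }
    uint16_t received =
        (uint16_t)(((uint16_t)packet[len - 2] << 8) | packet[len - 1]);
    return crc16_ccitt(packet, len - 2) == received;
}
```

When `packet_crc_ok` returns false, the receiver would withhold its acknowledgement or send an error acknowledgement, whichever the protocol uses to trigger a retransmission. A table-driven CRC is usually faster than this bitwise loop, at the cost of a lookup table in program memory.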
A similar method is to include a checksum of the packet as part of
the packet transmission. The checksum is easier (i.e. faster) to
compute, but provides significantly less coverage of bit failures in
the packet transmission.
For example, if a bit is flipped from 0 to 1 in one byte, and the bit
in the same position is flipped from 1 to 0 in another byte, the
checksum is unchanged even though 2 bytes have been
corrupted. There is much information available on these and many other
communications error detection and/or correction schemes. It is highly
advisable to implement the one that best matches up to the requirements
of your device, and the environment that your device will be used in.
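The checksum weakness mentioned above is easy to reproduce. The sketch below assumes a simple 8-bit additive checksum (the sum of all bytes modulo 256); the function name `checksum8` is illustrative, and other checksum schemes behave differently.

```c
#include <stddef.h>
#include <stdint.h>

/* 8-bit additive checksum: sum of all bytes, modulo 256.
 * (Assumed scheme; the article does not specify a checksum algorithm.) */
uint8_t checksum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + data[i]);
    }
    return sum;
}
```

For instance, the buffers {0x10, 0x04, 0x33} and {0x14, 0x00, 0x33} differ in bit 2 of their first two bytes, yet both sum to 0x47, so this checksum cannot tell them apart. A CRC over the same data would detect the difference.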
Volatile memory corruption
EMI can cause volatile data memory to become corrupted. These errors
are difficult to detect, but a few methods can be employed in some
cases. When only a specific range is valid for a data element, then a
plausibility check of the data should occur before it is used.
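A plausibility check can be as small as a range test performed before the value is used. The sketch below uses a hypothetical motor speed setpoint with made-up limits; the names and the choice of fallback value are assumptions for illustration, and the right reaction to an implausible value depends on the application.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical valid range for a speed setpoint, in RPM. */
#define SPEED_MIN_RPM 0
#define SPEED_MAX_RPM 6000

/* Returns true when the value lies inside the valid range. */
bool speed_is_plausible(int32_t rpm)
{
    return (rpm >= SPEED_MIN_RPM) && (rpm <= SPEED_MAX_RPM);
}

/* Returns the value if plausible, otherwise a safe fallback (here: stopped). */
int32_t speed_or_safe_default(int32_t rpm)
{
    return speed_is_plausible(rpm) ? rpm : SPEED_MIN_RPM;
}
```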
Along these same lines, when a switch statement is used on a
variable, a 'default' case should always be included. This provides a
minimal amount of error detection, but more importantly, it prevents
the program from executing code based on a data value that was not
accounted for.
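As an illustration of the 'default' case, the sketch below maps a mode byte read from RAM to an action. The mode and action names are hypothetical; the point is only that a corrupted value such as 0xA5 falls through to a defined fault path instead of being silently ignored.

```c
#include <stdint.h>

/* Hypothetical operating modes stored in a RAM variable. */
typedef enum { MODE_IDLE = 0, MODE_RUN = 1, MODE_STOP = 2 } op_mode_t;

/* Hypothetical actions the caller can take. */
enum { ACTION_NONE, ACTION_START, ACTION_HALT, ACTION_FAULT };

int action_for_mode(uint8_t raw_mode)
{
    switch (raw_mode) {
    case MODE_IDLE:
        return ACTION_NONE;
    case MODE_RUN:
        return ACTION_START;
    case MODE_STOP:
        return ACTION_HALT;
    default:
        /* Value not accounted for: treat it as corrupted data. */
        return ACTION_FAULT;
    }
}
```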
If the data in question changes and is accessed infrequently, then
the data can be verified through the use of a CRC or checksum of the
block of data. When using a checksum, a new checksum value can be
generated more quickly if the old data value is subtracted out first,
and then the new data value is added in.
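The incremental checksum update can be sketched like this, again assuming an 8-bit additive checksum. The structure `guarded_block_t`, its 16-byte size, and the function names are illustrative assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16u /* hypothetical block size */

typedef struct {
    uint8_t data[BLOCK_SIZE];
    uint8_t checksum; /* sum of data[] modulo 256 */
} guarded_block_t;

/* Update one element: subtract the old value out of the checksum,
 * add the new value in, then store the new value. */
void block_write(guarded_block_t *b, size_t index, uint8_t value)
{
    if (index >= BLOCK_SIZE) {
        return; /* out-of-range index: ignore the write */
    }
    b->checksum = (uint8_t)(b->checksum - b->data[index] + value);
    b->data[index] = value;
}

/* Full verification: recompute the sum over the whole block and compare. */
bool block_valid(const guarded_block_t *b)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        sum = (uint8_t)(sum + b->data[i]);
    }
    return sum == b->checksum;
}
```

The write path touches only one element, so its cost does not grow with the block size; only `block_valid` walks the whole block. Note that if the block is already corrupted at write time, the incremental update carries the mismatch forward, so the block still fails verification later.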
These methods require the overhead of additional time. They should
only be used where appropriate. If 3 copies of the data are stored,
then a vote can be taken to choose the value to use. This allows the
program to recover very gracefully. If one of the 3 copies of the data
is corrupted, the corrupted value can be restored.
A simple macro can be written to handle the retrieving and
verification of this data. If only 2 copies are stored, then the 2
copies must match for the data to be considered valid. If not, then an
error handling routine must be called. These methods require the
overhead of additional time and additional RAM. They should only be
used where appropriate.
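The three-copy vote and the two-copy comparison might look like the following. This sketch uses small functions rather than the macro the text mentions, and the type and function names are hypothetical. In a real system the copies would ideally be placed in different RAM regions so that a single disturbance is less likely to hit more than one of them.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint16_t copy[3]; } triple_u16_t; /* hypothetical type */
typedef struct { uint16_t copy[2]; } dual_u16_t;   /* hypothetical type */

void triple_write(triple_u16_t *v, uint16_t value)
{
    v->copy[0] = value;
    v->copy[1] = value;
    v->copy[2] = value;
}

/* Majority vote. On success, stores the winning value in *out, restores
 * any corrupted copy, and returns true. Returns false if all three
 * copies disagree, in which case the caller must handle the error. */
bool triple_read(triple_u16_t *v, uint16_t *out)
{
    uint16_t a = v->copy[0];
    uint16_t b = v->copy[1];
    uint16_t c = v->copy[2];

    if (a == b || a == c) {
        *out = a;
    } else if (b == c) {
        *out = b;
    } else {
        return false;
    }
    triple_write(v, *out); /* repair the odd one out, if any */
    return true;
}

/* Two copies allow detection only: both must match to be valid. */
bool dual_read(const dual_u16_t *v, uint16_t *out)
{
    if (v->copy[0] != v->copy[1]) {
        return false; /* caller invokes its error handling routine */
    }
    *out = v->copy[0];
    return true;
}
```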
When the program does not fill the entire program memory, it is
advisable to fill the remaining program memory with:
* a software interrupt instruction, if the microcontroller has such
an instruction.
* an illegal instruction, if the microcontroller can trap illegal
instructions.
* NOPs, or some other instruction which has no cumulative net effect.
At the end of this block should be a jump to an error handling
routine. If program execution gets lost and jumps into this
block, the NOPs (or similar) are executed until the jump to the
error handler is reached.
The first two methods are preferable, if available, since the
vectoring to the error handler will occur much more quickly. This could aid
in debugging the problem.