Software techniques for comprehensive EMC testing of embedded systems
By Joseph Brotz
Embedded.com
(05/28/08, 04:38:00 PM EDT)
Almost every product designed is required to pass a suite of tests known as Electromagnetic Compatibility (EMC) tests. The suite of EMC tests usually includes some form of each of the following: electrostatic discharge (ESD), radiated emissions, conducted emissions, radiated susceptibility, and conducted susceptibility.

There are many versions of each of these types of tests. The version that applies to your product depends on the product's application and the agency approvals (UL, CSA, EC, etc.) that are required. A common misconception is that EMC issues are the domain of hardware engineers, when in reality an embedded software engineer can and must be involved!

This article covers many of the basics of how a software engineer can help with product EMC performance and testing, but it is by no means an exhaustive list. Often a little thinking outside the box provides the solution to a pesky EMC problem, so don't be afraid to try things!

Typically, there are at least four ways the embedded software engineer can aid in the process (in order of value added):

* improve performance against susceptibility tests,
* provide debug assistance for susceptibility tests,
* provide automated execution of device functionality, and
* minimize emissions.

Improving Performance against Susceptibility to EMI
First, let's address how an embedded software engineer would be involved in dealing with the device's susceptibility to Electromagnetic Interference (EMI). The hardware engineer's role is to prevent as much EMI as possible from ever reaching the product's sensitive electronics.

However, sometimes it's impractical for this to be accomplished well enough to pass all tests. Using software techniques to resolve issues with susceptibility to EMI can be quite beneficial because software can almost always remain fluid much later in a product development cycle than hardware can. When a troublesome susceptibility issue is discovered, it quite often would cause significant schedule pain to perform another round of hardware changes.

On the other hand, software is much more likely to be able to absorb the time required to implement solutions. The embedded software engineer provides the second line of defense. This role is to minimize the disruption to the product caused by EMI that has gotten past the hardware engineer's first line of defense.

Sometimes a passive filtering solution to an EMI susceptibility issue is technically not a challenge to implement, but might be problematic when a small board size is desired. Adding capacitors and ferrite beads can consume valuable board real estate! If a solution can be implemented in software, there is no negative impact on the board size, making it the preferred solution.

Mechanisms are put in place that increase how well the system can tolerate the residual EMI. The residual EMI can adversely affect the system in many ways. It could corrupt:

* microcontroller registers,
* memories,
* communications channels,
* digital I/O, and/or
* analog I/O.

Corruption of microcontroller registers can manifest itself in many different ways. If the program counter is corrupted, the program can execute code out of sequence, or execute from an unprogrammed location in the program memory.

Using watchdog timers
Probably the most time-tested method of verifying flow control of a program is through the use of a watchdog timer. A watchdog is a free-running timer that, on expiring, resets the microcontroller and/or system. The expiration of the timer is avoided by resetting the timer's count value through some simple maintenance method.

Usually this maintenance is as simple as providing a pulse on an input to an external watchdog chip or providing a specific write sequence to an on-chip watchdog register. The maintenance of the watchdog is added at key points in the program.

If the execution of the program goes haywire, the watchdog will not be properly maintained and the system will be reset. Watchdog maintenance must be properly designed, however. If the maintenance of the watchdog is placed entirely in a timer ISR, then the maintenance may occur properly even though the foreground execution is lost.

A more acceptable design technique would be to set the watchdog output high in the foreground routine, and low in a timer interrupt service routine (ISR). In this case both elements are required for proper maintenance of the watchdog. The benefit is that some noise event might disrupt execution of the device, but instead of ending up in an unknown state or possibly even trapped in an endless loop, the device resets and resumes functioning.
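
The split maintenance described above might be sketched as follows. This is only an illustration: the names `wdt_pin_write()`, `foreground_service_watchdog()`, and `timer_isr_service_watchdog()` are hypothetical, and the watchdog input is stubbed with a variable (plus an edge counter) so the fragment can be compiled on a host rather than tied to any particular chip.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the GPIO that feeds an external watchdog chip. */
static volatile bool     wdt_pin;    /* current level on the watchdog input  */
static volatile uint32_t wdt_edges;  /* rising edges the watchdog would see  */

static void wdt_pin_write(bool level)
{
    if (level && !wdt_pin) {
        wdt_edges++;                 /* only a low-to-high change is a pulse */
    }
    wdt_pin = level;
}

/* Called once per pass of the foreground (main) loop: drives the pin high. */
void foreground_service_watchdog(void)
{
    wdt_pin_write(true);
}

/* Called from the periodic timer ISR: drives the pin low. */
void timer_isr_service_watchdog(void)
{
    wdt_pin_write(false);
}

/* Exposed only so the behaviour can be observed in a host test. */
uint32_t wdt_edge_count(void)
{
    return wdt_edges;
}
```

Because a pulse needs both a low and a high write, neither the ISR alone nor a foreground loop running with interrupts dead can keep the watchdog satisfied.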

Another method of program flow verification is to use sequence checks interspersed throughout the program. A sequence variable is used to verify the program sequence. At select points, this variable is compared against an expected value. If the value is correct, it is incremented (or somehow set to its next expected value). If the value is not correct, the controller can go to an error state, or invoke a software reset of the controller.

It should be noted that this method adds overhead to the program, and is not easy to maintain, especially when changes to the program flow are required. This method is usually only advisable in safety-critical systems where execution out of sequence can have hazardous effects.
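
A minimal sketch of such a sequence check is shown here. The function names and the latched fault flag are assumptions made for illustration; a real system would more likely enter an error state or force a reset inside the (hypothetical) `seq_error()` hook.

```c
#include <stdbool.h>
#include <stdint.h>

static uint8_t seq_state;   /* next checkpoint the program expects to reach   */
static bool    seq_fault;   /* latched when a checkpoint is hit out of order  */

/* Hypothetical error hook; a real system might invoke a software reset. */
static void seq_error(void)
{
    seq_fault = true;
}

/* Call at the top of each pass through the monitored code path. */
void seq_start(void)
{
    seq_state = 0;
    seq_fault = false;
}

/* Place one call at each checkpoint, numbered in execution order. */
void seq_checkpoint(uint8_t expected)
{
    if (seq_state == expected) {
        seq_state++;        /* advance to the next expected value */
    } else {
        seq_error();
    }
}

bool seq_ok(void)
{
    return !seq_fault;
}
```

The hard-coded checkpoint numbers are exactly what makes this approach costly to maintain: inserting a step means renumbering every checkpoint after it.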

Software filtering
Input errors can be controlled by software filtering (debouncing) of inputs. There are many tried and true debouncing methods. The key is to choose one that best balances the required responsiveness to the input versus the need to tolerate glitches on the input.

For instance, if the system requirements state that you must recognize a button press within 50ms, a good filtering method might be to sample the input in a 10ms ISR and require the input to be seen at the opposite state for at least 40ms (4 consecutive samples) for the debounced state to logically change.
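
The counting scheme in this example could look something like the sketch below. The type and function names are hypothetical, and it assumes `debounce_sample()` is called from the 10ms ISR with the raw pin level.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBOUNCE_SAMPLES 4u   /* 4 consecutive samples at a 10ms tick */

typedef struct {
    bool    state;   /* debounced (logical) state                        */
    uint8_t count;   /* consecutive samples that disagree with 'state'   */
} debounce_t;

/* Call once per 10ms tick with the raw input; returns the debounced state. */
bool debounce_sample(debounce_t *d, bool raw)
{
    if (raw == d->state) {
        d->count = 0;                          /* glitch ended, start over   */
    } else if (++d->count >= DEBOUNCE_SAMPLES) {
        d->state = raw;                        /* change held long enough    */
        d->count = 0;
    }
    return d->state;
}
```

A single noisy sample resets the counter, so a glitch shorter than the qualification window never reaches the application.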

Also, some DSPs, such as TI's TMS320F28XX family, allow you to configure filtering of inputs. The GPIO can have a sampling rate and a qualification count configured. The two together form the sampling window.

For the input qualifier to detect a change in the input, the level of the signal must be stable for the duration of the sampling window width or longer. This is filtering that consumes no instruction cycles after it has been configured, so be sure to use it when you can!

Software can also be used to digitally filter an analog input signal. This is most efficiently accomplished on a digital signal processor, but it can be implemented on a standard microcontroller too. An FIR (finite impulse response) or IIR (infinite impulse response) filter can be implemented to clean up an input signal that is dirtied by EMI.

Digital filters can provide remarkable performance compared to their analog counterparts. For implementation on a standard microcontroller, there are many 'C' source implementations of these filters freely available on the internet. For implementation on a DSP, almost all DSP vendors provide optimized assembly implementations of these filters that make use of the hardware features of the DSP.
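
For readers who have not seen one, a direct-form FIR filter for a standard microcontroller can be sketched in a few lines of C. The names below are hypothetical, and the coefficients are placeholders (a 4-point moving average) chosen only to keep the example short; a real design would use coefficients produced by a filter design tool for the required pass band.

```c
#include <stddef.h>

#define FIR_TAPS 4   /* illustrative only; practical filters use many more taps */

typedef struct {
    float  history[FIR_TAPS];  /* most recent input samples (circular buffer) */
    size_t head;               /* index where the next sample will be stored  */
} fir_t;

/* Placeholder coefficients (moving average), NOT a designed band-pass filter. */
static const float fir_coeff[FIR_TAPS] = { 0.25f, 0.25f, 0.25f, 0.25f };

/* Feed one A/D sample in, get one filtered sample out. */
float fir_step(fir_t *f, float sample)
{
    float  acc = 0.0f;
    size_t idx;

    f->history[f->head] = sample;
    idx = f->head;
    for (size_t k = 0; k < FIR_TAPS; k++) {
        acc += fir_coeff[k] * f->history[idx];       /* coeff[k] * x[n-k]  */
        idx = (idx == 0) ? FIR_TAPS - 1 : idx - 1;   /* step back in time  */
    }
    f->head = (f->head + 1) % FIR_TAPS;
    return acc;
}
```

On a part without floating-point hardware, the same structure is usually written with fixed-point integers instead of `float`.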

Figure 1. ECG with 60Hz Power Line Interference

For example, consider the case of a medical device that monitors ECG through an A/D converter to detect the R-Wave in a QRS complex. The QRS complex energy is typically in the band from 10Hz to 40Hz. An FIR band pass filter in this band can be extremely effective at eliminating 50/60Hz power line noise as well as other higher frequency noise sources.

The attenuation below 10Hz is more to eliminate uninteresting ECG components, but also eliminates baseline wander and low frequency noise sources. Figure 1 above shows a representative ECG signal with power line interference. Figure 2 below shows the same signal after being filtered by a 100 tap FIR filter with a 0 - 40Hz pass band.

If a transient EMI event corrupts the value of an output port, it is possible that the effect may be minimized by a periodic refresh of the output port values. Of course, in many cases having the output corrupted for any amount of time is unacceptable, but in the cases where it is manageable (an LED output for example), this could be an acceptable solution.

Figure 2. ECG through 100 Tap FIR Filter

This solution can also be applied to configuration registers of on-chip peripherals. A periodic refresh of the configuration of those peripherals might make the temporary corruption of one of those registers tolerable. However, sometimes writing to a peripheral configuration register can cause a reset or other disruption in the operation of the peripheral, so use care when employing this method.
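
One way to arrange such a refresh is to keep a shadow copy of the intended value and rewrite the register from it periodically. The sketch below assumes a hypothetical 8-bit output port named `PORT_OUT`, represented here by a plain variable so the fragment compiles on a host; on a target it would be a memory-mapped register.

```c
#include <stdint.h>

/* Hypothetical output port; a memory-mapped register on real hardware. */
static volatile uint8_t PORT_OUT;

static uint8_t port_shadow;   /* the value the application intends to drive */

/* All application writes go through the shadow copy. */
void port_write(uint8_t value)
{
    port_shadow = value;
    PORT_OUT = value;
}

/* Call periodically (for example from a slow timer tick) to undo corruption. */
void port_refresh(void)
{
    PORT_OUT = port_shadow;
}
```

The refresh period bounds how long a corrupted output can persist, so it should be chosen from how long the application can tolerate a wrong value.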

In many cases, the designer has a choice between using a level-triggered or an edge-triggered interrupt. If it is feasible, the level-triggered interrupt should be chosen. Whereas an edge-triggered interrupt typically does not impose any minimum requirement on the interrupt event pulse width, a level-triggered interrupt will.

Typically, an interrupt controller samples the interrupt inputs at some defined frequency (once per instruction cycle, for instance). A level-triggered interrupt will require the event to be present two samples in a row before an interrupt is generated. The minimum width is the sample period. An edge-triggered interrupt will look for two consecutive samples that indicate the intended transition.

Since the transition can occur almost immediately before the next sample, there really is no minimum width imposed on the interrupt event. The net effect is that noise is more likely to trigger an edge-triggered interrupt than a level-triggered interrupt. Microcontrollers sample interrupts using different methods. Understanding how your microcontroller samples interrupts is the key to determining which type of interrupt is less sensitive to noise.

EMI in the serial channels
Another area that typically gets affected by EMI is serial communications channels. Even if a noise tolerant physical layer such as RS485 or LVDS (Low-voltage differential signaling) is used in a communications link, data can be corrupted by noise.

Software can detect these errors and provide reasonable response. Simple errors contained in a single byte may be detected through a framing or parity error. Typically a UART provides this built-in detection.

If such an error is detected, the receiving device should require a packet retransmission. Depending on the protocol, this may be accomplished by not acknowledging the packet, or sending a special error acknowledgement back. A protocol can be designed in which data includes error correcting codes (ECC). This approach provides detection and correction of a limited number of bit errors.

The disadvantage is the overhead of the additional error correcting bits, and the inability to flawlessly deal with multiple bit errors. A more robust (and highly recommended) method of detecting errors in a communications packet transmission is to include a Cyclic Redundancy Check (CRC) as part of the packet.

A two-byte CRC provides 100% coverage of bit failures occurring within the same byte and 99.998% coverage of all other bit failures. The CRC can be used to detect errors, but does not provide any means for error correction.

A mismatch of the CRC to the value calculated based on the received data should result in the receiving device again requiring a retransmission of the packet (Figure 3, below). As long as the EMI corrupts only a small percentage of the packets, and the system was designed with sufficient bandwidth to begin with, the overhead of the retransmissions for failed packets will most likely not lead to unacceptable communications throughput.

Figure 3. Communications with packet CRCs
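
As a concrete illustration of a two-byte packet CRC, the sketch below uses the common CRC-16/CCITT-FALSE parameters (polynomial 0x1021, initial value 0xFFFF). That choice, the function names, and the packet layout (payload followed by the CRC, high byte first) are assumptions for the example; the protocol in use dictates the real ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Assumed layout: payload bytes followed by the CRC, high byte first.
 * Returns nonzero when the received CRC matches the calculated one. */
int packet_crc_ok(const uint8_t *packet, size_t len)
{
    uint16_t received;

    if (len < 2) {
        return 0;
    }
    received = (uint16_t)(((uint16_t)packet[len - 2] << 8) | packet[len - 1]);
    return crc16_ccitt(packet, len - 2) == received;
}
```

When `packet_crc_ok()` returns zero, the receiver would withhold its acknowledgement or send an error acknowledgement so that the sender retransmits. A table-driven version trades a small lookup table in ROM for considerably fewer cycles per byte.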

A similar method is to include a checksum of the packet as part of the packet transmission. The checksum is easier (i.e. faster) to compute, but provides significantly less coverage of bit failures in the packet transmission.

For example, toggling a bit in one position in one byte, and toggling the same bit position of the opposite value in another byte would lead to the same checksum even though 2 bytes have been corrupted. There is much information available on these and many other communications error detection and/or correction schemes. It is highly advisable to implement the one that best matches up to the requirements of your device, and the environment that your device will be used in.
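
The blind spot described above is easy to demonstrate with a simple additive checksum. The function name is hypothetical, and an 8-bit modulo-256 sum is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Simple 8-bit additive checksum: the sum of all bytes, modulo 256. */
uint8_t checksum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;

    for (size_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + data[i]);
    }
    return sum;
}
```

For instance, the two-byte payload 0x12, 0x35 and the corrupted payload 0x13, 0x34 (bit 0 set in the first byte and cleared in the second) both sum to 0x47, so the corruption goes undetected.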

Volatile memory corruption
EMI can cause volatile data memory to become corrupted. These errors are difficult to detect, but a few methods can be employed in some cases. When only a specific range is valid for a data element, then a plausibility check of the data should occur before it is used.

Along these same lines, when a switch statement is used on a variable, a 'default' case should always be included. This provides a minimal amount of error detection, but more importantly, it prevents the program from executing code based on a data value that was not accounted for.
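
Both ideas are sketched below. The setpoint limits, the mode names, and the drive levels are invented purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical valid range for a setpoint stored in RAM. */
#define SETPOINT_MIN 10
#define SETPOINT_MAX 90

enum { MODE_IDLE = 0, MODE_RUN = 1, MODE_STOP = 2 };

/* Plausibility check: reject values outside the range the design allows. */
bool setpoint_plausible(int16_t setpoint)
{
    return setpoint >= SETPOINT_MIN && setpoint <= SETPOINT_MAX;
}

/* Returns the drive level in percent, or -1 when 'mode' is not a known mode. */
int mode_to_drive_percent(uint8_t mode)
{
    switch (mode) {
    case MODE_IDLE: return 0;
    case MODE_RUN:  return 100;
    case MODE_STOP: return 0;
    default:        return -1;   /* corrupted or unaccounted-for value */
    }
}
```

The caller treats a failed plausibility check or a -1 result as an error and falls back to a safe value rather than acting on the corrupted data.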

If the data in question changes and is accessed infrequently, then the data can be verified through the use of a CRC or checksum of the block of data. When using a checksum, a new checksum value can be generated more quickly if the old data value is subtracted out first, and then the new data value is added in.
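
The incremental checksum update can be sketched as follows, assuming a small block protected by an 8-bit modulo-256 sum. The structure and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_LEN 8u

typedef struct {
    uint8_t data[BLOCK_LEN];
    uint8_t checksum;          /* modulo-256 sum of data[] */
} guarded_block_t;

/* Full recomputation, used when verifying the block. */
uint8_t block_sum(const guarded_block_t *b)
{
    uint8_t sum = 0;

    for (size_t i = 0; i < BLOCK_LEN; i++) {
        sum = (uint8_t)(sum + b->data[i]);
    }
    return sum;
}

/* Update one element: subtract the old value out, add the new value in. */
void block_write(guarded_block_t *b, size_t index, uint8_t value)
{
    if (index >= BLOCK_LEN) {
        return;                /* ignore out-of-range writes */
    }
    b->checksum = (uint8_t)(b->checksum - b->data[index] + value);
    b->data[index] = value;
}

/* Nonzero when the stored checksum still matches the data. */
int block_valid(const guarded_block_t *b)
{
    return block_sum(b) == b->checksum;
}
```

The write costs two arithmetic operations regardless of block size, while the full sum is only paid when the data is verified.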

These methods require the overhead of additional time. They should only be used where appropriate. If 3 copies of the data are stored, then a vote can be taken to choose the value to use. This allows the program to recover very gracefully. If one of the 3 copies of the data is corrupted, the corrupted value can be restored.

A simple macro can be written to handle the retrieving and verification of this data. If only 2 copies are stored, then the 2 copies must match for the data to be considered valid. If not, then an error handling routine must be called. These methods require the overhead of additional time and additional RAM. They should only be used where appropriate.
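
A sketch of the three-copy vote is shown below. It uses small functions rather than a macro for readability, and all names are hypothetical. A read succeeds when at least two copies agree, and the odd copy is rewritten with the voted value.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t copy[3];   /* three redundant copies of one value */
} tmr_u16_t;

void tmr_write(tmr_u16_t *t, uint16_t value)
{
    t->copy[0] = value;
    t->copy[1] = value;
    t->copy[2] = value;
}

/* Returns true and stores the voted value in *out when at least two copies
 * agree, repairing any copy that disagrees. Returns false when all three
 * copies differ, in which case the caller must invoke its error handling. */
bool tmr_read(tmr_u16_t *t, uint16_t *out)
{
    uint16_t a = t->copy[0];
    uint16_t b = t->copy[1];
    uint16_t c = t->copy[2];

    if (a == b || a == c) {
        *out = a;
    } else if (b == c) {
        *out = b;
    } else {
        return false;
    }
    tmr_write(t, *out);   /* restore the corrupted copy, if any */
    return true;
}
```

Placing the three copies in different regions of RAM, where the toolchain allows it, makes it less likely that one event corrupts two of them.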

When the program does not fill the entire program memory, it is advisable to fill the remaining program memory with:

* a software interrupt instruction, if the microcontroller has such an instruction.
* an illegal instruction, if the microcontroller can trap illegal instructions.
* NOPs, or some other instruction which has no cumulative net effect.

At the end of this block should be a jump to an error handling routine. If program execution gets lost and jumps into this block, the NOPs (or similar) are executed until the jump to the error handler is reached.

The first two methods are preferable if available, since the vectoring to the error handler will occur much more quickly. This could aid in debugging the problem.

Providing Debug Assistance
When running formal EMC tests, the test setup must be non-intrusive. In most cases, it is not acceptable to connect an emulator to the unit under test, or to connect an oscilloscope probe to the unit under test. The emulator or scope could influence the EMC test results. When the applied EMI causes the unit to fail, it can be very difficult to determine in what way the unit failed.

The embedded software engineer must provide as much debug assistance as possible to reveal what the failure mechanism was. Sometimes the required debug assistance can be quite simple. A highly effective yet simple method is to provide some type of dynamic signal that indicates that the unit is "alive".

An LED, for instance, can work well for this purpose. If this dynamic signal can be changed (in frequency for instance) when the unit has entered an error state then the signal is even more effective. Usually more debug assistance is required though.

When running burst, surge, or ESD testing, real-time status of the unit can be monitored through a wireless communications link (IrDA, 802.11, Bluetooth, etc.) if one is available.

If no wireless communications link is available, or when running radiated susceptibility testing in an RF anechoic chamber, you might have no choice but to debug the unit "after the fact". Detailed non-volatile logging of pertinent events can provide valuable clues as to what happened.

If there are communications links, even on-board I2C, CAN, or SPI busses, keeping performance statistics that can be queried after the test suite has completed can also indicate a problem area that otherwise might not have been observable. Non-volatile event and statistics logging is preferred since the EMI could lead to a system reset.
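
A circular event log of the kind described here might be organized as in the sketch below. The entry layout and function names are assumptions, and the storage is an ordinary RAM array standing in for whatever non-volatile memory (EEPROM, FRAM, flash) the product actually has.

```c
#include <stdint.h>

#define LOG_ENTRIES 16u

typedef struct {
    uint32_t timestamp;   /* for example, a millisecond tick count     */
    uint8_t  event_id;    /* which event occurred                      */
    uint8_t  detail;      /* event-specific data, such as an error code */
} log_entry_t;

/* Stand-in for non-volatile storage; plain RAM so this compiles on a host. */
static log_entry_t nv_log[LOG_ENTRIES];
static uint32_t    nv_log_count;    /* total number of events ever logged */

void log_event(uint32_t timestamp, uint8_t event_id, uint8_t detail)
{
    log_entry_t *e = &nv_log[nv_log_count % LOG_ENTRIES];  /* overwrite oldest */

    e->timestamp = timestamp;
    e->event_id  = event_id;
    e->detail    = detail;
    nv_log_count++;
}

/* Fetch the i-th most recent entry (0 = newest). Returns 0 if it is not held. */
int log_recent(uint32_t i, log_entry_t *out)
{
    uint32_t stored = (nv_log_count < LOG_ENTRIES) ? nv_log_count : LOG_ENTRIES;

    if (i >= stored) {
        return 0;
    }
    *out = nv_log[(nv_log_count - 1u - i) % LOG_ENTRIES];
    return 1;
}
```

Logging resets, watchdog trips, CRC failures, and retransmission counts in this way leaves a trail that can be read back once the unit is out of the chamber.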

If non-volatile storage isn't available, it still might be possible to query statistics, device state, and other pertinent data from the device if it remains running in the failure state after the test suite completes. It may be possible to deduce the method of failure by seeing the result of the failure.

It's quite typical that informal prescreening tests are run prior to executing the formal EMC test suite at a test house. Use this opportunity to find the suspect areas and determine what data needs to be accessible during or after the test. It might reveal previously unforeseen hot spots that require some creative means of providing the clues to the source of the problem.

Providing Automated Execution of Device Functionality
Quite often, the device under test requires user intervention or other stimulus to cause it to execute all of the major blocks of its functionality that might be influenced by EMI. Especially for radiated susceptibility testing in an RF anechoic chamber, it is not always possible to provide these stimuli.

In these situations, one solution is to create a special EMC test build that automatically sequences the unit under test through these major blocks of functionality without the normal stimuli. In this situation especially, the ability to log or visually indicate the progression through functional blocks is critical. If the device fails, you need clues as to what it was doing when the failure occurred.
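
One simple structure for such a test build is a table of functions that the firmware steps through continuously. In the sketch below the three functional blocks are hypothetical stubs that only count their calls; in a real build each would exercise one area of the product. The step number is recorded before the block runs, so a hang or reset leaves a clue as to which block was active.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*emc_step_fn)(void);

/* Hypothetical functional blocks, stubbed as call counters. */
static uint8_t steps_run[3];
static void step_comms(void)   { steps_run[0]++; }
static void step_analog(void)  { steps_run[1]++; }
static void step_outputs(void) { steps_run[2]++; }

static const emc_step_fn emc_steps[] = { step_comms, step_analog, step_outputs };
#define EMC_STEP_COUNT (sizeof emc_steps / sizeof emc_steps[0])

static uint8_t          emc_next;          /* index of the next block to run   */
static volatile uint8_t emc_active_step;   /* could be logged or shown on LEDs */

/* Call periodically in the EMC test build; runs one block per call, forever. */
void emc_sequencer_tick(void)
{
    emc_active_step = emc_next;            /* record BEFORE running the block  */
    emc_steps[emc_next]();
    emc_next = (uint8_t)((emc_next + 1u) % EMC_STEP_COUNT);
}
```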

Figure 4. Spectrum of MCU clock

Minimizing Emissions
Lastly, let's address how the embedded software engineer can influence minimizing the emissions of the product. In some of these techniques, we truly are minimizing the emissions, but in most cases we are further spreading the emissions over the frequency spectrum so that the average peak energy at any specific frequency is minimized.

Let's consider a system that has a switching power supply. Despite the hardware engineer's best efforts, the emissions at its switching frequency might exceed the standard's limits (Figure 4, above). Many switching power supply control ICs can be driven by an external clock.

You can drive the clock input of the switching power supply control IC with a clock output of the microcontroller. At the frequencies that most switching power supplies operate at, the emissions are measured in 9kHz bands.

If you implement a frequency hopping scheme for the clock output frequency that spans some multiple of the emissions measurement bandwidth (for instance 3X or 27kHz), the amplitude of that very strong peak is spread to lower amplitudes in the wider band (Figure 5, below).

Figure 5. Spectrum after clock dithering

Most switching power supply control ICs with clock inputs can easily withstand that frequency variance. This is commonly known as spread spectrum clocking (SSC), or clock dithering. This scheme lowers the average value of the peaks of the currents even though the total amount of energy in the waveforms is the same as before.

Software is required to implement this spread spectrum clocking scheme. If your microcontroller has an internal DMA feature, it is possible that this scheme can be implemented with little or no impact on processing bandwidth.

Just configure a DMA channel to continually loop through clock frequency register reload values stored in a constant array. Note also that after reset, the switching power supply control IC can run off its own internal oscillator until the software configures the switching of the micro's output clock.
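
The array of reload values could be generated along the lines of the sketch below. It assumes a timer whose output frequency equals the timer clock divided by the reload value, which will not match every timer (some toggle on compare or count to reload plus one), so treat the arithmetic as a starting point. The function name, step count, and the frequencies in the usage note are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

#define SSC_STEPS 8u   /* number of distinct frequencies in the hop table */

/* Fill 'table' with timer reload values that step the output clock linearly
 * from (center_hz - span_hz/2) up to (center_hz + span_hz/2).
 * ASSUMPTION: output frequency = timer_hz / reload value. */
void ssc_build_table(uint16_t table[SSC_STEPS], uint32_t timer_hz,
                     uint32_t center_hz, uint32_t span_hz)
{
    uint32_t f_low = center_hz - span_hz / 2u;

    for (uint32_t i = 0; i < SSC_STEPS; i++) {
        uint32_t f = f_low + (span_hz * i) / (SSC_STEPS - 1u);

        table[i] = (uint16_t)((timer_hz + f / 2u) / f);   /* rounded divide */
    }
}
```

With a 60MHz timer clock, a 300kHz nominal switching frequency, and a 27kHz span, the table runs from a reload value of 209 (about 286.5kHz) to 191 (about 313.5kHz). In a product the table would normally be computed offline and stored as a constant array for the DMA channel to loop through.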

The spread spectrum clocking can also be extended to synchronous communications busses where your microcontroller is the master of the bus. Consider an SPI bus (a TI McBSP for instance) that can be configured to use an external signal to drive the baud clock.

A microcontroller timer output can be connected as the drive source for the baud clock. This timer output can then be configured for a spread spectrum clocking scheme. As long as the frequency variance on the baud clock stays within the frequency range specification, and doesn't cause setup or hold time issues, data communications should occur without issue, and the peak emissions will be reduced.

Some microcontrollers that provide parallel address/data busses for access to external components provide programmability of the edge rates of the control signals. In these cases, using the slowest required edge rate will minimize the emissions.

When using a serial communications link to transfer data between ICs, or especially if off board, care should be taken to minimize the rate of communications packets. Limit link activity to the least that is required. During the formal EMC emissions testing, the emissions are monitored on an average basis. By limiting the link activity, the average is lowered. I've personally seen this be the difference between passing and failing a test!

Another way to minimize emissions is to disable any microcontroller internal or external clocks that are not being used. Many times internal peripherals will have individual clock enables.

If the peripheral is not used, disable the clock! It will minimize emissions and reduce power consumption too. Many microcontrollers have external clocked outputs that can be disabled too. The ALE output of the 8051 family is a good example. Another example is a CLKOUT output of the system clock. If you're not using it, disable it.

Another thing to do is architect your code to utilize blocking tasks (if using an OS) or to utilize interrupt wakeups (for non-OS) to run tasks rather than just having idle loops spinning in the background.

This architecture enables you to better utilize low power processor modes during the idle time between the bursts of activity. This may allow you to shut down a noisy high speed crystal more often, or even allow you to reduce your clock frequency because these designs make better use of processing bandwidth. Usually low power and low emissions go hand in hand.

Joe Brotz is a Senior Design Engineer with Plexus Technology Group in Neenah, Wisconsin. He has over 20 years of experience in embedded software design on products ranging from safety light curtains to medical devices. Joe has a BS in Electrical and Computer Engineering from Marquette University. He can be reached at joe.brotz@plexus.com.