Design considerations for harsh environment embedded systems

Designing embedded systems that operate reliably in harsh environments presents a unique set of challenges. Consumer electronics products such as cellphones can withstand drops and knocks, and some models can now be submerged in water and still operate. Impressive indeed, but still child’s play compared to operating 5 km down an oil well or orbiting Earth in a satellite.

Embedded systems that operate in down-hole drilling, jet engine controls, space and other high radiation environments are all subjected to conditions that are not forgiving to commercial electronics. For these applications, and others with similarly challenging conditions, special attention must be given to system design, circuit design and component selection. The embedded system will be exposed to temperature extremes, mechanical shocks and electrical shocks, and will be bombarded with ionizing radiation that can latch up any standard CMOS chip on the market today.

All CMOS chips go through a ‘Qualification’ process before they are released to the market. An industry standard qualification will ensure that the device can operate across a reasonable temperature range, withstand ESD events and a certain level of humidity, and still operate for a reasonable expected lifetime. JEDEC publishes standards that provide a consistent level of stress testing throughout the industry. The Automotive Electronics Council requires a more stringent level of qualification for devices that need to demonstrate higher reliability and operation in harsher environments (typically 125˚C for automotive parts, compared to consumer grade parts that would be expected to operate up to only 85˚C).

Harsh environment embedded systems require components that will operate well above even the automotive specifications. A major headache for designers is that there is a very limited pool of products to choose from when selecting an IC that must operate at high temperature or must withstand a high total ionizing dose of radiation. Look for yourself – how many ARM Cortex-M microcontrollers can you find that are specified to operate at 200˚C?

Designers get around this problem by either ‘up-screening’ a COTS (commercial off-the-shelf) component or choosing a device that has been designed-for-purpose. There are obvious concerns with up-screened COTS in that the device will be operating outside the specification that it was designed for and is not recommended or guaranteed by the original manufacturer for use. A ‘designed-for-purpose’ device is a safer option, but the pool of components to select from is limited. Often these components are developed using a more exotic process (such as silicon-on-insulator) so will likely be more expensive. Note that even ‘commercial grade’ products that have been up-screened will be expensive – up to one hundred times the price of a comparable 85˚C part.

Electrical issues arising in harsh electronics environments
Harsh environments throw up a multitude of electrical issues for embedded systems designers. ESD, voltage transients, high temperature induced carrier-creation and ionizing particle strikes can all be expected. All of these electrical issues can cause a CMOS device to malfunction or be destroyed by latch-up.

Latch-up is a phenomenon that is widespread across the industry because every bulk CMOS device contains millions of parasitic transistors that arise from the structure of the complementary metal oxide semiconductor device itself. A pair of parasitic bipolar junction transistors (BJTs) is shown superimposed on the CMOS device cross section in Figure 1. The BJTs can be switched on by a voltage transient, a high temperature carrier effect or a particle strike (expect this in space or high radiation environments). When one of the transistors becomes forward biased and switches on, it will drive the other transistor, creating a low impedance path between Vdd and Vss. Latch-up draws excess current, disrupts circuit behavior and can destroy the device quickly unless the condition is reset (power cycling is required to reset a latch-up condition).

Under normal circumstances, when the device is operating within specification, latch-up should not occur. In a harsh electrical, radiation or high temperature environment, latch-up will certainly occur unless precautions are taken by the IC designer.

Figure 1 – Parasitic BJTs that exist in CMOS devices. (Source: Author)
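Because a latch-up condition can only be cleared by power cycling, some systems add a supervisor that watches supply current and cycles power when it stays abnormally high. The following is a minimal sketch of that idea in C, not a description of any particular product. The type latchup_monitor, its functions, and all thresholds are hypothetical names and values chosen for illustration, and reading the current sensor and driving the power switch are left to the caller.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical supply-current supervisor. All limits are assumptions
 * that would have to come from the real device and power supply. */
typedef struct {
    uint32_t limit_ma;     /* current above this is treated as suspicious */
    uint8_t  trip_samples; /* consecutive over-limit samples before tripping */
    uint8_t  off_samples;  /* sample periods to hold the supply off */
    uint8_t  over_count;   /* consecutive over-limit samples seen so far */
    uint8_t  off_count;    /* sample periods spent with power removed */
    bool     power_on;     /* requested state of the power switch */
} latchup_monitor;

void latchup_monitor_init(latchup_monitor *m, uint32_t limit_ma,
                          uint8_t trip_samples, uint8_t off_samples)
{
    m->limit_ma = limit_ma;
    m->trip_samples = trip_samples;
    m->off_samples = off_samples;
    m->over_count = 0;
    m->off_count = 0;
    m->power_on = true;
}

/* Call once per sample period with the measured supply current.
 * Returns the state the power switch should be driven to. */
bool latchup_monitor_step(latchup_monitor *m, uint32_t current_ma)
{
    if (!m->power_on) {
        /* Hold power off long enough for the parasitic path to reset. */
        if (++m->off_count >= m->off_samples) {
            m->power_on = true;
            m->off_count = 0;
            m->over_count = 0;
        }
        return m->power_on;
    }

    if (current_ma > m->limit_ma) {
        /* Require several consecutive samples so that a short,
         * legitimate current spike does not trigger a power cycle. */
        if (++m->over_count >= m->trip_samples) {
            m->power_on = false;
            m->off_count = 0;
        }
    } else {
        m->over_count = 0;
    }
    return m->power_on;
}
```

A supervisor like this only limits the damage after latch-up has started. It does not replace the process-level and design-level precautions taken by the IC designer.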


There are a few different causes of ESD (electrostatic discharge). They are well understood, and all result in a flow of electricity between charged objects. There are also well-developed testing models and precautions. ESD remains a major consideration for embedded systems designers on account of the seriousness of its effects, which often include permanent damage to components. At a system level, diodes can sometimes be used to protect chips from ESD and off-chip transients by shunting some of the energy. Care must be taken to select the fastest possible clamping diodes (which can be relatively slow compared to an ESD event, which typically occurs in 5-50 ns). Clamping diodes will not help, however, with high-temperature carrier creation and ionizing particle strikes. To immunize against latch-up, either a more exotic process technology such as silicon-on-insulator or a standard CMOS chip that has been hardened with a buried guard ring (to prevent the parasitic transistors from becoming forward biased) is recommended.

Another measure that can reduce the risk from ionizing particle strikes and electromagnetic waves is radiation shielding. There are many different types of particle radiation and electromagnetic radiation that can cause single event upsets (SEU). Each type of radiation has unique characteristics, so the type of shielding that is most effective depends on the energy level or wavelength of the radiation in question. Cosmic rays are a problem at high altitude and have been suspected of being the root cause of electronics system failures in commercial aircraft. X-rays are commonly shielded by lead screens. All shielding unfortunately has the effect of increasing the size and weight of the system, which usually need to be minimized.

Shielding is a ‘fault avoidance’ technique. Both fault avoidance and ‘fault tolerance’ are often implemented at both a system level and a component level. An example of a fault tolerant circuit on an electronic component is an Error Detection and Correction (EDAC) circuit on a memory device. This type of circuit will use a parity scheme or other error detection coding technique to ensure that a word that is read from memory is consistent with the word that was originally written to that location. This approach is common on large memory arrays that can be subject to single event upsets that result in ‘flipped’ (incorrect) bits. If a flipped bit is detected, the original correct state of the bit will be re-written into the appropriate location. EDAC systems are often limited to single-bit correction and double-bit detection. The overhead for this type of scheme is significant (it takes five parity bits per byte to detect two errors and correct one of them). Multiple errors from a single event upset are extremely difficult to correct. For this reason, IC designers for devices that are intended to be used in harsh environments will physically distribute memory cells on a die, such that logically adjacent bits are spaced out and are less likely to be hit by the same upset event.
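To make the five-parity-bits-per-byte figure concrete, here is a minimal software sketch of one common scheme that fits it: an extended Hamming code with four check bits plus one overall parity bit protecting 8 data bits. It is an illustration only. Real EDAC is implemented in hardware, and the names edac_encode, edac_decode and edac_status are hypothetical, as is the bit layout.

```c
#include <stdint.h>

/* Outcome of checking one stored word. */
typedef enum { EDAC_OK, EDAC_CORRECTED, EDAC_UNCORRECTABLE } edac_status;

/* Returns 1 if an odd number of bits are set in w, else 0. */
static unsigned parity16(uint16_t w)
{
    unsigned p = 0;
    while (w) {
        p ^= (w & 1u);
        w >>= 1;
    }
    return p;
}

/* Encode 8 data bits into a 13-bit codeword (assumed layout):
 * bits 1..12 form a Hamming code with check bits at positions 1, 2, 4, 8;
 * bit 0 is an overall parity bit over the rest of the word. */
uint16_t edac_encode(uint8_t data)
{
    uint16_t cw = 0;
    unsigned d = 0;

    /* Data bits go in the positions that are not powers of two. */
    for (unsigned pos = 1; pos <= 12; pos++) {
        if ((pos & (pos - 1)) != 0) {
            if ((data >> d) & 1u)
                cw |= (uint16_t)(1u << pos);
            d++;
        }
    }
    /* Check bit c gives even parity over every position with bit c set. */
    for (unsigned c = 1; c <= 8; c <<= 1) {
        unsigned p = 0;
        for (unsigned pos = 1; pos <= 12; pos++)
            if ((pos & c) && ((cw >> pos) & 1u))
                p ^= 1u;
        if (p)
            cw |= (uint16_t)(1u << c);
    }
    /* Overall parity makes the full 13-bit word even. */
    if (parity16(cw))
        cw |= 1u;
    return cw;
}

/* Check a codeword, correct a single flipped bit, and extract the data.
 * *data is only written when the result is not EDAC_UNCORRECTABLE. */
edac_status edac_decode(uint16_t cw, uint8_t *data)
{
    unsigned syndrome = 0;
    edac_status st = EDAC_OK;

    /* XOR of the positions of all set bits is 0 for a valid word. */
    for (unsigned pos = 1; pos <= 12; pos++)
        if ((cw >> pos) & 1u)
            syndrome ^= pos;
    unsigned overall = parity16((uint16_t)(cw & 0x1FFFu));

    if (syndrome != 0 && overall == 0)
        return EDAC_UNCORRECTABLE;      /* two bits flipped */
    if (syndrome > 12)
        return EDAC_UNCORRECTABLE;      /* syndrome points outside the word */
    if (overall == 1) {
        /* One bit flipped; syndrome 0 means the overall parity bit itself. */
        cw ^= (uint16_t)(1u << syndrome);
        st = EDAC_CORRECTED;
    }

    uint8_t out = 0;
    unsigned d = 0;
    for (unsigned pos = 1; pos <= 12; pos++) {
        if ((pos & (pos - 1)) != 0) {
            if ((cw >> pos) & 1u)
                out |= (uint8_t)(1u << d);
            d++;
        }
    }
    *data = out;
    return st;
}
```

When edac_decode reports EDAC_CORRECTED, the caller would re-encode the returned byte and write it back to the same location, which corresponds to the re-write step described above.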

Fault-tolerance is an added layer of sophistication that is intended to ensure predictable system functionality in situations where faults cannot be avoided. A typical fault tolerance technique at the electronic component level is redundancy. A simple example of this is to replace every critical component with three identical units. An action will only be taken if at least two of the units ‘agree’. This allows safe operation of the system in the event that one of the units develops a fault. Redundancy can be implemented with multiple components and also at device level with redundant circuits on the chip. ‘DICE’ (Dual Interlocked storage Cell) latches can be used to immunize against single event upsets, as can triple modular redundant (TMR) circuits, which triplicate a circuit and add a voting block.
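The two-out-of-three rule can be sketched in a few lines of C. This is an illustrative software model only, with hypothetical function names; on a chip the voter is a logic block, not code. tmr_vote_word treats each unit's output as a whole value, while tmr_vote_bits votes each bit position independently, which is closer to how a hardware voter on a bus behaves.

```c
#include <stdint.h>
#include <stdbool.h>

/* Word-level vote: succeeds only if at least two of the three units
 * report the same value. On success the agreed value is stored in *out. */
bool tmr_vote_word(uint32_t a, uint32_t b, uint32_t c, uint32_t *out)
{
    if (a == b || a == c) {
        *out = a;
        return true;
    }
    if (b == c) {
        *out = b;
        return true;
    }
    return false; /* all three disagree: no safe action can be taken */
}

/* Bit-level vote: each output bit is the majority of the three input
 * bits in that position, so one upset per bit position is masked. */
uint32_t tmr_vote_bits(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}
```

For example, tmr_vote_word(7, 7, 9, &v) succeeds with v set to 7, while three mutually different inputs make it return false so the caller can fall back to a fail-safe action.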

In addition to fault-tolerance and fault-avoidance techniques, a fail-safe mechanism is good practice. Harsh environment electronics systems are typically expensive and downtime should be minimized. The mechanical system itself is expensive, as is the lost production output. For one-off systems like satellites, failure is not really an option when considering the enormous cost of developing and launching the mission. Often, designers' first questions about components concern how long the part will survive under extreme conditions and what is likely to fail first. In the event that even a redundant system fails, an FMEA (Failure Mode and Effects Analysis) should detail how the system behaves under all reasonable catastrophic conditions. This also extends to the mechanical design. A good example is an anti-lock brake system: when its electronic control unit fails, the hydraulic valve design ensures that the brakes are not automatically applied.

Embedded systems for harsh environments deploy both fault avoidance and fault tolerance techniques widely. State-of-the-art ICs include both EDAC and TMR circuits on chip. An example of a radiation hardened microcontroller that has both EDAC and TMR is shown in Figure 2. In addition to the TMR and EDAC, the design is implemented using a memory compiler (that builds the memory array on the chip) that ensures that logically adjacent bits are not physically adjacent. This reduces the risk that a single event upset flips more bits in one word than the EDAC can detect. Metal migration is also a concern in high temperature environments. It is mitigated by conservatively sizing traces to limit current density. This approach alone can add years to the life of a product.

Figure 2 – Block diagram of microcontroller with Error Detection and Correction sub-system and Triple Modular Redundancy for extreme radiation environments. (Source: Author)
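The idea of keeping logically adjacent bits physically apart can be illustrated with a simple column-interleaving map. The sketch below is an assumption made for illustration and is not the layout used by any particular memory compiler: it supposes that each physical row holds WORDS_PER_ROW codewords of BITS_PER_WORD bits, and it places bit b of word w in column b * WORDS_PER_ROW + w. All names and sizes are hypothetical.

```c
/* Assumed geometry: 8 codewords per physical row, 13 bits per codeword
 * (for example 8 data bits plus 5 check bits). */
enum { WORDS_PER_ROW = 8, BITS_PER_WORD = 13 };

/* Physical column of a logical bit. Consecutive bits of the same word
 * end up WORDS_PER_ROW columns apart. */
unsigned physical_column(unsigned word, unsigned bit)
{
    return bit * WORDS_PER_ROW + word;
}

/* Number of bits of `word` that lie inside a strike covering the
 * physical columns first .. first + width - 1. */
unsigned bits_hit(unsigned word, unsigned first, unsigned width)
{
    unsigned n = 0;
    for (unsigned bit = 0; bit < BITS_PER_WORD; bit++) {
        unsigned col = physical_column(word, bit);
        if (col >= first && col < first + width)
            n++;
    }
    return n;
}
```

With this mapping, a strike that upsets up to WORDS_PER_ROW adjacent columns touches at most one bit of any given word, so each affected word still sees only a single-bit error. A wider strike can hit two bits of the same word, which is why the spacing has to be chosen against the expected upset size.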

The safest approach to developing an embedded system that can operate reliably in a harsh environment is to select components that are specified to operate within the expected conditions of the environment, then protect the system with fault-avoidance and fault-tolerance. The biggest challenge to this approach is usually the limited choice of components that are specified for extreme conditions. The higher the required temperature range, the narrower the choice of available components.

PCB design also has a significant role. In order to distribute heat, high-dissipation components need to be thermally decoupled from more temperature sensitive components. In the same way, sensitive PCB traces should be physically located away from ‘noisy’ ones such as clock signals. Shielding traces (or ‘guard’ traces) should be employed next to noisy signals. This is as simple as a grounded trace running parallel to the noisy signal trace, although it can be made more effective by making the trace thicker, so that it is more difficult for an electromagnetic field to couple across it. Thick traces are preferred.

High temperature PCBs are manufactured using ceramic material rather than the laminate materials used for commercial products (because extreme heat can cause delamination). Glass reinforced polyimide material has been used effectively on high temperature PCBs but at temperatures exceeding 200˚C, you can expect that the PCB traces will lift. Referring to Figure 3, note the discoloration of the test PCB on the right hand side that occurs early on when a polyimide PCB is exposed to high temperatures in a test oven. The solder mask has already begun to disintegrate after only a few hundred hours.

Figure 3 – Early high temperature effects show discoloration and solder mask disintegration on a polyimide-based PCB. (Source: Author)

Another common PCB design technique is to implement ground planes in multi-layer PCBs, which helps isolate noise and improve SNR (signal-to-noise ratio) but adds cost, as the PCB is more expensive.

Designers of embedded systems for harsh environments will avoid using fans for cooling if possible, due to reliability concerns. The preferred approach is to use components that are specified for high temperature operation, along with heat sinks with an appropriately large surface area, thermal vias and materials with good thermal conductivity.
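Whether passive cooling is sufficient can be checked with the standard steady-state thermal resistance relation, Tj = Ta + P x theta_JA, where theta_JA is the junction-to-ambient thermal resistance taken from the component datasheet. The helper functions below are a minimal sketch of that arithmetic. The function names are hypothetical, and real designs also need margin for transients and for the way theta_JA varies with board layout.

```c
/* Steady-state junction temperature in degrees C.
 * theta_ja_c_per_w is the junction-to-ambient thermal resistance
 * in degrees C per watt (a datasheet value, assumed here). */
double junction_temp_c(double ambient_c, double power_w,
                       double theta_ja_c_per_w)
{
    return ambient_c + power_w * theta_ja_c_per_w;
}

/* Largest power dissipation that keeps the junction at or below
 * tj_max_c. Returns 0 if the ambient already exceeds the limit. */
double max_power_w(double ambient_c, double tj_max_c,
                   double theta_ja_c_per_w)
{
    double headroom_c = tj_max_c - ambient_c;
    return headroom_c > 0.0 ? headroom_c / theta_ja_c_per_w : 0.0;
}
```

As an illustration with made-up numbers, a part dissipating 0.5 W with a theta_JA of 40 degrees C per watt in a 175˚C ambient would sit at a junction temperature of 195˚C, which leaves very little margin for a 200˚C rated device.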

Mechanical issues arising in harsh electronics environments
Although mechanical problems are most obviously associated with mechanical shock and vibration, major mechanical issues in fact arise as a result of thermal cycling. Matching the thermal expansion coefficients of the materials in the system and in the ICs is very important.

A high temperature plastic package is shown in Figure 4. It is now possible to use specialized plastic packages up to 200˚C. Ceramic packages are typically used for high temperatures and die on a ceramic substrate is preferred for extreme temperatures (over 250˚C).

An IC package contains many different materials and layers in the package, substrate and encapsulation materials that are connected to each other. Each material has a different thermal expansion coefficient. As temperature changes, these materials expand and contract at different rates and the interconnections between them are stressed. There are also interactions that occur between different materials at high temperature.

At high temperature, any moisture that is present in the package will vaporize and can create ‘pop-corning’, so called on account of the physical resemblance. Similarly, high temperatures can cause package decomposition, sometimes called ‘outgassing’. Both effects will result in a loss of package integrity and loss of seal, sometimes also cracking the die. Because of the difference in thermal expansion coefficients of the materials within the package, bond wire can be lifted and other delamination effects (separation of different layers in the package structure, for example with the die attach epoxy) can occur.

Ceramic packages are typically hermetic and are less problematic at higher temperatures.

Figure 4 – High temperature plastic LQFP package construction. (Source: Author)

Note also that the conductive materials in the package are complex. Copper bond wire is coated with palladium. The copper leadframe is coated with silver inside and matte tin outside. The reason for the complexity is to make the connections as compatible as possible: matched for thermal expansion, with good bond strength and without inter-diffusion. When gold is inter-diffused (at high temperature) with aluminium, ‘Kirkendall voids’ can occur that form a diffusion front, reducing the strength of the mechanical join as well as the electrical connection.

One of the functions of a package is to provide a way to channel heat away from the silicon die. It is always possible to identify components that generate a lot of heat as they will likely have large finned heat sinks attached to them with a layer of high thermal conductivity material to help effectively couple the heat. In fact, what we commonly refer to as a heat sink is really a heat-spreader. A true heat sink is a device or substance such as ice water that will literally ‘sink’ heat without its temperature rising (until the ice melts — then the temperature of the water will rise). Wider PCB traces will also help to conduct heat out of the package.

There have been great achievements in developing electronic systems that operate in harsh environments. The electronic systems that have been operating reliably in space, downhole drilling and even automotive under-hood applications have proved this. As with all engineering problems, there are lots of trade-offs to be considered. Using radiation-mitigated architectures that include advanced technologies such as buried guard rings, TMR and EDAC has proven to be effective at a chip level, and attention to packaging technology (or using bare die for high temperature systems) has also led to more reliable systems.

The limiting factor that must always be considered is budget. It is generally true that you get what you pay for. The more money that is available to invest in components that were developed and tested to operate in extreme environments, the more likely the system is to operate as expected. There are always good practices that can be employed when designing systems for high reliability in harsh conditions but there is no substitute for using the right tools for the job – that means ensuring that each device is operating within specification. Experience with developing high temperature test boards and running them in an oven at extended temperature for thousands of hours has repeatedly shown that the components that fail and bring the system down are the commercial grade parts that are being used beyond their spec.

One last piece of advice is to pay careful attention to suppliers’ recommendations and guidelines for component use. Even parts that are specified for high temperature use (for example a capacitor that is rated for operation at 250˚C) will fail if the guidelines for preheating prior to soldering the device are not followed. Often the bulk of system failures are due to not paying attention to details like correct installation.
