Pushing performance limitations in microcontrollers - Embedded.com


This “Product How-To” article focuses on how to use a certain product in an embedded system and is written by a company representative.

Today's microcontrollers need to perform a wide range of tasks, including managing real-time control algorithms, decoding high-speed communications protocols, and processing signals from high-frequency sensors. Polling methods, such as repeatedly checking interface ports to see if new data has arrived, consume too many CPU cycles and often have too high a maximum response time to reliably service I/O and peripherals. For most embedded applications, developers therefore rely upon interrupts to meet their real-time requirements for managing peripherals.

Interrupts, however, can only determine when a real-time event has occurred. Developers must still directly involve the CPU to read I/O and peripherals before data is lost. Handling an interrupt requires potentially interrupting other latency-sensitive tasks, incurs context switching overhead, and introduces a wide range of esoteric challenges such as managing latency when multiple interrupts occur concurrently, all of which reduce predictability and processor efficiency.

To handle the high data rates and frequencies of real-time I/O and peripherals, microcontrollers must achieve higher processing efficiency. This efficiency, however, needs to be founded not on increased clock frequency (which comes at the expense of higher power consumption) but on internal changes in microcontroller architectures. Specifically, microcontrollers have begun to integrate coprocessors which offload specific task blocks, multi-channel DMA controllers for facilitating penalty-free memory access, and integrated event systems which route signals between internal subsystems to offload I/O and peripheral management.

More than one way to offload a CPU

Integrated coprocessors have become widespread across a wide range of embedded microcontrollers. Among the more commonly recognized coprocessors are encryption and TCP/IP offload engines. Effectively, coprocessors offload entire tasks or assist in the more compute-intensive portions of complex algorithms.

For example, an encryption engine reduces AES computations on the CPU from thousands of cycles to hundreds of cycles per operation, while a TCP/IP offload engine makes it possible to terminate an Ethernet connection with little CPU overhead. In addition, offload engines simplify implementation of these tasks, eliminating the need to write extensive driver code by exposing this advanced functionality through simple APIs.

DMA and event system technologies are less familiar to developers and are often not used for this reason. DMA controllers offload management of data movement from the CPU by performing data accesses, such as moving data from peripheral registers to internal or external SRAM, in the background. For example, a developer can configure the DMA controller to preload a block of data into on-chip RAM so that it is available for fast access before the CPU needs it, thus eliminating wait states and dependency delays. Alternatively, a DMA controller can assume most of the burden of managing communication peripherals (see Table 1).

Table 1: DMA Controllers can assume most of the burden of managing communication peripherals.

The savings in cycles from using a DMA controller can be significant: many embedded developers have found themselves unable to fit an application within the resource limits of a microcontroller, only to have the manufacturer introduce them to DMA and suddenly find themselves with extra cycles available, sometimes on the order of 30 to 50 per cent across the system.
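A back-of-the-envelope model shows where savings of that magnitude come from. The sketch below is plain host-side C, and the 50-cycle interrupt service cost, 11,520 byte/s UART stream, and 64-byte DMA block size are all illustrative assumptions rather than figures for any specific part; it compares the cycles burned per second by per-byte interrupts against one interrupt per completed DMA block:

```c
#include <assert.h>
#include <stdint.h>

/* Cycles spent per second servicing a byte-oriented peripheral with a
 * per-byte interrupt: one service routine per byte. */
static uint32_t isr_cycles_per_s(uint32_t bytes_per_s, uint32_t cycles_per_isr)
{
    return bytes_per_s * cycles_per_isr;
}

/* With DMA, the CPU is interrupted only once per transferred block. */
static uint32_t dma_cycles_per_s(uint32_t bytes_per_s, uint32_t block_size,
                                 uint32_t cycles_per_isr)
{
    return (bytes_per_s / block_size) * cycles_per_isr;
}
```

At these assumed rates, per-byte interrupts cost 576,000 cycles/s against 9,000 cycles/s with DMA, a roughly 98 per cent reduction for this one peripheral.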

It is only when they face a processing wall that many developers first discover the untapped potential that has been available to them from the start.

Even fewer developers are familiar with event systems, which work in conjunction with DMA controllers to further offload CPU cycles as well as reduce overall power consumption. An event system is a bus that can connect internal signals from one peripheral on a microcontroller to another. When an event occurs at a peripheral, it can trigger action in other peripherals without involving the CPU and with a two-cycle latency, much the way the human body handles reflexes, pulling a hand out of a fire without first consulting the brain.

Specifically, an event system routes signals throughout the microcontroller using a dedicated network connecting the CPU, data bus, peripherals, and DMA controller. Normally peripherals must interrupt the CPU to initiate any action, including reading the peripheral. By routing events directly between peripherals, the event system in effect offloads these interrupts from the CPU. Developers have the flexibility to configure peripherals to follow different event channels, thus defining the particular event routing required to meet the specific needs of the application.

The combination of DMA working with the event system enables developers to offload entire tasks, much like a coprocessor does. One key difference is that coprocessors are not programmable: they implement a well-defined task in hardware and at best are configurable. The programmability of a DMA controller and event system makes them appropriate for a variety of tasks, from the simplest to the very complex. When DMA is used with an event system, the DMA manages the transfer of data throughout the microcontroller's architecture while the event system controls when these transfers occur, with low latency and a high degree of accuracy. Put another way, the event system makes sure the values managed by the DMA are sampled or output at the right time and frequency.

Figure 1 shows a block diagram of how an event system and DMA work together. The ADC is connected to a sensor and will collect samples. An internal counter is set to match the sampling frequency, providing regular and accurate intervals. Rather than interrupt the CPU to sample the ADC, the event system directly initiates sampling of the ADC. This results in the sampling frequency being extremely accurate relative to the microcontroller's clock. When the ADC settles and the conversion is completed, the ADC then triggers the DMA to store the value through the event system.

Fig 1: A DMA controller and event system work together to offload peripheral processing from the CPU. An internal counter sets the sampling frequency, providing regular and accurate intervals; the CPU is interrupted only once there is a full buffer of data for it to process.

Event management can be extended to include multiple events, connecting several peripherals to create more complex configurations. For example, an input signal (event 1) could trigger an ADC to sample (event 2) and save the value to DMA (event 3) until the DMA buffer is full (event 4). In this configuration, the CPU is only interrupted once there is a full buffer of data for it to process.
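This event chain can be modeled on a host machine. The sketch below is illustrative C, not vendor driver code; the buffer length and fake ADC values are assumptions. It counts how often the CPU would actually be interrupted when the timer-to-ADC-to-DMA chain runs on its own:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Host-side model of the event chain in Figure 1: a timer event triggers an
 * ADC conversion, the completed conversion triggers a DMA store, and the CPU
 * is interrupted only when the DMA buffer fills. */
#define BUF_LEN 64

typedef struct {
    uint16_t buf[BUF_LEN];
    size_t   idx;
    uint32_t cpu_interrupts;   /* how often the CPU actually runs */
} dma_model;

static void timer_event(dma_model *d, uint16_t adc_sample)
{
    d->buf[d->idx++] = adc_sample;     /* DMA stores conversion result */
    if (d->idx == BUF_LEN) {           /* buffer full: the only CPU event */
        d->idx = 0;
        d->cpu_interrupts++;
    }
}

/* Feed n_samples fake 10-bit conversions through the chain and report how
 * many times the CPU was interrupted. */
static uint32_t interrupts_for(uint32_t n_samples)
{
    dma_model d = { {0}, 0, 0 };
    for (uint32_t i = 0; i < n_samples; i++)
        timer_event(&d, (uint16_t)(i & 0x3FF));
    return d.cpu_interrupts;
}
```

With a 64-entry buffer, 6,400 samples cost the CPU only 100 interrupts instead of 6,400.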

Both DMA controllers and event systems also support multiple channels. This allows developers to configure an interconnected fabric that operates in parallel to the main CPU. As a result, multiple concurrent real-time tasks can be coordinated in a deterministic fashion.

Determinism and latency

Determinism plays a key role in limiting latency and managing the responsiveness of real-time embedded systems. The more deterministic the system, the more consistent its responsiveness will be. A primary factor affecting determinism is how many interrupts a system must handle concurrently. In general, increasing the number of interrupts in the system will erode its determinism.

Consider a system with a single interrupt that completes within 50 cycles. Latency for such an interrupt, then, is consistently on the order of 50 cycles. Note that even the simplest interrupts take on the order of 50 cycles by the time the microcontroller saves the context for a limited number of registers, accesses a peripheral, saves the data, restores the context and suffers a pipeline flush.

However, it isn't the prospect of handling a single interrupt that creates the most problems for developers in terms of determinism and latency. Rather, the hard challenge in meeting real-time deadlines arises when many interrupts happen at the same time. For example, if a higher-priority interrupt that completes within 75 cycles is introduced to the system, the latency of the first interrupt is affected since the higher-priority task can preempt it. Latency now ranges from 50 to 125 cycles for the lower-priority task.
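The worst case in that example is simple arithmetic: the low-priority handler's own cost plus one preemption by each higher-priority handler. A small sketch, using the 50- and 75-cycle figures from the text (assuming each higher-priority handler preempts at most once):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Worst-case completion time for a low-priority interrupt, assuming each
 * higher-priority handler can preempt it exactly once. */
static uint32_t worst_case_cycles(uint32_t own_cycles,
                                  const uint32_t *higher, size_t n_higher)
{
    uint32_t total = own_cycles;
    for (size_t i = 0; i < n_higher; i++)
        total += higher[i];
    return total;
}

/* The article's example: a 50-cycle handler preempted by a 75-cycle one. */
static uint32_t example_worst_case(void)
{
    const uint32_t higher[] = { 75 };
    return worst_case_cycles(50, higher, 1);
}
```

Each interrupt added above a task widens its latency window further, which is exactly the loss of determinism described next.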

As more interrupts are introduced, the latency for lower priority interrupts increases as determinism drops. A 50-cycle task might be interrupted multiple times and require on the order of 100s to 1000s of cycles to complete. This factor is important because not all interrupts can be high priority, relative to each other.

Determinism directly affects responsiveness, reliability, and accuracy. If a developer knows latency is fixed at 50 or 500 cycles, this can be taken into account during processing. However, when latency can vary from 50 to 500 cycles, the best developers can do is to assume a typical latency, such as 200 cycles, and treat any variation as error. Additionally, worst-case latency may begin to approach real-time deadline limits and threaten system reliability.

Reducing the potential number of concurrent interrupts (even low-frequency ones) through a DMA controller and event system can substantially increase system determinism as well as lower latency. Higher determinism in turn brings benefits such as higher accuracy.

To understand how determinism affects accuracy, consider the implementation of a power supervisory task to maximize AC power efficiency when driving a heavy load such as a motor. As the most energy is available when the voltage is at its peak and in phase with the current, this is when the system should draw the most current.

Conversely, the closer the voltage gets to zero (i.e., the zero-crossing point), the less power that is available and so the less efficient the current draw. By implementing power factor correction (PFC), power efficiency is improved by switching in and out large capacitors that will adjust the load to keep the AC current and voltage in phase.

Typically a comparator is used to detect the zero-crossing. When the voltage drops below or rises above a set threshold, the comparator toggles. Instead of the comparator triggering an interrupt and forcing the CPU to switch the capacitors, the event system can route the comparator event directly to the timer/counter output controlling the switch without CPU intervention.

Interrupt latency for a low priority task like PFC could run into thousands of cycles, depending upon how many higher priority interrupts occur concurrently. Higher latency means the capacitors are switched later than the optimal moment, reducing overall efficiency by a significant amount. Latency from event routing, in comparison, is at most two cycles.

Consider these numbers over the microcontroller's clock rate.

If the microcontroller is clocked at 32 MHz, two-cycle latency introduces negligible error (2/32,000,000, or about 62.5 ns). On the other hand, latency in the thousands of cycles could materially affect the accuracy of high-frequency tasks, which themselves must be processed every few thousand cycles. Note that this latency could be reduced to the order of 50 cycles if the interrupt were made a higher-priority task. However, that would mean assigning priority based on accuracy requirements rather than on the importance of the function to the system, and it merely shifts the inaccuracies caused by the lack of determinism onto other tasks.
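For the zero-crossing example, the latency figures translate directly into phase error on the mains waveform. A small sketch, assuming a 32 MHz clock and 50 Hz mains (both illustrative; the article does not fix the mains frequency):

```c
#include <assert.h>
#include <math.h>

/* Phase error introduced at the zero-crossing by response latency:
 * latency in CPU cycles converted to degrees of one mains cycle. */
static double phase_error_deg(double latency_cycles, double f_cpu_hz,
                              double f_mains_hz)
{
    double latency_s = latency_cycles / f_cpu_hz;   /* delay in seconds   */
    return latency_s * f_mains_hz * 360.0;          /* fraction of period */
}
```

Under these assumptions, 2,000 cycles of latency costs about 1.1 degrees of phase, while two cycles costs about a thousandth of a degree.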

Higher accuracy also plays a key role when generating signals, not just sampling them. Consider creating a 100 kHz waveform. Using interrupts, each edge of the waveform arrives slightly early or late depending on context-switching overhead and on whatever other interrupts have piled up, so the waveform's accuracy suffers in proportion to the latency variation relative to the signal rate. Note that while the waveform will be accurate on average, what matters in many cases is the relative difference between two consecutive samples.
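The effect on a generated waveform can be quantified the same way: a 100 kHz output from a 32 MHz clock has only 320 clock cycles per period, so latency variation eats directly into period accuracy. A sketch, with an assumed 32-cycle latency spread (an illustrative figure):

```c
#include <assert.h>
#include <math.h>

/* Period-to-period jitter of an interrupt-generated waveform, expressed as
 * the latency spread divided by the nominal period in CPU cycles. */
static double jitter_fraction(double f_cpu_hz, double f_out_hz,
                              double latency_spread_cycles)
{
    double cycles_per_period = f_cpu_hz / f_out_hz;
    return latency_spread_cycles / cycles_per_period;
}
```

A 32-cycle latency spread is already 10 per cent of the period; latency in the hundreds of cycles would swamp the signal, while the event system's two-cycle figure keeps jitter below one per cent.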

High-frequency signal processing

Generating signals is becoming a more common task in a wide range of embedded applications. Signals are used to generate sound, manage a voltage converter regulator, control actuators in industrial applications, and serve myriad other functions. The higher the frequency of the signal, the higher the load on the CPU when interrupts are employed and the greater the potential of increased latency for other tasks.

For events with a higher frequency of occurrence, CPU load becomes a major consideration. For example, a high-speed sensor needs to have samples collected before the next sample is ready in order to prevent loss of data. Consider that a flow meter, multi-axis positioning system, or instrumentation system collecting 2 Msamples/s for fast and accurate measurements will consume tens to hundreds of millions of cycles each second just to collect the samples.

With an event system and DMA controller, all of these cycles are offloaded, freeing the CPU to actually process the samples rather than simply buffer them. Even assuming a simple service routine that requires only 50 cycles including context-switching overhead, this results in offloading 100 Mcycles/s from the CPU. Without such offloading, many systems must dedicate a separate microcontroller to each high-frequency sensor or motor.
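The 100 Mcycles/s figure is straightforward to reproduce from the article's own numbers (2 Msamples/s, roughly 50 cycles per interrupt):

```c
#include <assert.h>
#include <stdint.h>

/* CPU load of interrupt-driven sampling: samples per second multiplied by
 * the cycles each service routine costs, including context switching. */
static uint64_t sampling_cycles_per_s(uint64_t samples_per_s,
                                      uint64_t cycles_per_sample)
{
    return samples_per_s * cycles_per_sample;
}
```

On a 32 MHz part, 100 Mcycles/s is more than three times the entire cycle budget, which is why interrupt-driven collection at these rates is simply infeasible without offloading.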

For higher frequency tasks, an event system and DMA controller also enable:

Accurate time-stamping: Time-stamping samples enables developers to better synchronize signals to external events. With two-cycle latency, time stamps are far more accurate than those marked by interrupts, whose latency can run to thousands of cycles.

Oversampling: One way to increase sensor resolution is to oversample. For example, dividing the sampling counter's period by 16 results in 16 times as many samples being collected, increasing the effective accuracy of the sensor. Because the CPU is not directly involved in taking and storing samples, it becomes possible to oversample without much penalty.

Dynamic frequency: Certain applications require higher sensing accuracy only at certain times or under specific operating conditions. For example, a water meter could sample faster when the flow rate is changing quickly and scale back when the flow is turned off or steady. Sampling frequency is easily adjusted without impacting real-time responsiveness.

Reduced stack size: An additional effect of reducing the number of concurrent interrupts is the ability to maintain a smaller stack. Because each interrupt performs a context save by pushing potentially dozens of registers onto the stack, eliminating several layers of nested context saves significantly reduces the stack size required. This can allow the application to get by with less RAM.

Immunity to scaling: Given that different microcontrollers support different numbers of peripherals, the number of interrupts in an application can vary across a product line. Even within the same microcontroller family, a higher-end system supporting more functions will have more interrupts, degrading overall determinism. Thus, migrating a design to a more integrated microcontroller could impact signal latency and consequently accuracy, both for sampling and for output.

Easy software changes: Because event handling eliminates CPU intervention, software changes can be made without impacting real-time responses. Even if more CPU time is needed to handle additional functions, event handling and response times remain exactly the same. Without this, it can be difficult to implement changes to real-time applications during a product's lifetime.

There are myriad tasks an embedded microcontroller could perform to reduce power consumption, increase accuracy, and improve the user experience. Many such tasks are simple monitors, checking only a single value. For example, a battery monitor watches until the voltage drops below a certain level; the system then triggers a shutdown to save application data while there is still enough power to do so.
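The battery-monitor example can be sketched as a host-side simulation: a comparator-style threshold check that produces a single shutdown event, with no CPU polling in between. The voltage trace and 3.0 V threshold below are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Return the index of the first sample below the threshold, i.e. the point
 * where a comparator event would fire and wake the CPU exactly once,
 * or -1 if the threshold is never crossed.  Values are in millivolts. */
static int shutdown_index(const uint16_t *mv, int n, uint16_t threshold_mv)
{
    for (int i = 0; i < n; i++)
        if (mv[i] < threshold_mv)
            return i;
    return -1;
}

/* A discharging battery crossing a 3.0 V shutdown threshold. */
static int example_shutdown(void)
{
    const uint16_t trace[] = { 3300, 3250, 3100, 2950, 2800 };
    return shutdown_index(trace, 5, 3000);
}
```

On real hardware the loop disappears entirely: the comparator watches the voltage in silicon and only the single shutdown event ever reaches the CPU.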

Improving the user experience is often a key differentiator in many consumer products. For example, an event system enables faster responsiveness to a wake-up keystroke or peripheral input, reacting within two cycles. Compare this to the responsiveness of using an interrupt, which also requires the system to return to active mode, lowering power efficiency. For this reason, developers often extend timer intervals, reducing responsiveness.

Using interrupts, the cost of implementing such tasks has been too high in terms of CPU processing required, added latency, and decreased determinism. With an event system and DMA controller, developers have the ability to implement such features while effectively bypassing the CPU. This not only reduces the number of interrupts the system must manage but also simplifies task implementation and management.

For example, consider an application that plays a warning message to the user under specific operating conditions. The pre-set sound file can be stored in a buffer and fed to the speaker through the appropriate peripheral using DMA. The event system, using a timer, ensures that the data is output at a rate of exactly 44,056 Hz. As a side benefit, because the frequency is accurate and consistent, sound fidelity is improved. From a performance standpoint, once the DMA and event system have been configured, the CPU is completely uninvolved in the playback task.
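The timer that paces such playback is set directly from the clock and sample rate. A sketch, assuming a 32 MHz system clock (an assumption; the article does not state the clock used for this example) and the 44,056 Hz rate above; a real design would also have to deal with the fractional remainder of the division:

```c
#include <assert.h>
#include <stdint.h>

/* Counter top value that makes a timer fire at the desired sample rate:
 * the CPU clock divided by the sample rate, truncated to an integer. */
static uint32_t timer_top(uint32_t f_cpu_hz, uint32_t sample_rate_hz)
{
    return f_cpu_hz / sample_rate_hz;
}
```

At 32 MHz and 44,056 Hz the counter reloads every 726 cycles; each reload is the event that clocks the next DMA transfer to the speaker, with no CPU involvement.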

To say such tasks become “free” would be an overstatement. Implementing them in this fashion, however, makes them feasible in a much wider array of applications. The combination of coprocessors, a DMA controller, and an event system frees up a controller to handle just the processing of a signal rather than have the majority of its resources consumed in the signal's cycle-intensive acquisition. As a result, the CPU retains much of its processing capacity for signal processing.

In this way, it becomes possible for a single controller to manage several high-frequency tasks rather than just one. This also simplifies system design, permits more tasks to be implemented at a lower price on a single microcontroller, enables easier correlation between multiple signals, and improves power efficiency.

For many applications, the ability to support multiple tasks can lead to important product differentiation. For example, a motor control application utilizing a DMA controller and event system could free up enough resources on the microcontroller that developers could implement advanced features such as PFC without increasing the system BOM.

In addition to increasing the performance capacity of a microcontroller by offloading interrupts, the event system can also reduce power consumption by a factor of up to 7X, depending upon the application. Table 2 shows the power figures for an application requiring 1.2 Mcycles/s. At 12 MHz, the microcontroller is in active mode 10 per cent of the time and standby mode the rest.

Implementing a DMA controller and event system reduces the number of cycles the CPU must execute each second, enabling the microcontroller to drop into idle or sleep mode. Given that the current draw in active mode is substantially greater than in idle or sleep mode, even a few percentage points' reduction in active-mode time results in significant power savings.

Table 2: An event system and DMA controller not only increase CPU capacity and performance, they can also significantly reduce power consumption, depending upon the application, by enabling the microcontroller to drop into idle or sleep mode more often. Given that the current draw in active mode is substantially greater than when in idle or sleep mode, even a small percentage change in active mode results in significant power savings.
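The power arithmetic behind Table 2 is a weighted average of active and idle current. A sketch using the 10 per cent active duty cycle from the text, with illustrative current figures (the article's actual current values are in Table 2, not reproduced here):

```c
#include <assert.h>
#include <math.h>

/* Average current draw for a given active-mode duty cycle: the smaller the
 * fraction of time spent active, the closer the average falls to the idle
 * figure. Currents are in milliamps. */
static double avg_current_ma(double active_ma, double idle_ma,
                             double active_fraction)
{
    return active_ma * active_fraction + idle_ma * (1.0 - active_fraction);
}
```

With an assumed 10 mA active and 1 mA idle draw, a 10 per cent duty cycle averages 1.9 mA; shrinking active time to 2 per cent would drop the average to about 1.2 mA.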

Embedded microcontrollers continue to improve performance through architectural changes that improve overall CPU capacity. Coprocessors offload well-defined, compute-intensive tasks from the CPU, DMA controllers relieve the CPU from moving data throughout the system, and event systems eliminate bottlenecks associated with multiple and frequently-triggered interrupts.

By reducing the number of concurrent interrupts the system must handle, developers can increase system determinism, leading to lower latency, improved signal resolution and accuracy, higher consistency and predictability, and greater system reliability. As a result, a single microcontroller can perform the work of multiple older microcontrollers while reducing system cost and power consumption.

Kristian Saether (kristian.saether@atmel.com) is product marketing manager AVR at Atmel in Norway
