Enhancing ARM-based embedded SoC performance in high-bandwidth human-interface applications - Embedded.com

Broadband access has become commonplace, paving the way for a new content-rich Web experience that now includes both audio and video. The microcontrollers that drive these applications must store, process and move massive amounts of data at very high rates between the peripherals, the memory and the processor.

For example, digital cameras now have multi-million pixel sensors with huge bandwidth and memory requirements to process and store the vast amount of data. On the other hand, voice and music require less bandwidth. However, streaming content adds real-time constraints to the communications channel.

While chip vendors have addressed such processing challenges with high-throughput CPU cores that have DSP extensions, they have not done enough to accommodate the massive amounts of data that must be transferred between the peripherals, memories, the CPU and any on-chip co-processors.

System developers who are evaluating microcontroller alternatives for their design, or evaluating cores for use in their SoC designs, should look beyond raw MIPS. They should instead verify the controller's ability to move massive amounts of data without gobbling up all the CPU cycles.

It is necessary to evaluate closely the architectural alternatives that off-load data transfers between the peripherals and the memories to a peripheral DMA controller. Dedicated busses that service on- and off-chip memory, the CPU and any high-bandwidth peripherals will also help eliminate bus bottlenecks.

Finally, adding multiple external bus interfaces that allow simultaneous, parallel processing of data from external memories by the CPU and on-chip co-processors makes it possible for a system developer to take advantage of the full processing potential of advanced cores such as the ARM926EJ-S.

The drawbacks of current SoC designs
Before the advent of data-centric applications, the limiting factor in most applications was the ability of the CPU to process small amounts of data quickly. Recent innovations in controller architectures, particularly the addition of DSP extensions to the instruction set and much faster clocks, have overcome the processing challenges. Controllers such as those based on ARM's 926EJ-S core can handle a huge processing load. Unfortunately, communications with on- and off-chip memories have not kept pace.

Conventional 32-bit processors directly manage all communication links and data transfers. They first load data received by a peripheral into one of their internal registers, then store it from that register to a scratchpad in on-chip SRAM or external SDRAM. The CPU must then process the data and copy it back, through an internal register, to another communication peripheral for transmission. This Load Register (LDR) / Store Register (STR) scheme requires at least 80 clock cycles for each byte transferred.

An ARM9 processor running at 200 MHz with an internal bus at 100 MHz reaches its limit when a peripheral transfers data at about 20 Mbps. That is not enough to service an SPI or SSC, much less handle 100 Mbps Ethernet transfers (see Figure 1, below).

Figure 1. Traditional ARM Data Transfer Structure

If the memory management unit (MMU) and the instruction and data caches are disabled, the ARM9 controller is limited to only 4 Mbps, not enough to handle even a high-speed UART. The traditional solution to this severe bandwidth limitation has been to increase the processor clock frequency, which also increases both power consumption and heat dissipation. However, even the highest available frequency may not be sufficient to achieve the bandwidth required by today's applications.

Current applications may integrate high-bandwidth peripherals such as 100 Mbps Ethernet, 12 Mbps USB, 50 Mbps SPI, a VGA LCD controller and a 4+ megapixel camera interface. With the advent of these high-speed peripherals, even a 1 GHz processor does not have enough processing power to handle all the data transfers.

At 100 Mbps, the CPU does nothing but move data because there simply isn't any processing power left to do anything else. Thus, although processors can easily achieve the computational throughput to execute an application, they are not capable of moving data fast enough. The challenge is no longer computational; it's bandwidth.
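The figures quoted above follow from the 80-cycles-per-byte cost of LDR/STR copying. The short sketch below reproduces that arithmetic; it assumes the 80 cycles are counted at the 200 MHz CPU clock and that the CPU does nothing but copy bytes. The function names are illustrative only.

```python
def cpu_copy_limit_mbps(cpu_hz: float, cycles_per_byte: int) -> float:
    """Peak transfer rate in Mbps if every CPU cycle is spent copying bytes."""
    bytes_per_second = cpu_hz / cycles_per_byte
    return bytes_per_second * 8 / 1e6


def cpu_hz_needed(rate_mbps: float, cycles_per_byte: int) -> float:
    """CPU clock (Hz) consumed entirely by copying at the given data rate."""
    bytes_per_second = rate_mbps * 1e6 / 8
    return bytes_per_second * cycles_per_byte


if __name__ == "__main__":
    # 200 MHz ARM9, 80 cycles per byte (assumed to be CPU cycles)
    print(cpu_copy_limit_mbps(200e6, 80))   # 20.0 (Mbps)
    # Clock needed just to move 100 Mbps of Ethernet traffic
    print(cpu_hz_needed(100, 80) / 1e9)     # 1.0 (GHz)
```

Under these assumptions, a 100 Mbps stream consumes an entire 1 GHz processor, which is consistent with the claim that no cycles remain for the application.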

Manufacturers have tried to solve this problem by adding FIFOs to their on-chip peripherals. Unfortunately, FIFOs do not increase bandwidth; they just lower data transfer peaks by spreading the bus load over time. The archaic LDR/STR processor architecture requires the CPU to execute each and every one of those byte transfers, robbing it of cycles needed for processing.

A new approach to processor architecture that includes the use of simple, silicon-efficient DMA (Direct Memory Access) inside the individual peripherals and the addition of dedicated busses between high-throughput elements on the chip provides a lower-cost, lower-power solution to this problem.

The care and feeding of peripheral DMA
The use of DMA is a natural evolution for embedded architectures that have seen the number of on-chip peripherals and data transfer rates growing exponentially. DMAs solve part of the problem by allowing direct peripheral-to-memory transfers without any CPU intervention, thus saving valuable CPU cycles. DMAs can transfer data using one-tenth as much bus bandwidth as is required by the processor.

However, DMA controllers are designed primarily for memory-to-memory transfers. Such DMAs offer advanced transfer modes like scatter-gather and linked lists that are very effective for memory-to-memory transfers but are not useful for peripheral-to-memory data transfers. These modes add unnecessary software overhead and complexity to the system design.

A better approach is to use an optimized peripheral DMA between the peripherals and the memory. Peripheral DMAs require 90% less silicon than memory-to-memory DMAs, which makes it cost-effective to implement dedicated DMA channels for each peripheral.

Moving the DMA channel configuration and control registers into the peripheral memory space greatly simplifies the peripheral drivers (Figure 2, below). The application developer needs only to configure the destination buffer in memory and specify the number of transfers. The software overhead is minimal.

Figure 2. Optimized Peripheral to Memory DMA to deal with bus bottlenecks

Each UART or SPI, for example, has two dedicated peripheral DMA controller (PDC) channels, one each for receiving and transmitting data. The user interface of a PDC channel is integrated in the memory space of each peripheral and contains a 32-bit memory pointer register, a 16-bit transfer count register, a 32-bit register for the next memory pointer, and a 16-bit register for the next transfer count. The peripherals trigger PDC transfers using transmit and receive signals.

When the peripheral receives an external character, it sends a Receive Ready signal to the PDC, which then requests access to the system bus. When access is granted, the PDC starts a read of the peripheral Receive Holding Register (RHR) and then triggers a write in the memory. After each transfer, the relevant PDC memory pointer is incremented and the number of transfers left is decremented. When the memory block size is reached, the next block transfer is automatically started or a signal is sent to the peripheral and the transfer stops. The same procedure is followed, in reverse, for transmit transfers.

When the first programmed data block is transferred, an end-of-transfer interrupt is generated by the corresponding peripheral. The second block data transfer is started automatically, and the processing of the first block can be performed in parallel by the ARM processor. This removes the heavy real-time interrupt constraint of updating the DMA memory pointers on the processor and sustains high-speed data transfers on any peripheral.
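The receive sequence described above can be sketched as a small behavioral model. This is not vendor code or a register-accurate description; the class and attribute names are hypothetical, and the model assumes byte-wide transfers and that the "next" registers are cleared once they have been loaded.

```python
from dataclasses import dataclass


@dataclass
class PdcReceiveChannel:
    """Toy model of one PDC receive channel (names are illustrative)."""
    memory: bytearray
    pointer: int = 0        # current memory pointer register
    counter: int = 0        # current transfer count register
    next_pointer: int = 0   # next memory pointer register
    next_counter: int = 0   # next transfer count register
    end_of_transfer_events: int = 0

    def on_receive_ready(self, rhr_value: int) -> bool:
        """Handle one Receive Ready signal: copy the RHR value to memory.

        Returns False when no buffer is programmed and the transfer has stopped.
        """
        if self.counter == 0:
            return False
        self.memory[self.pointer] = rhr_value
        self.pointer += 1   # pointer incremented after each transfer
        self.counter -= 1   # transfers left decremented
        if self.counter == 0:
            # Block complete: signal end of transfer and start the next
            # block automatically from the "next" registers.
            self.end_of_transfer_events += 1
            self.pointer, self.counter = self.next_pointer, self.next_counter
            self.next_pointer, self.next_counter = 0, 0
        return True


ram = bytearray(8)
rx = PdcReceiveChannel(ram, pointer=0, counter=4, next_pointer=4, next_counter=4)
accepted = [rx.on_receive_ready(byte) for byte in b"ABCDEFGHI"]
```

In this run the first eight bytes land in two consecutive four-byte blocks and the ninth is refused, because no further buffer was programmed. In a real driver, the end-of-transfer interrupt handler would reload the next pointer and count while the second block is still filling.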

It is possible, at any moment, to read the location in memory of the next transfer and the number of remaining transfers. The PDC has dedicated status registers which indicate whether the transfer is enabled or disabled for each channel. Control bits enable reading of the pointer and counter registers safely, without any risk of their changing between the two reads.

The peripheral DMA frees the host CPU to focus on the computational tasks it was designed for without wasting cycles on data transfers. In fact, a peripheral DMA controller (PDC) can be configured to automatically transfer data between the peripherals and memories without any CPU intervention at all. Additionally, the PDC automatically adapts its addressing scheme according to the size of the data being transferred (byte, half word or word).

A PDC integrated in a 10-bit ADC that is configured to operate in 8-bit mode will generate byte transfers and automatically increment its address pointer by 1 after each transfer. In 10-bit mode the same PDC will transfer half words and increment its address pointer by 2.
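The addressing rule in the ADC example can be written down in a few lines. This is a minimal sketch, assuming that a sample is always stored in the smallest of byte, half word or word that holds it; the function names are hypothetical.

```python
def pdc_pointer_step(data_bits: int) -> int:
    """Bytes the address pointer advances per transfer (1, 2 or 4)."""
    if not 1 <= data_bits <= 32:
        raise ValueError("data width must be between 1 and 32 bits")
    if data_bits <= 8:
        return 1   # byte transfer
    if data_bits <= 16:
        return 2   # half-word transfer
    return 4       # word transfer


def buffer_bytes(samples: int, data_bits: int) -> int:
    """Memory needed to hold a block of samples at the given resolution."""
    return samples * pdc_pointer_step(data_bits)
```

For instance, 100 ADC samples occupy 100 bytes in 8-bit mode and 200 bytes in 10-bit mode, which is worth remembering when sizing the destination buffer.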

Effective use of a multi-layer bus structure
Another problem facing data-intensive applications is on-chip bus bandwidth. When multiple DMA controllers and the processor push massive amounts of data over a single bus, the bus can become overloaded and slow down the entire system. A 32-bit bus clocked at 100 MHz has a maximum data rate of 3.2 billion bits per second (3.2 Gbps).
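The 3.2 Gbps figure is simply bus width multiplied by clock rate. A minimal sketch of that calculation follows; it assumes one full-width transfer on every clock cycle, which is a best case that ignores arbitration and wait states.

```python
def bus_peak_gbps(width_bits: int, clock_hz: float) -> float:
    """Theoretical peak rate of a bus moving one full word per clock."""
    return width_bits * clock_hz / 1e9


single_bus = bus_peak_gbps(32, 100e6)   # 32-bit bus at 100 MHz
```

Because this is an upper bound, the usable rate on a shared bus is lower whenever masters contend for access or transfer less than a full word per cycle.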

Although that sounds like a lot, in data-intensive applications there may be so much data that the bus itself becomes a bottleneck. Such is the case with internet radio, where audio quality is a direct function of the ability to receive and process streaming content in defined timeslots, or GPS navigation involving interactive vector graphics. This situation can be avoided by providing multiple, parallel on-chip busses and a small amount of on-chip scratchpad memory (see Figure 3, below).

Figure 3. Multiple layered bus structure

External Bus Interface
When an application shares external memory between the processor and peripherals, the external bus interface limits the bandwidth. The next step to increase bandwidth is to provide two parallel external bus interfaces connected to the internal multi-layer bus: one for system memory and one that supports a high-speed peripheral or co-processor. In embedded applications with man-machine interfaces, the required amount of memory is so huge that it is not cost-effective to put it on the controller.

For example, a 24-bit color VGA panel requires a frame buffer of 900 KBytes. An LCD controller with this much SRAM would be prohibitively expensive, so the frame buffer must be stored in external RAM. The refresh rate is typically 60 frames per second. With a VGA (640x480 pixels) panel in 24-bit true-color mode, the CPU needs to fetch about 7.2 Mbits of data 60 times per second, or roughly 432 megabits per second (Mbps). A conventional 200 MHz ARM9 processor cannot possibly achieve this level of throughput.
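The frame buffer numbers can be checked with a few lines of arithmetic. The sketch below uses exact values; the function names are illustrative. Note that the exact refresh rate comes out near 442 Mbps, while the 432 Mbps figure above appears to result from rounding 900 KBytes x 8 to 7.2 Mbits before multiplying by 60.

```python
def frame_buffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Size of one frame in bytes."""
    return width * height * bits_per_pixel // 8


def refresh_mbps(width: int, height: int, bits_per_pixel: int, fps: int) -> float:
    """Data rate needed to re-read the whole frame fps times per second."""
    return width * height * bits_per_pixel * fps / 1e6


vga_bytes = frame_buffer_bytes(640, 480, 24)   # 921,600 bytes
vga_kbytes = vga_bytes / 1024                  # 900.0 KBytes
vga_rate = refresh_mbps(640, 480, 24, 60)      # 442.368 Mbps
```

Either way, the refresh traffic alone is more than twenty times the roughly 20 Mbps that a CPU-driven copy loop can sustain.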

Bandwidth can be increased by adding a second EBI and a 2-D (or other) graphics co-processor (see Figure 4, below). The second EBI is connected to a second external memory that is used as the LCD controller frame buffer. This memory is directly connected to the on-chip 2-D graphics co-processor, which offloads line draw, block transfer, polygon fill, and clipping functions from the CPU. The performance gain achieved from a second external bus interface is application dependent but can be expected to be in the range of 20 to 40%.

Figure 4. Dual External Bus Interfaces

This type of architecture is appropriate for data-intensive applications that have a graphical human-machine interface, such as networked medical monitoring equipment and GPS navigation systems.

This architectural approach off-loads the execution of data transfers between the peripherals and memories from the CPU. It does so by integrating 18 simple, silicon-efficient, single-cycle peripheral DMA controllers (PDC); five DMA controllers with burst mode support for the USB host, Ethernet MAC, camera interface, LCD controller and 2D graphics controller; and a memory-to-memory DMA controller with burst mode, scatter-gather and linked list support.

While a conventional ARM9 is overwhelmed by a 20 Mbps data rate, an ARM9 with sufficient peripheral DMA can easily handle the data transfers with 88% of its MIPS available for application execution.

Multi-layer Bus plus Generous on-chip SRAM
Traditional 32-bit processors with a single 100 MHz bus have a maximum on-chip transfer rate of just 3.2 Gbps to handle all instructions and all data shifted back and forth between the on- and off-chip memories, the CPU and the peripherals.

Although it sounds like a lot, 3.2 Gbps may not be enough to support the massive amounts of data, intensive processing, and real-time requirements of a system with an interactive human interface.

By implementing multiple dedicated busses between the peripherals, processor, data and instruction memories, plus ample on-chip scratchpad SRAM, streaming content can be received and processed in defined timeslots, avoiding bottlenecks that can occur in a single-bus architecture. The SRAM can be partly configured as tightly-coupled data and instruction memory (TCM). Multiple busses provide multiple parallel on-chip data transfer channels, ensuring that a single peripheral does not overwhelm the bus arbiter (see Figure 5, below).

Figure 5. A typical Peripheral DMA enhanced ARM with multiple buses.

A typical eleven-bus ARM9 (see Figure 3, earlier) would have seven busses dedicated to the DMA controllers: one each for the Ethernet MAC, USB host, camera interface, LCD controller, 2D graphics co-processor, the 2-channel memory-to-memory DMA controller and an 18-channel peripheral DMA controller (PDC).

Other busses might be dedicated to on- and off-chip memory. Two additional busses, one for data and one for instructions, can connect the processor with the tightly coupled memories. Finally, two busses can be used to connect the instruction and data cache controllers to the memories.

Once the memory address and block sizes are configured, the DMAs transfer data automatically. No additional programming is required. When two DMAs and/or the processor access the same memory, an arbiter controls the access using 1) round robin, 2) fixed or 3) default master arbitration schemes, as selected by the programmer.
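To illustrate how two of these schemes differ, here is a minimal behavioral sketch of fixed-priority and round-robin arbitration among bus masters. It assumes that a lower index means higher priority in the fixed scheme, and it does not model default master arbitration. It is an illustration of the general techniques, not of any particular vendor's arbiter.

```python
from typing import List, Optional


def fixed_priority_grant(requests: List[bool]) -> Optional[int]:
    """Grant the lowest-numbered requesting master (index 0 = top priority)."""
    for master, requesting in enumerate(requests):
        if requesting:
            return master
    return None


def round_robin_grant(requests: List[bool], last_granted: int) -> Optional[int]:
    """Grant the next requesting master after the one served last."""
    count = len(requests)
    for offset in range(1, count + 1):
        master = (last_granted + offset) % count
        if requests[master]:
            return master
    return None


# Masters 0 and 2 both request the same memory on every cycle.
requests = [True, False, True]
fixed_sequence = [fixed_priority_grant(requests) for _ in range(4)]

rr_sequence = []
last = 0
for _ in range(4):
    last = round_robin_grant(requests, last)
    rr_sequence.append(last)
```

With a fixed scheme master 0 wins every cycle and master 2 is starved, whereas round robin alternates between the two. That trade-off between guaranteed latency for one master and fairness across all of them is what the programmer is selecting.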

The graphics in 2-D man-machine interfaces require nearly a MByte of external memory for the frame buffer alone, plus a 432 Mbps data rate just to refresh a 640 x 480 LCD in 24-bit true-color mode. The required bandwidth is out of reach for conventional ARM9s.

The use of two external buses readily solves this problem: one for the system memory and one for the human interface. The second EBI should have dedicated busses to both the on-chip 2-D graphics co-processor and the LCD controller. This second EBI eliminates the need for the LCD controller and CPU to share memory, and can increase available CPU MIPS by 20% to 40%.

Some ARM-based controller vendors are employing these techniques to meet the growing need for real-time data stream processing with human interfaces. A variety of ARM7 and ARM9-based microcontrollers are available today that allow high data rates and maximum CPU throughput.

Many ARM9s have multiple dedicated busses for the CPU instruction and data cache controllers, as well as for all high-bandwidth peripherals. Depending on the number of on-chip peripherals, ARM9-based MCUs are available today that have between five and eleven independent 32-bit busses, and a maximum on-chip data rate of between 16 Gbps and 41.6 Gbps. Finally, ARM-based controllers with dual external bus interfaces (EBI) can support intensive graphics processing or large data buffers.
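A rough way to relate bus count to aggregate bandwidth is to multiply the single-bus peak by the number of independent busses. The sketch below does this under the assumption that every bus is 32 bits wide, runs at 100 MHz and is busy on every cycle; the function name is illustrative.

```python
def aggregate_peak_gbps(bus_count: int, width_bits: int = 32,
                        clock_hz: float = 100e6) -> float:
    """Sum of theoretical peak rates over independent, identical busses."""
    return bus_count * width_bits * clock_hz / 1e9


five_bus = aggregate_peak_gbps(5)      # 16.0 Gbps
eleven_bus = aggregate_peak_gbps(11)   # 35.2 Gbps
```

Five busses reproduce the 16 Gbps lower bound. Under the same assumptions eleven busses give 35.2 Gbps, so the quoted 41.6 Gbps upper figure presumably reflects a higher bus clock or additional transfer paths that this simple model does not capture.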

These architectural enhancements can enhance the ARM9's performance so that 20 Mbps data transfers that would overwhelm a conventional ARM9 can take place continuously with 88% of the processor's cycles available for application execution. Providing separate memories for the CPU and PDC can increase the processor's available MIPS to 100%.

The combination of an eleven-layer bus, dual EBIs and peripheral DMA controller allows an ARM9 with an LCD controller to refresh the 640 by 480 VGA screen 60 times a second with 100% of CPU cycles still free for other functions!

This relatively simple, silicon-efficient addition of DMA, busses and external memory interfaces to the microcontroller architecture turns a processor that effectively has no MIPS for application execution into one that can transfer all the data and still have 200 MIPS remaining for application execution.

Jacko Wilbrink is an ARM marketing manager, and Dany Nativel is the technical product marketing manager for ARM-based MCUs at Atmel Corp.