As higher-performance 32-bit processor cores begin to make large gains in the microcontroller (MCU) space currently dominated by 8- and 16-bit devices, chip architects are facing system design challenges similar to those PC designers faced about a decade ago.
While the speed and performance of the new cores have increased, some of the key supporting technologies have not kept up, resulting in severe performance bottlenecks.
Most microcontrollers rely completely on internal memory devices of two types. Moderate amounts of SRAM provide the required data storage space, and NOR FLASH provides the instruction and constant data space.
Embedded SRAM technology is keeping pace with the increase in both size and operation speed of the new 32-bit cores. Mature SRAM technology is easily available in the 10ns (100 MHz) operational range and is cost-effective at this speed grade for the typical RAM sizes required by microcontrollers.
But standard NOR FLASH is lagging behind the basic 32-bit core clock speed by almost an order of magnitude. Current embedded NOR FLASH technology is sitting at around 50ns (20 MHz) access times. This introduces a real bottleneck in transferring data between the FLASH device and the core, since the core can waste several clock cycles waiting for each instruction to be retrieved from the FLASH memory.
This performance gap between processor core speed and FLASH access times is compounded by the standard microcontroller execution model, XIP (eXecute In Place).
Application fault tolerance and the cost of SRAM in larger memory sizes are two major reasons why executing directly from FLASH is preferable. Programs stored in FLASH are far less likely to be corrupted by random errors in the system, such as power rail glitches. Executing directly from FLASH also removes the need to supply the MCU device with enough SRAM to allow the application to be copied from a ROM or FLASH device into the targeted RAM execution space.
While improving FLASH technology so it matches the performance of 32-bit cores would be ideal, current technology prevents this. There are, however, some efficient methodologies the architect can employ to unclog the performance bottleneck.
Simple instruction pre-fetch buffers and i-cache systems placed into 32-bit MCU designs can have a profound effect on MCU performance. Following is a description of how system architects can employ these techniques when upgrading their MCU architecture from a 16-bit to a 32-bit CPU core.
Figure 1. Introducing a 32-bit core into an existing design
Introducing a 32-bit core into a 16-bit MCU design
Figure 1 above introduces the basic 32-bit core upgrade to an existing 16-bit design, showing the basic connections between the new 32-bit core and its basic peripheral set. Since we are discussing integration of a new 32-bit processor core into a new microcontroller design, we will assume the following specifications of the new 32-bit core to be used.
#1: Modified Harvard Architecture. As is common in microcontrollers, the new 32-bit core is of a modified Harvard Architecture, such that program memory and data memory spaces are implemented on two separate bus structures.
While a pure Harvard machine prevents data from being read from the program memory space, the modified Harvard Architecture design of this core allows this operation. Conversely, this 32-bit core design also allows program instructions to be executed from data memory space.
The program and data memory interfaces allow insertion of wait states in a standard bus cycle, allowing slower memory or memory-mapped devices time to respond.
#2: Operating Frequency. The maximum operating clock frequency of the new core is to be 120 MHz, six times faster than the 16-bit core it is replacing.
#3: Instruction Memory Interface. The interface for the instruction memory system presents a data bus that is 32 bits wide, along with a 20-bit wide address bus for a total addressable space of 1MB. While 32-bit cores generally have much larger address spaces, this is adequate for the target application space of this MCU. The standard control signals also provide the ability to insert wait states for slower memory devices.
The FLASH memory devices targeted for this design are the same technology used in the 16-bit design, with a maximum operation speed of 20 MHz.
#4: Data Memory Interface. The system SRAM and the memory-mapped peripherals are connected to the processor data bus via the system controller. The system controller provides additional address decoding and other control functions to allow the processor core to correctly access either its data memory or the memory-mapped peripherals without having to manage specific wait states, data widths or other special needs of each device mapped into data memory space.
The data bus between the system controller and the processor core is 32 bits wide, as is the data bus between the system controller and the SRAM. The data buses between the system controller and the peripherals and GPIO ports are either 8, 16 or 32 bits wide, as required.
The SRAM available in the targeted design is the same type used in the 16-bit design, which allows for 0 wait state operation at 120 MHz.
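The wait-state cost implied by these specifications can be worked out from the two clock rates alone. The following is a minimal sketch, assuming that a FLASH access must span a whole number of core clock cycles and that the first of those cycles is the normal bus cycle; the function name `wait_states` is hypothetical and used only for illustration.

```python
import math


def wait_states(core_mhz: float, memory_mhz: float) -> int:
    """Wait states the core must insert for one access to a slower memory."""
    # Number of core clock cycles that one memory access occupies, rounded up.
    cycles_per_access = math.ceil(core_mhz / memory_mhz)
    # The first cycle is the ordinary bus cycle; the remainder are wait states.
    return max(cycles_per_access - 1, 0)


# 120 MHz core reading from 20 MHz FLASH: six cycles per fetch, five of them waits.
print(wait_states(core_mhz=120, memory_mhz=20))  # 5
```

With the core and the FLASH both at 20 MHz, as in the 16-bit design, the same function returns zero wait states.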
The performance of the system is now controlled by several factors. The difference between the processor core speed and the FLASH memory device speed has the greatest performance impact, since at least five wait states must be added to every instruction fetch. Using a coarse rule of thumb that suggests there is a load or store at least every 10 instructions, the weighted average cycles per instruction (CPI) of a typical sequence is:
CPI = (9 inst * 6 FLASH cycles + 1 inst * 1 SRAM cycle) / 10 instructions
CPI = 5.5
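The same arithmetic can be written as a small weighted-average helper. This is only a sketch that reproduces the rule-of-thumb figures above; the function name `weighted_cpi` and the variable names are assumptions made for illustration.

```python
def weighted_cpi(mix):
    """Average cycles per instruction for (instruction_count, cycles_each) pairs."""
    total_cycles = sum(count * cycles for count, cycles in mix)
    total_instructions = sum(count for count, _ in mix)
    return total_cycles / total_instructions


# Nine instructions at six FLASH cycles each, one at a single SRAM cycle.
baseline_cpi = weighted_cpi([(9, 6), (1, 1)])
print(baseline_cpi)  # 5.5

# Instructions completed per microsecond at the 120 MHz core clock.
effective_mips = 120 / baseline_cpi
print(round(effective_mips, 1))  # 21.8
```

At a CPI of 5.5, the 120 MHz core completes roughly 22 million instructions per second, which is close to the 20 MHz clock rate of the core it replaces.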
The throughput of the core is dominated by the speed of the FLASH memory interface, and the penalty is very large. At this point, all the 32-bit core has done is double the data path width.
The SRAM interface is of no consequence in this case. While certain issues such as interrupt latency and atomic bit manipulation will probably arise from a memory interface perspective, the 0 wait state operation of the SRAM memory can be ignored. The focus must be on improving the performance of the instruction-side memory interface using currently available, cost-effective technology.
Improving the CPU Core: FLASH Interface
A common concept that came from the high-performance computing environment is the cache, in which smaller and faster memory stores are placed between the main memory devices and the processor core, allowing quicker access to bursts of data or program instructions.
While design and implementation of caches can be very complex (considering issues of cache tags, n-way association and general cache control), focusing only on program instruction memory makes this task easier. That is because accesses to program memory are strictly read-only in the case of our specific 32-bit core. Only having to worry about data flow in one direction reduces the complexity of the buffer and cache systems that follow.
Pre-Fetch Buffer. A simple way to increase the overall bandwidth of the FLASH interface is to widen the path between the processor and the FLASH device. Given that the speed of the FLASH memory is fixed, the only other way to increase bandwidth is to widen the interface so that more instructions can be fetched at one time, creating the appearance of a faster FLASH memory interface.
This is the basic premise behind a pre-fetch buffer. It takes advantage of its wider interface to the FLASH memory to read a larger amount of data in the same number of clock cycles it would normally take to read one word from FLASH memory.
The width of this new data path also defines the minimum size of the pre-fetch buffer store, since the buffer must hold everything read in a single FLASH access.
Figure 2 below shows our 120 MHz core interfacing to a 20 MHz FLASH array. Using the ratio of speed between the two systems as a starting point, we can determine how wide the pre-fetch buffer / FLASH interface must be if we would like to read instructions as if there were no wait states involved. In this case, the pre-fetch / FLASH data path would be:
(120 / 20) x 32 bits = 192 bits wide.
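The width calculation generalizes to other clock ratios. The sketch below assumes the path must carry a whole number of 32-bit instruction words and rounds the ratio up; `prefetch_width_bits` is a hypothetical helper name, not part of any real tool.

```python
def prefetch_width_bits(core_mhz: int, flash_mhz: int, word_bits: int = 32) -> int:
    """Minimum FLASH data path width that delivers one word per core clock."""
    # Words that must arrive per FLASH access (ceiling division on integers).
    words_per_access = -(-core_mhz // flash_mhz)
    return words_per_access * word_bits


print(prefetch_width_bits(120, 20))  # 192
```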
The pre-fetch buffer control logic keeps tabs on the number of read accesses to the buffer. After the last access, it will cause the next cycle to reload the entire buffer from the FLASH memory.
The pre-fetch buffer control logic also knows the effective address for every entry in the buffer. It provides the proper decoding to present the processor data bus with the correct sequential instruction, and reloads the buffer when an execution branch requires a completely new sequence of instructions.
Figure 2. Instruction Pre-Fetch
Branching, of course, will incur some additional delay in fetching the new instructions, but since the pre-fetch buffer implements a six-to-one advantage in data path width over the processor core, the branch penalty is acceptable when averaged over the instruction stream.
More rule-of-thumb analysis suggests that branches make up approximately 20 percent of a typical embedded application, translating into a branch once every five instructions. Using the same method as before, the CPI value is now:
CPI = (4 instructions * 1 cycle + 1 instruction * 6 cycles) / 5 instructions
CPI = 2.0
Points of improvement
Already we have seen a significant improvement in the cycle efficiency of the entire system over the basic implementation.
Figure 2 above also indicates a more realistic approach to the solution, summing the 32-bit buses of six individual FLASH memory systems together, rather than redesigning a new, extremely wide, data bus FLASH system. The pre-fetch buffer control logic would automatically create the six consecutive program addresses and then allow a normal read cycle to access all six banks in parallel. At the end of the read cycle, the pre-fetch buffer now holds six new instructions instead of just one, simulating a 0 wait state system.
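The behavior described above can be modeled in a few lines. This is a simplified sketch rather than a description of any real controller: it assumes the buffer always holds one aligned group of six words, that word `i` lives in bank `i % 6`, and that every reload costs a full six-cycle FLASH access. The class name `PrefetchBuffer` and its constants are hypothetical.

```python
WORDS_PER_LINE = 6      # six 32-bit FLASH banks read in parallel
FLASH_READ_CYCLES = 6   # core cycles per FLASH access (120 MHz / 20 MHz)


class PrefetchBuffer:
    """Toy model of a six-word instruction pre-fetch buffer."""

    def __init__(self, flash_words):
        self.flash = flash_words
        self.base = None   # word index of the first buffered instruction
        self.words = []

    def fetch(self, word_index):
        """Return (instruction, core_cycles) for one instruction fetch."""
        base = word_index - word_index % WORDS_PER_LINE
        if base == self.base:
            # Instruction is already buffered: single-cycle access.
            return self.words[word_index - base], 1
        # Reload: one FLASH access fills all six entries at once.
        self.words = self.flash[base:base + WORDS_PER_LINE]
        self.base = base
        return self.words[word_index - base], FLASH_READ_CYCLES


flash_words = list(range(100, 124))
buffer = PrefetchBuffer(flash_words)
trace = [0, 1, 2, 3, 13, 14]   # four sequential fetches, then a branch
print(sum(buffer.fetch(i)[1] for i in trace))  # 16
```

Note that this model also charges six cycles when sequential execution runs off the end of the buffer, whereas the CPI estimate in the text counts the penalty only on branches. It does not attempt to model any overlap between a reload and execution.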
Instruction Cache. The formal instruction cache takes the pre-fetch buffer to the next level of complexity, in that the cache array as a whole need not hold a single linear range of addresses. Formal caches are also larger than simple pre-fetch buffers, so they can possibly store an entire loop sequence within the cache memory.
Figure 3 below illustrates a simple eight-line, one-way cache design with 16-byte cache lines. While a cache this small would never be implemented, it is useful for illustration. In this case, the address tags would be the upper 12 bits of the total address, and the index would be the remaining two bits needed to address the specific entries in the cache line.
The instruction cache system has a more complex address comparison system than the instruction pre-fetch buffer, in that the cache array contains only contiguously addressed instructions within each cache line, but allows any range within the instruction address space to be contained within a cache line.
For the cache to be effective, it is also advisable to implement a wider data path between the FLASH device and the instruction cache memory to allow the cache to be filled as fast as the core can consume the program instructions.
Instruction cache implementations automatically include an instruction pre-fetch implementation in the interface to the FLASH memory to address this issue. Otherwise, the problem of FLASH access time is just moved one step farther away from the processor core.
During normal execution, an instruction fetch initiates a sequence of comparisons between the upper bits of the instruction address required and the instruction tags in the cache array. Should an address match be found, a cache hit is registered, and the lower bits of the instruction address are used to index within the cache line to retrieve the required instruction.
If no match is detected, this is known as a cache miss. A cache miss will cause the cache controller to read in a specific cache line from a memory range that contains the requested instruction. The cache line replaced is usually the oldest cache line in the array.
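The hit and miss sequence just described can be sketched as follows. This is an illustrative model only: it compares a requested address against every stored tag and evicts the oldest line on a miss, as the text describes, but it uses the full line address as the tag rather than the exact tag and index split of Figure 3. The class name `InstructionCache` and the `read_line` callback are hypothetical.

```python
from collections import OrderedDict

LINE_BYTES = 16   # bytes per cache line (four 32-bit instructions)
NUM_LINES = 8     # lines in the cache array


class InstructionCache:
    """Toy read-only instruction cache with oldest-line replacement."""

    def __init__(self, read_line):
        self.read_line = read_line    # callable: line base address -> 4 words
        self.lines = OrderedDict()    # tag -> line contents, oldest first
        self.hits = 0
        self.misses = 0

    def fetch(self, address):
        tag = address // LINE_BYTES              # upper address bits
        offset = (address % LINE_BYTES) // 4     # word within the line
        if tag in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            if len(self.lines) == NUM_LINES:
                self.lines.popitem(last=False)   # evict the oldest line
            self.lines[tag] = self.read_line(tag * LINE_BYTES)
        return self.lines[tag][offset]


flash_words = list(range(64))
cache = InstructionCache(lambda base: flash_words[base // 4: base // 4 + 4])
for address in (0, 4, 8, 12, 16, 0):
    cache.fetch(address)
print(cache.hits, cache.misses)  # 4 2
```

Because instruction fetches are read-only here, the model never has to write a line back before evicting it.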
Figure 3. Instruction Side Cache
The basic performance analysis when using a cache becomes more involved in that the number of cache misses now introduces another variable into the equation. Analyzing typical application code can help chip designers determine the best balance between cache size and the practical gain in performance.
For our design, it's reasonable to assume that the CPI will fall within the following range:
1.0 <= CPI (cache) <= 2.0
In the case where the cache is large enough to store the majority of the main loops of the application, the performance gain can be dramatic, since the system is now approaching the 0 wait state execution environment.
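One way to see how the miss rate moves the CPI within that range is a simple weighted model. The sketch below assumes that a hit costs one cycle and that a miss costs the same six cycles as an uncached FLASH read; both the function name `cache_cpi` and the six-cycle miss penalty are assumptions for illustration, not figures from a measured design.

```python
def cache_cpi(miss_rate, miss_penalty_cycles=6, hit_cycles=1):
    """Average CPI for an instruction cache with the given miss rate."""
    return (1 - miss_rate) * hit_cycles + miss_rate * miss_penalty_cycles


for miss_rate in (0.0, 0.05, 0.10, 0.20):
    print(f"miss rate {miss_rate:.0%}: CPI = {cache_cpi(miss_rate):.2f}")
```

Under these assumptions, a miss rate of zero gives the lower bound of 1.0, and a miss rate of 20 percent reproduces the upper bound of 2.0.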
One important advantage of using an instruction cache in a Modified Harvard Architecture design is that the cache need not implement a write-back operation. This makes the implementation easier than a data cache, which must also ensure modified cache data is properly stored into main data memory.
The Improved Design
We can now apply the lessons learned to the first block diagram of our system and benefit from the gains that the pre-fetch / i-cache buffer system provides. The result is Figure 4, shown below.
Figure 4. Improved 32-bit core design
The new design can execute instructions at least three times faster (120 MHz / 2.0 CPI (pre-fetch) / 20 MHz (16-bit clock)) than the previous 16-bit design and, depending on the final instruction cache size selected, will likely see performance very close to that of a system running from a single wait state FLASH memory system.
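The speedup figure follows from dividing each clock rate by its CPI. The sketch below assumes, as the formula above implicitly does, that the 16-bit design completes one instruction per 20 MHz clock; the function and variable names are hypothetical.

```python
def instruction_rate_mips(clock_mhz, cpi):
    """Millions of instructions completed per second."""
    return clock_mhz / cpi


old_design = instruction_rate_mips(20, 1.0)      # assumed 1 CPI at 20 MHz
with_prefetch = instruction_rate_mips(120, 2.0)  # pre-fetch buffer only
cache_ideal = instruction_rate_mips(120, 1.0)    # cache holding every main loop

print(with_prefetch / old_design)  # 3.0
print(cache_ideal / old_design)    # 6.0
```

The second figure is the ceiling set by the clock ratio alone. Where a real design lands between the two depends on the cache size and the application's miss rate.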
Although the instruction pre-fetch buffer is a simple implementation, it drastically improves system throughput by masking the access time difference between the FLASH memory and the 32-bit core execution speed. It requires very little extra logic.
The majority of the extra logic relates to widening the path between the FLASH memory system and the pre-fetch buffer. The simplicity of this design allows it to be completely transparent to the software programmer, who must only enable or disable the feature.
The formal instruction cache is a more complex solution that requires at least the same amount of additional logic as the pre-fetch buffer, as well as extra logic to manage proper operation of the instruction cache.
Designers should analyze the typical applications run on the MCU to determine the cache size that best balances performance and cost. The instruction cache is of course more expensive to implement, but in many cases the resulting performance of the system can reach that of a 0 wait state system.
Software programmers must also be aware of some basic control and maintenance issues related to the instruction cache, but in most cases these are set-and-forget operations that must only be executed during system initialization.
It is not enough to just directly replace an existing 8- or 16-bit core with a new 32-bit device. Chip designers must also adjust and improve the entire MCU design to adapt to the new requirements of high-performance, high-speed 32-bit cores.
This adjustment is needed to ensure that the maximum amount of performance is extracted from the new 32-bit core. Using the pre-fetch buffer and the i-side cache are two straightforward methods of improving a microcontroller design that must deal with a 32-bit core and existing memory technology.
Bob Martin is a senior applications engineer for MIPS Technologies, Inc. Bob has more than 20 years of experience in embedded system design in both hardware and software. He has designed 8-, 16- and 32-bit embedded hardware systems used in applications ranging from fault tolerant military and scientific systems to personal digital media players. Contact Bob at email@example.com.