Making sense out of 8-bit and 32-bit MCU options for your next IoT application - Embedded.com

I was in the middle of the show floor at Embedded World talking to an excitable man with a glorious accent. When I told him about our newly launched EFM8 MCUs, he stopped me and asked, “But why would I want to use an 8-bit MCU?” This wasn't the first time I had heard the question, and it certainly won’t be the last.

It's a natural assumption that just as the horse-drawn buggy gave way to the automobile and snail mail gave way to email, 8-bit MCUs have been eclipsed by 32-bit devices. While that MCU transition may become true in some distant future, the current situation isn't quite that simple. It turns out that 8- and 32-bit MCUs are still complementary technologies, each excelling at some tasks while performing equally well at others. The trick is figuring out when a particular application lends itself to a particular MCU architecture.

This article compares use cases for 8-bit and 32-bit MCUs and serves as a guide on how to choose between the two MCU architectures. Most of our 32-bit examples will focus on ARM Cortex-M devices, which behave very similarly across MCU vendor portfolios. There is a lot more architectural variation on the 8-bit MCU side, so it’s harder to apply apples-to-apples comparisons among 8-bit vendors. For the sake of comparison, we will use the widely used, well-understood 8051 8-bit architecture, which remains popular among embedded developers.

On Optimization and Holy Wars

Sometimes when I'm comparing things that people know (like ARM and 8051), I feel like I just posted “Star Trek is better than Star Wars” on an Internet forum. Things can get heated quickly.

The truth is that “Which is better: ARM Cortex or 8051?” is not a logical question. It's like asking, “Which is better: a guitar or a piano?” A much better question is “Which MCU will best help me solve the problem I'm working on today?” Different jobs require different tools, and our goal is to understand how best to apply the tools we have, including both 8-bit and 32-bit devices. Anyone who provides a simple answer to the “ARM versus 8051” question almost certainly has an agenda and is trying to sell something.

To compare devices, you need to measure them. There are many build tools to choose from, and I've tried to select the scenarios I believe provide the fairest comparison and are most representative of real-world developer experience. The ARM numbers below were generated with GCC, the nano C library and -O3 optimization.

I made no attempt to optimize the code for either device. I simply implemented the most obvious “normal” code that 90 percent of developers would come up with. I am much more interested in what the average developer will see than what can be achieved under ideal circumstances. It's certainly possible to spend a lot of time, effort and money tweaking 8051 code to make it do a job better suited to an ARM (or vice versa), but it's much easier to simply pick the best tool for the job in the first place.

Not all MCUs are created equal

Before we begin comparing architectures, it's important to note that not all MCUs are created equal. If we pit a modern ARM Cortex-M0+ processor-based MCU against an 8051 MCU built 30 years ago, the 8051 won't stand a chance in a performance comparison. Fortunately, a number of vendors have made continuing investments in 8-bit processors. For example, Silicon Labs’ EFM8 line of 8051-based MCUs has been updated to be far more efficient than the original 8051 architecture, and the development process has been modernized. The result is an 8-bit core that can easily hold its own against an M0+ or M3 core in many applications, and can even perform better in some.

Development tools are also important. Modern embedded firmware development requires a fully-featured IDE, ready-made firmware libraries, extensive examples, comprehensive evaluation and starter kits, and helper applications to simplify things like hardware configuration, library management and production programming.

Once an MCU has a modernized 8-bit core and development environment, there are many situations in which that MCU will excel in comparison to ARM-Cortex-based counterparts.

    Page index for this article

  1. Introduction
  2. General trade-offs
  3. Code-space efficiency
  4. Control vs. processing
  5. Working through the options

General Trade-offs

Before we dig into core architectures and other technical details, it's important to convey some general guidelines and context. In my college days, I remember taking a test and being so intent on scoring better and finishing before my classmates that I didn't notice that the test had been printed on the front and back of each page. Needless to say, I did finish first, but it's not an experience I'd care to repeat. There is no sense in analyzing complex MCU features and functionality if an application simply requires 256 KB of flash or volume pricing of $0.25. Those requirements are enough to indicate which MCU architecture is the best fit.

System Size

The first generality is that ARM Cortex-M cores excel in large systems (> 64 KB of code), while 8051 devices excel in smaller systems (< 8 KB of code). The middle ground could go either way, depending on what the system is doing. It's also important to note that in many cases, peripheral mix will play an important role. If you need three UARTs, an LCD controller, four timers and two ADCs, chances are you won't find all of those peripherals on an 8-bit part.

Ease of Use vs. Cost and Size

For systems sitting in the middle ground where either architecture can do the job, the big trade-off is between the ease of use that comes with an ARM core and the cost and physical size advantages that can be gained with an 8051 device.

The unified memory model of the ARM Cortex-M architecture, coupled with full C99 support in all common compilers, makes it very easy to write firmware for this architecture. In addition, there is a huge set of libraries and third-party code to draw from. Of course, the penalty for that ease-of-use is cost. Ease-of-use is an important factor for applications with high complexity, short time-to-market or inexperienced firmware developers.

While there is some cost advantage when comparing equivalent 8- and 32-bit parts, the real difference is in the cost floor. It's common to find 8-bit parts as small as 2 KB/512 bytes (flash/RAM), while 32-bit parts rarely go below 8 KB/2 KB. This range of memory sizes allows a system developer to move down to a significantly lower-cost solution in systems that don't need a lot of resources. For this reason, applications that are extremely cost-sensitive or can fit in a very small memory footprint will favor an 8051 solution.

8-bit parts also generally have an advantage in physical size. For example, the smallest 32-bit QFN package offered by Silicon Labs is 4 mm x 4 mm, while our 8051-based 8-bit parts are as small as 2 mm x 2 mm in QFN packages. Chip-scale package (CSP) options show a smaller difference between 8/32-bit architectures, but also come with increased cost and more difficult assembly requirements. Applications that are severely space-constrained often need to use an 8051 device to satisfy that constraint.

General Code and RAM efficiency

One of the major reasons for the lower cost of an 8051 MCU is that it generally uses flash and RAM more efficiently than an ARM Cortex-M core, which allows systems to be implemented with fewer resources. The larger the system, the less impact this will have.

It's also important to note that this 8-bit memory resource advantage does not always hold. In some situations, an ARM core will be as efficient or even more efficient than an 8051 core. For example, 32-bit math operations require only one instruction on an ARM device, while requiring multiple 8-bit instructions on an 8051 MCU. Obviously, the ARM architecture will be much more efficient for that code.
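To make that concrete, here is a trivial sketch of the operation in question (my illustration, not code from the benchmarks below): a single 32-bit addition.

```c
#include <stdint.h>

/* A Cortex-M core does this in a single ADD instruction; an 8051 must
 * chain an 8-bit ADD with three ADDCs (add-with-carry), plus the moves
 * needed to stage each byte of the operands. */
uint32_t add32(uint32_t a, uint32_t b)
{
    return a + b;
}
```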

The ARM architecture has two major disadvantages at small flash/RAM sizes: code-space efficiency and predictability of RAM usage.

The first and most obvious issue is general code-space efficiency. The 8051 core uses 1-, 2- or 3-byte instructions, and ARM cores use 2- or 4-byte instructions. The 8051 instructions are smaller on average, but that advantage is mitigated by the fact that a lot of the time, the ARM core can do more work with one instruction than the 8051. The 32-bit math case is just one such example. In practice, instruction width results in only moderately more dense code on the 8051.

In systems with distributed accesses to variables, the ARM's load/store architecture is often more important than its instruction width. Consider the implementation of a semaphore, where a variable needs to be decremented (allocated) or incremented (freed) in numerous locations scattered around the code. An ARM core must load the variable into a register, operate on it and then store it back, which takes three instructions. The 8051 core, on the other hand, can operate directly on the memory location and requires only one instruction. As the amount of work done on a variable at one time goes up, the overhead due to load/store becomes negligible, but for situations where only a little work is done at a time, load/store can dominate and give the 8051 a clear efficiency advantage.

While semaphores are not common constructs in embedded software, simple counters and flags are used extensively in control-oriented applications and behave the same way. A lot of common MCU code falls into this category.
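As a rough sketch of that counter pattern (my illustration, not measured benchmark code): brief touches on a shared byte from many call sites.

```c
#include <stdint.h>

/* An 8051 can compile each counter update to a single INC or DEC on a
 * direct address; a Cortex-M must emit a load, a modify and a store. */
volatile uint8_t free_buffers = 4;

int alloc_buffer(void)
{
    if (free_buffers == 0)
        return 0;       /* none available */
    free_buffers--;     /* 8051: one DEC; ARM: LDRB/SUBS/STRB */
    return 1;
}

void free_buffer(void)
{
    free_buffers++;     /* 8051: one INC; ARM: LDRB/ADDS/STRB */
}
```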

The other piece of the puzzle involves the fact that an ARM processor makes much more liberal use of the stack than an 8051 core. In general, 8051 devices only store return addresses (2 bytes) on the stack for each function call, handling a lot of tasks through static variables normally associated with the stack. In some cases, this creates an opportunity for problems, since it causes functions to not be re-entrant by default. However, it also means that the amount of stack space that must be reserved is small and fairly predictable, which matters in MCUs with limited RAM.
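To see why static allocation breaks re-entrancy, consider this sketch. It is illustrative only: I've written the local as an explicit static to mimic what an 8051 compiler such as Keil C51 does automatically when a function isn't declared reentrant.

```c
#include <stdint.h>

uint16_t sum_to_n(uint16_t n)
{
    static uint16_t total;   /* stands in for a statically overlaid local */

    total = 0;
    while (n)
        total += n--;
    return total;
}
```

If an ISR called sum_to_n() while main-loop code was already inside it, both invocations would share total and one result would be corrupted. In exchange, total's two bytes are allocated at link time, so worst-case RAM usage is an exact number rather than a stack-depth estimate.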

As a simple example, I created the following program. Then I measured the stack depth inside funcB and found that the M0+ core's stack consumed 48 bytes, while the 8051 core's stack consumed only 16 bytes. Of course, the 8051 core also statically allocated 8 bytes of RAM, consuming 24 bytes total. In larger systems, the difference is negligible, but in a system that only has 256 bytes of RAM, it becomes important.

#include <stdint.h>

uint16_t funcB(uint16_t testA, uint16_t testB);
void funcA(uint32_t a);

int main(void)
{
  funcA(0xACED);
  while (1);
}

void funcA(uint32_t a)
{
  uint8_t i, j = 0;

  for (i = 0; i < 3; i++) {
    j = funcB(i, j);
  }
}

uint16_t funcB(uint16_t testA, uint16_t testB)
{
  return (testA * testB) / (testA - testB);
}

Architecture Specifics

We've now painted our basic picture. Assuming there is both an ARM and an 8051-based MCU with the required peripherals, the ARM device will be a better choice for a large system or an application where ease-of-use is an important factor. If low cost/size is the primary requirement, then an 8051 device will be a better choice. Now it's time to look at a more detailed analysis of applications where each architecture excels and where our general guidelines break down.

Latency

There is a noticeable difference in interrupt and function-call latency between the two architectures, with 8051 being faster than an ARM Cortex-M core. In addition, having peripherals on the Advanced Peripheral Bus (APB) can also impact latency since data must flow across the bridge between the APB and the AMBA High-Performance Bus (AHB). Finally, many Cortex-M-based MCUs require the APB clock to be divided when high-frequency core clocks are used, which increases peripheral latency.

I created a simple experiment where an interrupt was triggered by an I/O pin. The interrupt does some signaling on pins and updates a flag based on which pin triggered the interrupt. I then measured several parameters, shown in the following table. The 32-bit implementation is listed here.

// Status var
volatile uint8_t hello;

// ISR
void GPIO_ODD_IRQHandler(void)
{
  GPIO->P[gpioPortA].DOUTSET = 0x03; // T1
  GPIO->P[gpioPortA].DOUTCLR = 0x01; // T2

  if (GPIO->IF & 0x0100) {
    hello = 4;
  } else {
    hello = 5;
  }

  GPIO->IFC = 0xFFFF;                // clear interrupt
  GPIO->P[gpioPortA].DOUTCLR = 0x02; // T3
}

// Main loop
while (1)
{
  hello = 0;
  GPIO->P[gpioPortA].DOUTSET = 0x04; // T0
  while (!hello);
  GPIO->P[gpioPortA].DOUTCLR = 0x04; // T4
  for (i = 0; i < 0x1000; i++);
}

|Parameter                  | ARM  | 8051 | Units |
|---------------------------|------|------|-------|
|ISR entry latency (T1-T0)  | 1.09 | 0.94 | µs    |
|Min pulse width (T2-T1)    | 0.09 | 0.08 | µs    |
|ISR execution time (T3-T1) | 1.09 | 0.74 | µs    |
|ISR exit time (T4-T3)      | 0.83 | 0.57 | µs    |
|Total                      | 3.10 | 2.53 | µs    |

The 8051 core shows an advantage in Interrupt Service Routine (ISR) entry and exit times. However, as the ISR gets bigger and its execution time increases, those delays will become insignificant. In keeping with the established theme, the larger the system gets, the less the 8051 advantage matters. In addition, the advantage in ISR execution time will swing to the ARM core if the ISR involves a significant amount of data movement or math on integers wider than 8 bits. For example, an ADC ISR that updates a 16- or 32-bit rolling average with a new sample would probably execute faster on the ARM device.
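For reference, the rolling-average work described above might look like the following sketch: a 16-sample exponential moving average held in a 32-bit accumulator (the filter choice is mine, purely for illustration).

```c
#include <stdint.h>

/* The shift-and-subtract arithmetic below is cheap single-register work
 * on a Cortex-M, but multi-byte, multi-instruction work on an 8051. */
static uint32_t acc;   /* average scaled by 16 to keep fraction bits */

uint16_t rolling_average(uint16_t sample)
{
    acc += (uint32_t)sample - (acc >> 4);   /* acc += sample - acc/16 */
    return (uint16_t)(acc >> 4);
}
```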

Control vs. Processing

The fundamental competency of an 8051 core is control code, where the accesses to variables are spread around and a lot of control logic (if, case, etc.) is used. The 8051 core is also very efficient at processing 8-bit data while an ARM Cortex-M core excels at data processing and 32-bit math. In addition, the 32-bit data path enables efficient copying of large chunks of data since an ARM MCU can move 4 bytes at a time while the 8051 has to move it 1 byte at a time. As a result, applications that primarily stream data from one place to another (UART to CRC or to USB) are better-suited to ARM processor-based systems.
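The copying advantage is easy to see in code. In the sketch below (mine; real code would just call memcpy, which applies the same trick internally), the word-wise loop moves 4 bytes per load/store pair on a 32-bit core, while an 8051 has no choice but the byte-wise form.

```c
#include <stdint.h>
#include <stddef.h>

void copy_words(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    while (nwords--)
        *dst++ = *src++;   /* one 32-bit load/store pair moves 4 bytes */
}

void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    while (n--)
        *dst++ = *src++;   /* an 8051 moves every byte individually */
}
```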

Consider this simple experiment. I compiled the function below on both architectures for variable sizes of uint8_t, uint16_t and uint32_t.

uint32_t funcB(uint32_t testA, uint32_t testB)
{
  return (testA * testB) / (testA - testB);
}

|Data type | 32-bit (-O3) | 8-bit | Units |
|----------|--------------|-------|-------|
|uint8_t   |           20 |    13 | bytes |
|uint16_t  |           20 |    20 | bytes |
|uint32_t  |           16 |    52 | bytes |

As the data size increases, the 8051 core requires more and more code to do the job, eventually surpassing the size of the ARM function. The 16-bit case is pretty much a wash in terms of code size, and slightly favors the 32-bit core in execution speed since equal code generally represents fewer cycles. It’s also important to note that this comparison is only valid when compiling the ARM code with optimization. Un-optimized code is several times larger.

This doesn't mean applications with a lot of data movement or 32-bit math shouldn't be done on an 8051 core. In many cases, other considerations will outweigh the efficiency advantage of the ARM core, or that advantage will be irrelevant. Consider the implementation of a UART-to-SPI bridge. This application spends most of its time copying data between the peripherals, a task the ARM core will do much more efficiently. However, it's also a very small application, probably small enough to fit into a 2 KB part.

Even though an 8051 core is less efficient, it still has plenty of processing power to handle high data rates in that application. The extra cycles available to the ARM device are probably going to be spent sitting in an idle loop or a “WFI” (wait for interrupt), waiting for the next piece of data to come in. In this case, the 8051 core still makes the most sense, since the extra CPU cycles are worthless while the smaller flash footprint yields cost savings. If we had something useful to do with the extra cycles, then the extra efficiency would be important, and the scales may tip in favor of the ARM core. This example illustrates how important it is to view each architecture’s strengths in the context of what the system being developed cares about. It's a simple but important step to making the best decision.
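To show how little code such a bridge really is, here is a hypothetical sketch with the UART modeled as a byte buffer. The uart_model type and its helpers are stand-ins I invented so the control flow is visible; real firmware would poll the UART status register and write the SPI data register instead.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const uint8_t *data;   /* models the UART receive stream */
    size_t len, pos;
} uart_model;

static int uart_rx_ready(uart_model *u)  { return u->pos < u->len; }
static uint8_t uart_read(uart_model *u)  { return u->data[u->pos++]; }

/* The entire application is essentially this one byte-pump loop. */
size_t bridge_poll(uart_model *u, uint8_t *spi_tx)
{
    size_t sent = 0;
    while (uart_rx_ready(u))
        spi_tx[sent++] = uart_read(u);
    return sent;
}
```

On either core this loop spends nearly all of its time waiting, which is exactly why the 8051's smaller flash footprint buys real savings while the ARM core's spare cycles buy nothing.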

Pointers

8051 devices do not have a unified memory map like ARM devices, and instead have different instructions for accessing code (flash), IDATA (internal RAM) and XDATA (external RAM). To enable efficient code generation, a pointer in 8051 code will declare what space it's pointing to. However, in some cases, we use a generic pointer that can point to any space, and this style of pointer is inefficient to access. For example, consider a function that takes a pointer to a buffer and sends that buffer out the UART. If the pointer is an XDATA pointer, then an XDATA array can be sent out the UART, but an array in code space would first need to be copied into XDATA. A generic pointer would be able to point to both code and XDATA space, but is slower and requires more code to access.

Segment-specific pointers work in most cases, but generic pointers can come in handy when writing reusable code where the use case isn't well known. If this happens often in the application, then the 8051 starts to lose its efficiency advantage.
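As a concrete illustration, here is how the pointer flavors might be declared in Keil C51 syntax. This is a vendor-specific sketch, not portable C: the xdata and code keywords are Keil extensions, other 8051 compilers use different keywords, and the function names are mine.

```c
/* Memory-space-specific pointers: 2 bytes, fast to dereference. */
void uart_send_xdata(uint8_t xdata *buf, uint8_t len);
void uart_send_code(uint8_t code *buf, uint8_t len);

/* Generic pointer: can address any space, but costs an extra byte and
 * a slower, library-assisted dereference. */
void uart_send_any(uint8_t *buf, uint8_t len);
```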

Working through the options

I've noted several times that math leans toward ARM and control leans toward 8051, but no application consists solely of math or control. How can we characterize an application in broad terms and figure out where it lies on that spectrum?

Let's consider a hypothetical application composed of 10 percent 32-bit math, 25 percent control code and 65 percent general code that doesn't clearly fall into an 8-bit or 32-bit category. The application also values code space over execution speed, since it does not need all the available MIPS and must be optimized for cost. The fact that cost is more important than application speed will give the 8051 core a slight advantage in the general code. In addition, the 8051 core has moderate advantages in the control code. The ARM core has the upper hand in 32-bit math, but that doesn't account for much of the application. Taking all these variables into consideration, this particular application is a better fit for an 8051 core.

If we make a slight change and say that the application is more concerned with execution speed than with cost, then the general-purpose code wouldn't really favor either architecture, and the ARM core would take full advantage in the math code. In this case, there is more control code than math, but the overall result would come out fairly even.

Obviously, there is a lot of estimation in this process, but the technique of deconstructing the application and then evaluating each component will help make sure we are aware of the cases where there is a significant advantage to be had for one architecture over the other.
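The deconstruct-and-weight technique can be captured in a few lines of code. The fit scores below are my illustrative guesses for the hypothetical cost-optimized application above, not measured data; the point is the method, not the numbers.

```c
typedef struct {
    double weight;    /* fraction of the application's code */
    double fit8051;   /* fit score, higher = better (illustrative) */
    double fitARM;
} category;

/* Positive result favors the 8051; negative favors the ARM core. */
double weighted_fit(const category *c, int n)
{
    double diff = 0.0;
    for (int i = 0; i < n; i++)
        diff += c[i].weight * (c[i].fit8051 - c[i].fitARM);
    return diff;
}

/* The application from the text: 10% 32-bit math, 25% control code,
 * 65% general code with a slight, cost-driven 8051 edge. */
double demo_score(void)
{
    const category app[3] = {
        {0.10, 1.0, 3.0},   /* 32-bit math: clear ARM advantage */
        {0.25, 3.0, 2.0},   /* control code: moderate 8051 edge */
        {0.65, 2.5, 2.0},   /* general code, cost-weighted      */
    };
    return weighted_fit(app, 3);
}
```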

Power Consumption

When looking at data sheets, it's easy to come to the conclusion that one MCU edges out the other in terms of power consumption. While it's true that the sleep-mode and active-mode currents will favor certain types of MCUs, that assessment can be extremely misleading.

Duty cycle (how much time is spent in each power mode) will always dominate energy consumption. Unless the duty cycle is the same on both parts, the data sheet current specs are virtually meaningless. The core architecture that best fits the application requirements will generally have lower energy consumption.

Consider a system where the device wakes up, adds a 16-bit ADC sample to a rolling average and goes back to sleep until the next sample. That task involves a significant amount of 16-bit and 32-bit math. The ARM device is going to be able to make the calculations and go back to sleep faster than an 8051 device, which results in a lower-power system, even if the 8051 has better sleep and active mode current. Of course, if the task being done is better suited to an 8051 device, then the MCU's energy consumption will come out in its favor for the same reason.
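A quick numeric sketch makes the point. The currents and timings below are invented round numbers, not data sheet values; they simply show how a part that finishes its work sooner can win on average current despite a higher active current.

```c
/* Average current over one wakeup period: active current for the time
 * spent awake, sleep current for the remainder. */
double avg_current_ua(double active_ua, double active_us,
                      double sleep_ua, double period_us)
{
    return (active_ua * active_us +
            sleep_ua * (period_us - active_us)) / period_us;
}
```

With a 1 ms sample period and 1 µA sleep current, a part drawing 5 mA but awake for only 10 µs averages about 51 µA, while a part drawing 3 mA but awake for 50 µs averages about 151 µA.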

Peripheral features can also skew power consumption one way or the other. For example, most of Silicon Labs' EFM32 32-bit MCUs have a low-energy UART (LEUART) that can receive data while in low power mode, while only two of the EFM8 MCUs offer this feature. This peripheral affects the power duty cycle and heavily favors the EFM32 MCUs over EFM8 devices lacking an LEUART in any application that waits for UART traffic. Unfortunately, there is no easy guide to assess these peripheral considerations other than asking your MCU vendor's local applications engineer. The system designer also should be aware of what processing tasks can be done in each MCU energy mode.

8-bit or 32-bit? I still can't decide!

What happens if, after considering all of these variables, it's still not clear which MCU architecture is the best choice? Congratulations! That means they are both good options, and it doesn't really matter which architecture you use. Past experience and personal preferences also play a big part in your MCU architecture decision if there is no clear technical advantage. In addition, this is a great time to look at future projects. If most future projects are going to be well-suited to ARM devices, then go with ARM, and if future projects are more focused on driving down cost and size, then go with 8051.

What does it all mean?

8-bit MCUs still have a lot to offer embedded developers and their ever-growing focus on the Internet of Things. Whenever a developer begins a design, it's important to make sure that the right tool is coming out of the toolbox. While I'm more than happy to sell an 8051 MCU to a customer who might be better served by a 32-bit device, I can't help but think of how much easier their job would be or how much better the end product would be if the developer spent just an hour thinking through that decision.

The difficult truth is that choosing an MCU architecture can't be distilled into one or two bullet points on a PowerPoint presentation. However, making the best decision isn't hard once you have the right information and are willing to spend a little time applying it.
