This is the final article in a series on achieving high security for MCU-based systems. The first article covered security basics and partitions; the second covered MPU management; the third covered the need for multiple heaps; and the fourth covered portals between partitions. The hunt for the Holy Grail of MCU security is not over, but we are closer. In this part we cover remaining topics to achieve fully isolated partitions. This is not intended as a tutorial. The references listed in the first part may be helpful for that.
Software Interrupt (SWI) API
ptasks can directly call system services in pmode. However, utasks require a Software Interrupt (SWI) interface for system services such as signaling a semaphore. The SWI API is implemented using the Cortex-M svc n instruction, which causes an SVC exception that results in switching to pmode and executing the desired system service. The parameter n selects the system service to be performed. The svc instruction is the only way that a utask can penetrate the pmode barrier, and then only to run a permitted system service. When the system service completes, the utask is resumed in umode with the return value and data, if any, from the service. Figure 1 illustrates this process.
Figure 1: System Service Calls from Both Modes. (Source: Author)
Not only system services but also the structures they use (e.g. Task Control Blocks) reside in pmode and thus are not accessible to a hacker from umode. In addition, services that could cause system damage are not permitted from utasks. Attempted use of a restricted system service results in a Privilege Violation Error (PVE), causing the Error Manager to run, the utask to be stopped, and recovery software to take control, thus stopping a hacker dead in his tracks.
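The dispatch and permission check described above can be pictured with a small host-side model. This is only a sketch under stated assumptions: the service numbers, `svc_dispatch()`, the table layout, and the `ERR_PVE` return code are hypothetical names chosen for illustration, not the API of any particular RTOS. On real hardware the entry point would be the SVC exception handler, and a violation would invoke the Error Manager rather than simply return a code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical service numbers: the n in "svc n". */
enum { SVC_SEM_SIGNAL, SVC_TASK_DELETE, SVC_COUNT };

#define ERR_PVE (-1)  /* hypothetical Privilege Violation Error code */

typedef int32_t (*svc_fn)(uint32_t arg);

static int32_t sem_count;  /* stands in for a semaphore held in pmode memory */

static int32_t sem_signal(uint32_t arg)  { (void)arg; return ++sem_count; }
static int32_t task_delete(uint32_t arg) { (void)arg; return 0; }

/* One entry per service: the handler and whether utasks may call it. */
typedef struct { svc_fn fn; bool utask_ok; } svc_entry;

static const svc_entry svc_table[SVC_COUNT] = {
    [SVC_SEM_SIGNAL]  = { sem_signal,  true  },
    [SVC_TASK_DELETE] = { task_delete, false },  /* could damage the system */
};

/* Model of the SVC handler body: it runs in pmode on behalf of the caller. */
int32_t svc_dispatch(uint32_t n, uint32_t arg, bool caller_is_utask)
{
    if (n >= (uint32_t)SVC_COUNT)
        return ERR_PVE;                    /* unknown service number */
    assert(svc_table[n].fn != NULL);       /* table must be fully populated */
    if (caller_is_utask && !svc_table[n].utask_ok)
        return ERR_PVE;                    /* restricted service from umode */
    return svc_table[n].fn(arg);
}
```

In this model a utask can signal the semaphore but is refused `SVC_TASK_DELETE`, while a ptask (passing `false` for `caller_is_utask`) may use both.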
Interrupt Service Routines (ISRs)
ISRs must execute in pmode, and thus they provide an attack surface into pmode, which is highly undesirable. As a consequence, ISR code must be minimized and written using the best secure programming techniques. To make matters worse, most RTOSs allow specialized versions of their services (e.g. xSemaphoreTakeFromISR()) to be called from ISRs. This increases the attack surface, so all of these services must also be written using the best secure programming techniques, which can become a monumental job.
A better solution is to use Link Service Routines (LSRs) to perform deferred interrupt processing, in which normal RTOS service calls can be made. Invoking an LSR from an ISR puts it into the LSR queue, lq. When all nested ISRs have run, the processor branches to the LSR scheduler, which runs the LSRs in the order invoked, thus maintaining temporal integrity. LSRs run ahead of all tasks, and therefore priority inversion is not possible. If LSR addresses are obfuscated, a hacker cannot easily access LSR code, so the attack surface is reduced to the ISR itself, which is a much more manageable problem.
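The queue-then-run behavior of LSRs can be sketched as a small ring buffer. This is an illustrative host-side model, not the actual implementation: `lsr_invoke()`, `lsr_scheduler()`, `LQ_SIZE`, and the `demo_lsr()` routine are assumed names, and a real port would also have to make the enqueue safe against nested ISRs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LQ_SIZE 8u   /* hypothetical queue depth; must be a power of two */

typedef void (*lsr_fn)(uint32_t par);
typedef struct { lsr_fn fn; uint32_t par; } lq_entry;

static lq_entry lq[LQ_SIZE];
static uint32_t lq_head, lq_tail;   /* free-running indices into lq */

/* Called from an ISR: queue the LSR instead of doing the work now. */
bool lsr_invoke(lsr_fn fn, uint32_t par)
{
    assert(fn != NULL);
    if (lq_tail - lq_head == LQ_SIZE)
        return false;               /* queue full: caller must report it */
    lq[lq_tail % LQ_SIZE] = (lq_entry){ fn, par };
    lq_tail++;
    return true;
}

/* Called after all nested ISRs have returned and before any task resumes. */
void lsr_scheduler(void)
{
    while (lq_head != lq_tail) {
        lq_entry e = lq[lq_head % LQ_SIZE];
        lq_head++;
        e.fn(e.par);                /* free to make normal service calls */
    }
}

/* Demo LSR that records the order in which LSRs actually ran. */
uint32_t lsr_log[LQ_SIZE];
uint32_t lsr_log_n;

void demo_lsr(uint32_t par)
{
    if (lsr_log_n < LQ_SIZE)
        lsr_log[lsr_log_n++] = par;
}
```

Because the scheduler drains the queue from the head, LSRs run in the same order in which the ISRs invoked them, which is the temporal-integrity property mentioned above.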
Critical code sections, particularly in low-level driver code, are generally protected by interrupt disable before and interrupt enable after. When moving a partition from pmode to umode, both of these instructions become NOPs – i.e. they offer no protection. One might think that SVC shell functions could be created for them. This is a Catch-22 situation – whereas the interrupt disable function would work, the interrupt enable function cannot work because interrupts (including SVC) are disabled!
Instead, SVC functions are provided to mask and unmask interrupts. Permitted interrupts are specified on a task basis so that a hacker cannot mask interrupts used outside of the partition he has infected.
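A per-task permission check of this kind might look like the following host-side model. The names `task_t`, `irq_permitted`, `svc_irq_mask()`, and `svc_irq_unmask()` are assumptions made for this sketch, and the `irq_enabled` word merely stands in for the interrupt controller's enable bits, which a real service would write while running in pmode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task record: one bit per IRQ number 0..31. */
typedef struct { uint32_t irq_permitted; } task_t;

/* Stands in for the interrupt controller's enable bits (1 = enabled). */
uint32_t irq_enabled = 0xFFFFFFFFu;

/* Example task that owns only IRQ 5 (say, its own peripheral). */
const task_t uart_task = { .irq_permitted = 1u << 5 };

static bool permitted(const task_t *t, unsigned irq)
{
    assert(t != NULL);
    return irq < 32u && (t->irq_permitted & (1u << irq)) != 0;
}

/* SVC service: mask one IRQ, but only if the calling task owns it. */
bool svc_irq_mask(const task_t *t, unsigned irq)
{
    if (!permitted(t, irq))
        return false;               /* not this partition's interrupt */
    irq_enabled &= ~(1u << irq);
    return true;
}

/* SVC service: unmask one IRQ under the same ownership rule. */
bool svc_irq_unmask(const task_t *t, unsigned irq)
{
    if (!permitted(t, irq))
        return false;
    irq_enabled |= 1u << irq;
    return true;
}
```

A umode driver would bracket its critical section with `svc_irq_mask()` and `svc_irq_unmask()` for its own IRQ. Since only that one interrupt is masked, the SVC exception itself stays available and the unmask call can still get through.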
Gaps and Tails
For Armv7-M (v7M) processors, the power-of-two size and alignment requirement for MPU regions creates serious memory waste. Armv8-M (v8M) fixed this, so this section applies only to v7M.
Gaps are wasted memory between region blocks. Unlike the heap discussed in part three, the IAR ILINK linker puts all disabled subregions after their region blocks. As a consequence, large gaps can occur. To counter this, we developed the MpuPacker utility; it fills gaps with smaller region blocks. For example, the region blocks of a moderate size system were organized by decreasing size, and then MpuPacker was applied. For ROM, total gap space decreased from 0x12200 to 0x400, a saving of 73,216 bytes; for SRAM, total gap space decreased from 0x400 to 0xe0, a saving of 800 bytes. If the reduction by MpuPacker is not enough, plug blocks can be defined from memory used for boot, initialization, and shutdown, which are not in run-time regions, and these can be used to provide more gap fill.
Figure 2 illustrates how gaps can be filled. The left side illustrates the initial locations of blocks. Total memory used is 1408 bytes. The right side illustrates moving two small blocks to aligned positions within the disabled subregions of the large block. Dashed lines show subregion boundaries. Final size is 1024 – a 23% reduction of wasted memory.
Figure 2: Moving Blocks to Fill a Large Gap. (Source: Author)
Tails are wasted memory inside of region blocks. A tail can be as large as the subregion size for the block, minus one byte. In the example cited above, total ROM tails = 34,740 bytes, and total SRAM tails = 38,248 bytes. This can be a big problem and one that is much harder to fix than gaps. However, if available memory is not being exceeded, spare memory might as well be distributed among tails. This is because tails provide expansion memory for partition updates without impacting other partitions, thus allowing smaller and faster updates.
The following techniques can be used to reduce tails:
- If a tail is greater than half the region size for the block, reduce the region size.
- If a tail is greater than or equal to the subregion size for the block, disable the subregion(s) occupied by the tail.
- If code or data slightly exceeds a subregion boundary, improve code or data efficiency in order to reduce the code or data below the subregion boundary, then disable that subregion or reduce the region size if the last 3 subregions were already disabled.
- If a spare slot is available in every partition template using the region, split the region block into two smaller region blocks, such that the sum of the tails is smaller than the original tail. This may require some experimentation.
- Use auxiliary slots to free up active slots and/or reduce region block sizes in order to apply the above methods.
- Split partitions into smaller partitions so that regions are smaller and the sum of the resulting tail sizes is smaller than the original sum of the tail sizes. This is likely to require adding new tasks. However, smaller partitions enhance security.
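To see where tails come from, the following sketch computes the v7M region size, subregion size, and resulting tail for a block of a given size. It assumes the usual v7M rules (power-of-two regions of at least 32 bytes, eight equal subregions available only for regions of 256 bytes or more); `block_fit` and `fit_block()` are names made up for this example and are not part of MpuPacker.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t region;     /* power-of-two region size, in bytes */
    uint32_t subregion;  /* region / 8, or the whole region if < 256 bytes */
    uint32_t enabled;    /* number of subregions left enabled */
    uint32_t tail;       /* unused bytes at the end of the enabled area */
} block_fit;

/* Fit a block of 'size' bytes into the smallest usable v7M region. */
block_fit fit_block(uint32_t size)
{
    block_fit f;

    assert(size > 0 && size <= 0x80000000u);

    f.region = 32;                       /* smallest v7M region */
    while (f.region < size)
        f.region <<= 1;                  /* next power of two */

    if (f.region >= 256) {               /* subregions are available */
        f.subregion = f.region / 8;
        f.enabled = (size + f.subregion - 1) / f.subregion;
    } else {                             /* too small for subregions */
        f.subregion = f.region;
        f.enabled = 1;
    }
    f.tail = f.enabled * f.subregion - size;
    return f;
}
```

For example, a 1100-byte block needs a 2048-byte region with 256-byte subregions; five subregions stay enabled and the tail is 180 bytes. A 1025-byte block gives the worst case for that region size, a 255-byte tail, whereas trimming a single byte lets it fit a 1024-byte region with no tail at all, which is the third technique in the list above.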
Obviously, the foregoing effort can take a lot of work and thus it is likely to be feasible only after all development is done. During development, we recommend using a pin-compatible MCU with much larger internal memory, if one is available. If not, then use stub partitions for partitions not being debugged. These either have less-important functionality removed or simply return constants for service requests. Stubs can also be used during the product support phase when tracking down bugs or vulnerabilities. In fact, partition stubs can be useful for debugging other partitions.
An ordinary debugger such as IAR C-SPY works fine for partition debugging. However, there are some differences from normal code debugging, as follows:
- Memory manage fault (MMF) blizzards. If you are moving a partition from pmode to umode, you are likely to encounter an MMF blizzard that goes on and on and on. The solution is patience, not suicide. All you can do is fix the current problem, run the code to the next MMF, and repeat the process. The call stack is a helpful tool. Clicking on the top entry takes you to the exact point in the code that caused the MMF. Usually this will be a function or variable outside of the partition, so it is easy to fix. Sometimes, however, it will be a parameter of the function. The parameter may be outside of the partition or it may be a handle (see below). Really tough problems are best solved by breaking at the point of MMF, then stepping through the code in the disassembly window.
- Handles. Most RTOSs use handles, and we become so accustomed to using them that we forget that handle names represent the addresses of the handles, themselves, which are, in turn, pointers to RTOS objects. The compiler, bless its little soul, wants to load and dereference the handle address in order to pass the value of the handle as an argument. It is easy to forget that you created an object outside of the current partition, hence the address of the handle is outside of the current partition and it triggers an MMF. One can struggle with this problem for a long time, without seeing it. The solution is to step through the assembly code. You will see an LDR into a register, then an LDR using that register, then voila! an MMF.
- Broken Call Stacks. We are accustomed to using call stacks to trace a problem back to its origin. For example, a file system function may be failing due to a wrong parameter in the file system service call. Going back to the origin of the call makes fixing a problem like this easy. When a direct call API is replaced with a portal, the call stack is broken because the file system server and the file system client are implemented in different tasks. This is why it is a good idea to preserve the capability to easily switch back and forth between direct calls and portal calls while debugging a client, portal, and server combination. Another approach is to record the client call address in the pmsg header.
- Wild Pointers. Uninitialized and corrupted pointers are latent bugs. These will usually trigger MMFs. The act of finding them is a gift from the Security God and you should pay proper homage – perhaps a small altar on your desk.
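The handle problem described above is easier to see in a small model. In this sketch the partition's data region is represented by a single struct, and `in_partition()` plays the role of the MPU check. All names here (`sem_obj`, `other_sem`, `pdata`, `partition_init()`) are hypothetical, and keeping a copy of the handle inside the partition is one possible remedy rather than a prescribed one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int count; } sem_obj;   /* stands in for an RTOS object */
typedef sem_obj *sem_handle;

/* Created elsewhere: both the object and the handle variable live
   outside the partition being debugged. */
static sem_obj other_obj;
sem_handle other_sem = &other_obj;

/* The partition's own data region, modeled as one struct. */
typedef struct { sem_handle local_sem; uint8_t other_data[60]; } part_data;
part_data pdata;

/* Model of the MPU check: is this address inside the partition's data? */
bool in_partition(const void *p)
{
    uintptr_t a = (uintptr_t)p, lo = (uintptr_t)&pdata;
    return a >= lo && a < lo + sizeof pdata;
}

/* Runs in pmode, before the task drops to umode. */
void partition_init(void) { pdata.local_sem = other_sem; }

/* Stands in for a system service; on a real system it runs in pmode,
   so touching the object itself is not the problem. */
int sem_signal(sem_handle h) { assert(h != NULL); return ++h->count; }
```

Passing `other_sem` from umode makes the compiler load from `&other_sem`, which is outside the partition, so that load is the one that faults. Passing `pdata.local_sem` loads from inside the partition's region instead.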
Arm Platform Security Architecture (PSA)
Platform Security Architecture (PSA) was developed by Arm Ltd. to provide high security, primarily for Cortex processors. Its main aspects are described in Ref. 1. The PSA Root of Trust (RoT) is the hardware and software implementation of PSA (see Ref. 2). According to it, a system is divided into a Secure Processing Environment (SPE) and a Non-Secure Processing Environment (NSPE). v8M TrustZone (TZ) is the hardware embodiment of the PSA Immutable RoT (see Ref. 3). SPE corresponds to the TZ Secure State and NSPE corresponds to the TZ Non-Secure State. For Cortex-M, each state has its own security hardware (MPU, SVC exception, pmode, and umode).
PSA RoT provides the following security services:
- Secure boot
- Secure update
- Encryption, decryption, and authentication
- Others specific to the SPE
The Application RoT provides additional vendor-specific security services.
Security services are implemented in Secure Partitions which are managed by the Secure Partition Manager (SPM). Secure boot is activated by power on or by system reset. The other PSA RoT and Application RoT services are available via the Inter Partition Communication (IPC) protocol. When the SPM is allowed to run by the RTOS scheduler, it runs each Secure Partition that has work to do in a multitasking manner.
The secure RTOS described in this article series fits into the Arm PSA as shown by “OS Kernel” in Figure 2 of Ref. 2, so we will call it OSK, in what follows.
If TrustZone is present, PSA RoT runs in the Secure State (SS) and OSK, with the application code, runs in the Non-Secure State (NSS). PSA RoT services can be called from either state. For calls from NSS, call gates are used. Since PSA RoT services can be preempted, a task using them must have two stacks, a secure stack and a non-secure stack, in order for it to resume in the state where it was preempted. This adds complexity to the OSK.
If TrustZone is not present, the OSK provides all run-time security. It is still possible to use PSA security services – especially boot and update. Since other PSA security services are already partitioned, it might be possible to convert RoT partitions into OSK partitions, to activate services via 32-bit event groups, and to use tunnel portals for commands and data transfers. Alternatively, it may be easier to just call the services directly from a pmode security partition.
OSK provides high security for a range of processors from small Cortex-M processors up to large Cortex-M TrustZone processors. PSA RoT provides high security for a range of processors from Cortex-M TrustZone processors up to the most powerful Cortex-A processors. Thus, by using both, device OEMs can cover a large range of devices.
Hardware Abstraction Layer (HAL)
HALs should be modularized such that there is a separate C and/or assembly file containing all code and data for each controller or functional area of an MCU, including system initialization and system shutdown. System initialization and shutdown code and data should not be accessible during runtime. Each controller HAL should contain its controller-specific initialization code and its own data. Common HAL data between modules is not acceptable, and common HAL code between modules requires scrutiny. Partitioning must be implemented from the ground up. In some cases this may require hardware redesign. For example, pin assignments should be once-programmable via fuses.
Suggested Hardware Improvements
The following are suggested Cortex-M processor improvements to achieve better partitioning and thus better security:
- Once turned on, the MPU cannot be turned off without rebooting the system.
- The MPU controls access to all memory, including the Private Peripheral Buses, the System Control Space, etc.
- The MPU can be changed only in super-privileged mode (spmode). Only ultra-trusted software, such as the task scheduler, should be allowed to change the MPU. This would make ptasks secure.
- Permit MPU regions to overlap as in v7M, in order to foster dynamic regions.
- Increase MPU slots to at least 10 or 12 so that a mixture of static pmode regions and active pmode/umode regions can be used. Then static sys_code and sys_data regions can always be present for exception handlers and system services to run in pmode.
- Eliminate Background Region (BR). BR permits accessing all memory, but MPU region attributes override default region attributes, and they depend upon what task was interrupted. Hence, using BR is like piloting a boat down the Rhine trying not to run aground on ever-shifting and hidden sand bars.
- RTOS and system services reside in spmode.
- spmode is probably determined by the memory locations of spmode code and data.
- spmode services are accessed via call gates.
- Implement an MPU Load instruction that loads all active regions, at once, from a task MPA in order to speed up task switching.
- Static MPU regions are loaded before the MPU is turned on.
- An IO bitmap is part of the MPU active area. For each bit, a pair of addresses in spROM specify the allowed IO address range. This solves the problem of some partitions needing to access multiple IO peripherals vs. limited MPU slots. It also allows limiting each IO region to exactly map one peripheral’s registers.
- Mini MPUs are assigned to single ISRs or shared between ISRs. An interrupt causes a switch to the MPU assigned to it, which has minicode, minidata, and ministack regions and an IO bitmap. ISRs cannot be trusted and should not have access to privileged code and data. Including an ISR minidata region in sys_data and activating PendSV to tail-chain after the ISR would enable deferred ISR processing.
- Provide a mechanism to invoke an LSR, with one parameter, from an ISR. The LSR runs in pmode and is capable of making system service calls.
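The IO bitmap suggestion in the list above can be modeled in software to show the intended check. This is purely a sketch of proposed, nonexistent hardware: the `io_table` contents, the peripheral addresses, and `io_access_ok()` are invented for illustration, with the table standing in for the address pairs that would be held in spROM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t first, last; } io_range;  /* inclusive bounds */

/* Stands in for the address pairs held in spROM; values are made up. */
static const io_range io_table[] = {
    { 0x40004400u, 0x400047FFu },   /* bit 0: hypothetical UART registers */
    { 0x40013000u, 0x400133FFu },   /* bit 1: hypothetical SPI registers  */
};
#define IO_RANGES (sizeof io_table / sizeof io_table[0])

/* An IO access is allowed if any bit set in the task's bitmap selects
   a range that contains the address. */
bool io_access_ok(uint32_t io_bitmap, uint32_t addr)
{
    assert(IO_RANGES <= 32u);       /* one bitmap bit per table entry */
    for (size_t bit = 0; bit < IO_RANGES; bit++) {
        if ((io_bitmap & (1u << bit)) != 0 &&
            addr >= io_table[bit].first && addr <= io_table[bit].last)
            return true;
    }
    return false;
}
```

A task whose bitmap is 0x1 can reach only the first peripheral, and each range can be made exactly as large as that peripheral's register block, without consuming an MPU slot per peripheral.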
Complexity vs. Simplicity
Layering ever-more-complex software on top of ever-more-complex hardware is not going to solve the security problem – it just increases the attack surface. Currently, software teams are confounded by the extreme complexity of unneeded hardware features. Multi-thousand-page manuals must become a thing of the past. Programmers cannot focus on security when they must struggle just to understand how the hardware works. New hardware architectures should be moving in the direction of making the programmer’s job easier, not harder, as is currently the case.
The objective of hardware architecture should be to achieve basic functionality as simply as possible. Complex operations should be left to software, except in cases where better performance is essential. Current architectures have departed widely from this ideal. The same applies to protocol stacks and to other standard software components. Getting rid of seldom, if ever, used features and focusing on achieving basic requirements in simple ways would substantially reduce attack surfaces.
What is needed is a just-enough design philosophy, not a copycat marketing mentality. Security will remain elusive until vast simplifications are made. A Software Bill of Materials (SBOM) (see Ref. 4) will soon be required for all computer system purchases by the federal government. The result of it will be: No Security, No Sale. Hopefully this will change the C-Suite thinking of device OEMs. Meanwhile, hackers will continue to sing “and the beat goes on”.
There seems to be widespread belief in the embedded software community that teaching programmers how to write secure code will solve the security problem. Training your team may help some, but the results are likely to be disappointing. In my experience, good programmers write good code and other programmers do not. In addition, good programmers are now, and always have been, in short supply.
From a management perspective, isolated partitions make sense. You go to war with what you have, and what you have is not likely to be an all-star team. Isolated partitioning requires perseverance rather than great programming skill. It allows you to put your best programmers on the most important partitions and your other programmers on the other partitions. Isolation guarantees protection of important partitions from less important partitions. Although a hacker might be able to hack one of the weaker partitions, he will not be able to get to the good stuff in the important partitions. Hence your device is safe from serious damage due to being hacked.
References
1. Arm Ltd., Arm Platform Security Architecture Security Model 1.0, Feb 21, 2019.
2. Arm Ltd., Arm Platform Security Architecture Firmware Framework 1.0, June 19, 2019.
3. Arm Ltd., Arm Platform Security Architecture Trusted Base System Architecture for Arm v6-M, Arm v7-M and Arm v8-M 2.0, Dec 13, 2019.
4. The White House, Executive Order on Improving the Nation’s Cybersecurity, May 12, 2021.
Ralph Moore is a graduate of Caltech. He and a partner started Micro Digital Inc. in 1975 as one of the first microprocessor design services. In 1989 Ralph decided to get into the RTOS business and he architected the smx RTOS kernel. After 20 years of selling Micro Digital products and managing the business, he went back into product development. Currently he does the whole job from product definition, architecture, design, coding, debug, documentation, patenting, to promotion. Recent products include eheap, SecureSMX, and FRPort. Ralph has three children and six grandchildren and lives in Southern California.