A step-by-step process for achieving MPU security

Most Cortex-M MCUs, both in the field and in development, have Memory Protection Units (MPUs). However, because of tight delivery schedules and the difficulty of using the Cortex-M MPU, these MPUs are either under-used or not used at all. The apparently large waste of memory caused by the requirements that MPU regions be powers of two in size and aligned on size boundaries has been an additional impediment to adoption in systems with limited memory.

Yet for these MCUs, the MPU and the SVC instruction are the only means of achieving acceptable security. Therefore, I set out a year and a half ago to determine whether the problems with the MPU could be overcome and whether it was possible to devise a practical way to upgrade post- and late-development projects, as well as new projects, to use MPU security. I have found that it is practical to do this, and MPU-Plus has been developed to ease the process.

Part 1 introduced the key elements of this porting process and you should read that article before continuing with this article, which describes the step-by-step process for completing the conversion.

Step-by-Step MPU Security Conversion

Start by identifying the most untrusted or vulnerable task or partition that you wish to isolate from the rest of the system. This might be a networking partition or third-party SOUP[1]. It could be the site of a recent hack where it has been decided that it is easier to isolate the vulnerable code than to fix it. We recommend an incremental approach to improving system security. Significant gains can be made by isolating one bad actor at a time. However, this approach also works for porting multiple partitions at once, if necessary or desirable. Figure 4 illustrates the conversion process:


Figure 4: Step-By-Step Conversion Process (Source Micro Digital)

1. Start

To start, put a call to sb_MPUInit() near the beginning of the startup code. This turns on the MPU and enables its background region. Your application should run normally. Note: disable loading sys_code into MPU[5][2] and sys_data into MPU[6] in sb_MPUInit(), since these regions have not yet been defined.

2. System Regions

Next, define .sys_code and .sys_data sections. sys_code should contain all handler and ISR shell[3] code. If an ISR does not use a shell, then the ISR, itself, must be included. This is done as in the following examples, first for assembly code:

        SECTION `.sys_code`:CODE:NOROOT
        THUMB
    smx_PendSV_Handler:
        smx_MPU_BR_ON       ; turns on MPU background region
        …                   ; Handler code
        smx_MPU_BR_OFF      ; turns off MPU background region
        cpsid   f
        sb_INT_ENABLE
        pop     {pc}

then for C code:

    #pragma default_function_attributes = @ ".sys_code"
    void sb_OS_ISR0(void)    /* ISR shell */
    {
       smx_ISR_ENTER();      /* turns on MPU background region */
       /* ISR body or call ISR here */
       smx_ISR_EXIT();       /* turns off MPU background region */
    }
    #pragma default_function_attributes =

sys_data contains the System Stack (SS)[4]. The BR macros can be eliminated wherever sys_code and sys_data are all that is needed for an ISR or handler to operate.

Then in the linker command file:

    define exported symbol scsz = 0x1000;
    define exported symbol sdsz = 0x400;
    define block sys_code  with size = 0x1000, alignment = scsz
                                               {ro section .sys_code};
    define block sys_data  with size = 0x400,  alignment = sdsz
                                               {block CSTACK, block EVT};

Of course, actual sizes depend upon the application. They should be the next power of two that is large enough. The alignment must equal the size. Now re-enable loading sys_code into MPU[5] and sys_data into MPU[6] in sb_MPUInit(). These are permanent regions that are present for every ptask as shown in Figure 2. They may also be present for utasks, but are not accessible in umode because they are privileged regions.
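The sizing rule can be checked with a few lines of C. The following is a minimal sketch and not part of MPU-Plus: the helper names are hypothetical, and the 32-byte minimum region size assumes an ARMv7-M MPU (Cortex-M3, M4, or M7).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: ARMv7-M MPU, whose smallest region is 32 bytes. */
#define MPU_MIN_REGION_SIZE 32u

/* Hypothetical helper: round the bytes actually used by a section
   up to the next power of two that is large enough to hold it. */
static uint32_t mpu_region_size(uint32_t used_bytes)
{
    assert(used_bytes > 0u && used_bytes <= 0x80000000u);
    uint32_t size = MPU_MIN_REGION_SIZE;
    while (size < used_bytes) {
        size <<= 1;
    }
    return size;
}

/* Hypothetical helper: a region is usable only if its size is a
   power of two and its base address is aligned on that size. */
static bool mpu_region_is_valid(uint32_t base, uint32_t size)
{
    bool pow2 = (size >= MPU_MIN_REGION_SIZE) && ((size & (size - 1u)) == 0u);
    return pow2 && ((base & (size - 1u)) == 0u);
}
```

For instance, a hypothetical .sys_code section occupying 0xC40 bytes would need a 0x1000-byte block aligned on a 0x1000 boundary, which is the shape of the sys_code block defined above.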

3. Super regions

The next step is to define super regions for SRAM, ROM, DRAM, other memories, and I/O areas in your system. These serve as temporary replacements for BR until partition- or task-specific regions are defined. Consult the linker map to determine the starting address and how much memory is being used in each memory. Then pick the next larger power of two for the size. The following template is an example:

    const MPA mpa_tmplt_sr =
    {
      {0x20000000 | V | 0, PRW_DATA | N67 | (0x11 << 1) | EN}, /* SRAM in use */
      {0x00200000 | V | 1, PCODE    | N57 | (0x11 << 1) | EN}, /* ROM in use */
      {0xC0000000 | V | 2, PRW_DATA       | (0x10 << 1) | EN}, /* RAM in use */
      {0x40040000 | V | 3, PRW_DATA       | (0x11 << 1) | EN}, /* Synopsys HS */
      {0x40011000 | V | 4, PRW_DATA       | (0x09 << 1) | EN}  /* UART1 */
    };
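The shifted constants in the template above, such as (0x11 << 1), appear to be the region size field of the ARMv7-M MPU attribute and size register, where a field value n selects a region of 2^(n+1) bytes. Under that assumption, the conversion between a power-of-two size and the field value can be sketched as follows. The function names are hypothetical and are not MPU-Plus APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: ARMv7-M MPU size encoding, where field value n
   means a region of 2^(n+1) bytes and the minimum n is 4 (32 bytes). */

/* Hypothetical helper: size field value -> region size in bytes. */
static uint32_t mpu_region_bytes(uint32_t size_field)
{
    assert(size_field >= 4u && size_field <= 30u);
    return 1u << (size_field + 1u);
}

/* Hypothetical helper: power-of-two region size -> size field value. */
static uint32_t mpu_size_field(uint32_t region_bytes)
{
    /* Must be a power of two of at least 32 bytes. */
    assert(region_bytes >= 32u && (region_bytes & (region_bytes - 1u)) == 0u);
    uint32_t n = 4u;
    while ((1u << (n + 1u)) != region_bytes) {
        n++;
    }
    return n;
}
```

Read this way, the SRAM, ROM, and Synopsys HS super regions are 256 KB each, the region at 0xC0000000 is 128 KB, and the UART1 region is 1 KB.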

Super regions encompass all other regions for each memory or I/O area. Hence it is simpler to use physical addresses and sizes, as shown above, rather than complicating the linker command file by defining super blocks in it. Load this template into the Memory Protection Array (MPA) after creation, for each task being converted. For example:

    smx_Idle = smx_TaskCreate(ainit, PRI_SYS, 500, SMX_FL_LOCK, "idle");
    #if SMX_CFG_MPU
    smx_TaskSet(smx_Idle, SMX_ST_MPA, (u32)&mpa_tmplt_sr);
    #endif

When a task’s MPA is loaded, its mpav flag is set. Tasks with mpav set run with BR off; all other tasks run with BR on. This allows working with one task at a time. It also allows tasks that are intended to stay in pmode to be left alone.
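The effect of the mpav flag can be pictured as a choice of MPU_CTRL value at task switch time. The sketch below is an illustration only: the task structure and function name are hypothetical, and it assumes that BR corresponds to the PRIVDEFENA bit of the ARMv7-M MPU_CTRL register. The actual MPU-Plus scheduler code may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-M MPU_CTRL bits. */
#define MPU_CTRL_ENABLE     (1u << 0)  /* MPU on */
#define MPU_CTRL_PRIVDEFENA (1u << 2)  /* background region for privileged code */

/* Hypothetical task record holding only the flag discussed here. */
typedef struct {
    bool mpav;  /* set once an MPA template has been loaded for the task */
} task_t;

/* Hypothetical helper: tasks with mpav set run with BR off,
   all other tasks keep the background region on. */
static uint32_t mpu_ctrl_for_task(const task_t *task)
{
    uint32_t ctrl = MPU_CTRL_ENABLE;
    if (!task->mpav) {
        ctrl |= MPU_CTRL_PRIVDEFENA;
    }
    return ctrl;
}
```

With this arrangement, a task that has not yet been converted sees the whole memory map as before, while a converted task is confined to the regions in its MPA.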

Now run the system. The targeted task is likely to get Memory Manage Faults (MMFs). This indicates that it needs access to other things, such as functions, static variables, or peripherals. Dealing with this problem may require enlarging a region more than one would like. However, this is the preferable approach at this time if code changes can be avoided. (Code changes are best left for future passes, when more is known about what is needed.)

For each task running with BR off, a security gain has just been made: handlers, ISRs, and other tasks are running as they were before, but this task is running with reduced memory regions and these regions have strictly controlled attributes (RO, XN, etc.). It is quite possible that latent bugs will start to show up and be fixed – especially if the SOUP is thick and the comments are thin.

4. Pmode operation

The next step is pmode operation. For simplicity, we will assume that a single task, taskA, is being isolated. Tasks processed in this step must be running in super-regions with BR off. Hence, the sys_code and sys_data regions are required in MPU[5] and MPU[6] to handle exceptions.

The first step is to group code and data into task-specific regions and to define blocks in the linker command file to hold these regions. It is convenient to name them after the task, e.g. taskA_code and taskA_data, or after the partition, e.g. usbh_code and usbh_data.

Next, define common code and data regions to hold RTOS and other system services and to hold common data needed by them. We have named these pcom_code and pcom_data, respectively. At this point, taskA is a ptask, so pcom_code needs to include the RTOS and other system services needed by taskA, and pcom_data needs to include the data needed by these services.

Then, create mpu_tmplt_taskA and add code to load it into the MPA for taskA, as shown in step 3. At this point the mpa_tmplt_sr has been replaced by mpu_tmplt_taskA for this task. taskA is standing alone and is partially isolated from all other tasks. Will it run? This is where “the tire meets the road”. MMFs from taskA are likely to be due to references outside of its regions or due to attribute violations (e.g. writing to ROM). The former indicates that the task or partition needs access to other code or other data than expected.

Solving MMFs may consist, in many cases, of just moving taskA-specific code and data into taskA_code and taskA_data regions, respectively. Assigning regions to tasks varies from task to task – even within the same partition. Some tasks may not need all of the code and data regions, but may need other regions, such as I/O regions. Standard C library calls are also likely to show up. Their libraries can be included in pcom_code in the linker command file.

SB_MPA_SIZE can be increased up to 5 for pmode. If this is not enough, it may be necessary to merge one or more regions into larger regions. For example, in mpa_tmplt_usbh for pmode (see Figure 2), the UART1 and Synopsys regions have been merged into a region called apb0_csr. This region encompasses all APB0 peripherals, which is undesirable. However, this move will be reversed in the next step.

5. Umode operation

The final steps are to set SMX_CFG_UMODE to 1 and to move taskA into umode. This is done by setting its umode flag, as follows:

    #if SMX_CFG_UMODE
    smx_TaskSet(taskA, SMX_ST_UMODE, 1);
    #endif

This normally follows the smx_TaskSet() shown in step 4. Now, when taskA is dispatched, the sb_PendSV_Handler() will set the processor CONTROL register to 3 just ahead of starting the task. This causes the processor to run in unprivileged thread mode using the task’s stack. When a task is resumed, the EXC_RETURN value (0xFFFFFFED) also causes this.
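The value 3 follows from the layout of the Cortex-M CONTROL register, in which bit 0 (nPRIV) selects unprivileged thread mode and bit 1 (SPSEL) selects the process stack. A minimal sketch is shown below. The function name is hypothetical, and the assumption that ptasks also run on the process stack (CONTROL = 2) is mine rather than something stated here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cortex-M CONTROL register bits. */
#define CONTROL_nPRIV (1u << 0)  /* 1 = thread mode is unprivileged */
#define CONTROL_SPSEL (1u << 1)  /* 1 = thread mode uses the process stack */

/* Hypothetical helper: CONTROL value to apply before starting a task.
   Assumption: every task runs on its own (process) stack. */
static uint32_t control_for_task(bool umode)
{
    uint32_t control = CONTROL_SPSEL;
    if (umode) {
        control |= CONTROL_nPRIV;
    }
    return control;
}
```

Because unprivileged code cannot clear nPRIV by writing CONTROL, a utask cannot simply promote itself back to pmode.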

In addition, add:

    #if SMX_CFG_UMODE
    #include "xapiu.h"
    #endif

ahead of taskA code that calls system services in umode. This forces the SWI API to be used by taskA for these calls. taskA functions that run only in pmode (e.g. initialization) should be moved to the tops of taskA modules or put into their own module(s). In the former case, #include "xapiu.h" is placed after them and it is effective to the end of the module; in the latter case it should not be placed in the module at all. #include "xapiu.h" is also not needed ahead of functions that do not make any calls listed in it.

If a function that makes a listed call is missed, the call will cause an MMF when the function runs in umode, so this is easy to find and fix.
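One way a header can take effect from its point of inclusion to the end of a module is by redefining service names as macros. The sketch below illustrates that general preprocessor technique. It is an assumption about the mechanism, not the actual contents of xapiu.h, and every name in it is hypothetical.

```c
/* Hypothetical stand-ins for a direct service call and its
   shell, with counters so that the routing can be observed. */
static int direct_calls;
static int shell_calls;

static int sem_signal(int sem)       { direct_calls++; return sem; }
static int sem_signal_shell(int sem) { shell_calls++;  return sem; }

/* pmode-only function, placed at the top of the module,
   ahead of the remapping, so it calls the service directly. */
static int taskA_init(void)
{
    return sem_signal(1);
}

/* What an xapiu.h-style header might do: from this line to the end
   of the module, the service name expands to its shell. */
#define sem_signal(sem) sem_signal_shell(sem)

/* umode function: the same source-level call now goes via the shell. */
static int taskA_main(void)
{
    return sem_signal(2);
}
```

This also shows why the position of the include matters: only code that follows it is rerouted, so pmode functions above it are left untouched.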

Before actually running taskA, mpu_tmplt_taskA must be changed. taskA_code and taskA_data regions should stay the same. However, pcom_code and pcom_data must be replaced with ucom_code and ucom_data. At the very least smx and system service call functions must be replaced with the smx and system service shells. However, there is likely to be much more than this to do.

As shown in mpa_tmplt_usbh, in Figure 2, the syn_csr and ur1_csr regions have been split apart. This significantly improves security by preventing the usbh and fs tasks from accessing all of the other peripherals on the APB0 bus, as was possible in step 4. Also, SB_MPA_SIZE must be increased to 6. The BR flag in MPU_CTRL is now automatically set to 1 for utasks in order to handle interrupts and exceptions, so sys_code and sys_data are no longer needed by utasks, but they are still needed by isolated ptasks. MPA[6] can be used for an additional region, if needed, by increasing SB_MPA_SIZE to 7. Thus, there are up to 7 regions available for utasks and only up to 5 regions for isolated ptasks. The 8th region, MPU[7], is reserved for task stacks and is controlled by the task scheduler.
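The slot arithmetic above can be summarized in a short sketch, assuming an 8-slot MPU. The constant and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MPU_SLOTS   8u  /* MPU[0] through MPU[7] */
#define STACK_SLOTS 1u  /* MPU[7], reserved for the task stack */
#define SYS_SLOTS   2u  /* MPU[5] and MPU[6], sys_code and sys_data */

/* Hypothetical helper: the largest MPA a task can use. Isolated ptasks
   give up two slots to sys_code and sys_data; utasks do not need them
   because BR is on for handlers and ISRs. */
static uint32_t max_task_regions(bool is_utask)
{
    uint32_t regions = MPU_SLOTS - STACK_SLOTS;
    if (!is_utask) {
        regions -= SYS_SLOTS;
    }
    return regions;
}

/* Hypothetical helper: does a template with the given number of
   regions fit the budget for this kind of task? */
static bool template_fits(uint32_t template_regions, bool is_utask)
{
    return template_regions <= max_task_regions(is_utask);
}
```

A six-region template therefore fits a utask but would have to be merged down to five regions for an isolated ptask.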

When taskA first starts running as a utask, you are likely to see both MMFs and PRIVILEGE VIOLATION errors. The latter indicate that restricted service calls are being made by taskA. This may necessitate recoding to not use those services in taskA, but to call them from a ptask, instead. Or it may work better to split taskA into a ptask, which calls these services and a utask, which does not. Alternatively, taskA could start as a ptask, make all of the restricted service calls, then restart itself as a utask. (It must restart itself so that the PendSV_Handler() will change CONTROL to 0x3.)

Restricting system services for utasks is important because if malware in a utask can make calls such as deleting another task or shutting down the system, security obviously is not good. However, it may be necessary to permit these weaknesses in the first pass in order to avoid excessive recoding. It may help that different versions of xapiu.h can be applied to different partitions. Thus, relatively good code might be allowed more liberties than not so good code.

If converting a ptask to a utask results in it sharing code or data with another ptask, there is likely to be a problem. Potential solutions are:

  • Putting all tasks into one partition and converting the whole partition, at once, instead of task by task. This is ok if all of the ptasks are destined to be utasks.

  • Defining a common region accessible by both tasks. This might be acceptable for common code such as C libraries, but is not acceptable for pdata.

  • Passing global data via messages or pipes.

  • Splitting taskA into a ptask and a utask, where the ptask shares pcode and pdata with other ptasks and the utask does not.

  • Starting a task as a ptask to perform pfunctions required for task initialization, then switching it to a utask that performs only ufunctions.

  • Replicating common routines using different names.

Restructuring tasks often involves minimal code change – the complex code often resides in subroutines called from tasks and they may not be impacted.

Size and Performance Impact

MPU-Plus adds about 2000 bytes of code to an application. It adds 8*SB_MPA_SIZE * NUM_TASKS bytes of RAM and 8*SB_MPA_SIZE bytes of ROM per MPA template to the application code. For the example port, wasted code space was only 14.5% and wasted data space was an even smaller 2.5%. Since this is a large amount of code that was not tailored to the MPU, these are probably typical numbers. If the wasted code space is still too much, using smaller regions will help (the data space has smaller regions than the code space). There are also other techniques that can be applied.
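The RAM and ROM formulas are easy to evaluate for a given configuration. In the sketch below, the function names are hypothetical, and the factor of 8 is taken to be one MPA entry of two 32-bit words, as in the templates shown earlier.

```c
#include <stdint.h>

/* Assumption: one MPA entry is two 32-bit words, i.e. 8 bytes. */
#define MPA_ENTRY_BYTES 8u

/* Hypothetical helper: RAM used by the per-task MPAs. */
static uint32_t mpa_ram_bytes(uint32_t mpa_size, uint32_t num_tasks)
{
    return MPA_ENTRY_BYTES * mpa_size * num_tasks;
}

/* Hypothetical helper: ROM used by a single MPA template. */
static uint32_t mpa_template_rom_bytes(uint32_t mpa_size)
{
    return MPA_ENTRY_BYTES * mpa_size;
}
```

As an illustrative calculation with made-up numbers, a system with SB_MPA_SIZE of 6 and 20 tasks would use 960 bytes of RAM for MPAs, plus 48 bytes of ROM for each template.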

Since I/O regions are directly accessible in umode and since code runs as fast in umode as in pmode, performance reduction is primarily due to overhead on system services and task switching. Execution overhead for typical system services, such as smx_SemTest() or smx_SemSignal() is about 100%. Percent overhead is less for longer system services, since the overhead is a fixed amount of time per system service. Overhead for task starting is about 25%, for loading the task’s MPA into the MPU, and for other MPU conditional code. It is about half of this for task resumption and negligible for task suspension or stopping.

Conclusion

The foregoing demonstrates that there is a feasible step-by-step process to incrementally improve the security of Cortex-M embedded systems that have MPUs and that this process can be performed on systems already released and by people who did not write the code. Although substantial work may be required, a clear path has been defined to achieve security objectives in a cost-effective manner. This is better than doing nothing and should be considered before disaster strikes.

NOTES:

[1] Software of Unknown Pedigree.

[2] MPU[N] is used herein to mean MPU slot N.

[3] An ISR shell is a C or assembly wrapper that calls the body of an ISR. These are typically not needed for Cortex‑M processors, but they may be used by the RTOS for generality, and they are useful here to minimize the amount of code to put into .sys_code.

[4] Also called the Main Stack (MS) by ARM Ltd.


Ralph Moore graduated with a degree in Physics from Caltech. He spent his early career in computer research. Then he moved into mainframe design in the 60's and became a consultant in the early 70's. He taught himself programming and became a microprocessor programmer. He founded Micro Digital in 1975, and many years of successful consulting led into architecting and developing the smx kernel in 1987. For many years he managed the business and sales, but in recent years he has been focused almost solely on development of the smx multitasking kernel v4. He can be reached by email at .
