Achieving full MCU partition isolation: Heaps

Full microcontroller (MCU) partition isolation is essential for achieving high security in MCU-based systems. Continuing a series on this complex topic, this article discusses the need for multiple heaps and the heap features that are useful in partitioned embedded systems.

This is the third article in a series on achieving high security for MCU-based systems. The first article covered security basics and partitions; the second covered MPU management. In this article, we cover the need for multiple heaps and the heap features that are useful in partitioned embedded systems. The right heap is important for achieving full MCU partition isolation. This is not intended as a heap tutorial; Ref. 2 may be helpful for that.

There was a day when embedded systems were so simple that heaps were seldom used. Since then, however, complexity has grown so much that most embedded systems, and even RTOSs, now use heaps. If only one heap is in use, heap efficiency is not of much concern. In a typical partitioned system, however, there may be several heaps. Multi-heap support becomes necessary because sharing a common heap between partitions breaks partition isolation: a hacker can wreck a common heap and bring down every partition that uses it, or possibly invade those partitions through it. In a multi-heap situation, heap efficiency and features become important. Below, various heap features are discussed that are useful in fully partitioned embedded systems.

Doubly Linked Chunks

Heaps consist of linked chunks. Each chunk contains metadata used by the heap and the data block returned to the user. In dlmalloc (see Ref. 1) and many of its derivatives, the block size is stored at the start of each chunk and also at the end of each free chunk. This achieves very high memory efficiency – only 8 bytes of overhead per inuse chunk. However, it is only possible to trace the heap forward (by successively adding chunk sizes, starting from the heap start), not backward.

A better design for embedded systems is a forward link and a backward link in every chunk. This permits continual forward heap tracing during idle periods to find broken links, and when a broken link is found, backward tracing to fix it. (Backward tracing is also necessary to check each link when tracing forward.) Heaps contain vital information such as task stacks, memory protection arrays, and other system and application control structures. Hence, heap self-testing and self-healing is important for embedded systems exposed to background radiation, electrical disturbances, excess heat, and other environmental phenomena encountered by many embedded systems — not to mention hackers.
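The forward trace with backward-link checking described above might be sketched as follows. This is a minimal sketch; the names `CCB`, `heap_scan`, and `heap_fix` are illustrative, not from any particular heap, and the chunk carries only its two links.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chunk control block: forward and backward links,
   as described above (names are illustrative). */
typedef struct ccb {
    struct ccb *fl;   /* forward link to next chunk */
    struct ccb *bl;   /* backward link to previous chunk */
} CCB;

/* Trace forward from the heap start during idle time, verifying each
   backward link. Returns the chunk whose successor's backward link is
   wrong, or NULL if the chain is intact. */
CCB *heap_scan(CCB *start)
{
    for (CCB *c = start; c->fl != NULL; c = c->fl) {
        if (c->fl->bl != c)
            return c;   /* broken backward link found */
    }
    return NULL;
}

/* Self-healing: a broken backward link can be repaired from the
   still-valid forward link. */
void heap_fix(CCB *c)
{
    c->fl->bl = c;
}
```

Note that a broken forward link requires the opposite: tracing backward from the heap end to repair it, which is why both links are needed.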

Heap Bins

dlmalloc was one of the first heaps to use heap bins, and most other heaps developed since also use them. A heap bin consists of doubly linked free chunks, in a certain size range, linked to the bin header. Normally the smallest-size bins have only one size and are organized by successive chunk sizes (e.g. 24, 32, …, 56) up to a maximum size. This is referred to as a Small Bin Array (SBA). Allocating or freeing an SBA chunk is very fast – the size is used as an index into the SBA, and the first chunk is taken or returned. Above the SBA are the upper bins [1], which consist of a mixture of small bins and large bins. Small bins have one size; large bins have a range of sizes (e.g. 128 to 248, 256 to 504, etc.). The last upper bin is called the top bin, and it has all of the remaining sizes.

Accessing the correct upper bin requires size comparisons, then searching the bin for the best match, and if a big-enough chunk is not found, taking the first chunk in the next larger bin. This can be time-consuming, but not as bad as searching a serial heap with no bins. Each upper bin is, in effect, a sub heap.

The problem with dlmalloc and its variants is that the number of bins and trees is fixed. This may be acceptable for the main heap, but it is not a good match for partition heaps. A given partition may use only a few block sizes, hence it needs only a small SBA, if any, and only a few large block sizes. Thus a configurable bin structure is highly desirable for partition heaps, both for memory efficiency and for performance. For example, the following bin structure might work well for a network partition:

u32 const binsz2[] =
/*bin  0    1     2     3     4    end */
      {24, 512, 1024, 1526, 1534,  -1};

In this case, there is no SBA. The first bin covers all chunk sizes from 24 to 504 [2]. The next two bins cover all chunk sizes up to 1518. Bin 3 has a single chunk size of 1526, which can hold the maximum Ethernet frame size of 1518 bytes. Like an SBA bin, bin 3 requires no searching – the first chunk is taken or placed. Bin 4 is the top bin and it contains all chunk sizes from 1534 on up.

An even simpler partition might contain only one bin:

u32 const binsz1[] =
/*bin  0   end */ 
      {24, -1};

This could be a partition that allocates mostly permanent blocks. In this case, the chunks would come primarily from the top chunk [3] and seldom from bin 0.
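To illustrate how such a size table maps a chunk size to a bin, a lookup might be sketched as follows. This is an assumption-laden sketch: `bin_index` is an illustrative name, and a real heap would use the size as a direct index for the SBA range rather than a linear search.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

/* Bin size table from the text: bin i holds chunk sizes from
   binsz[i] up to binsz[i+1] - 8 (chunk sizes are spaced 8 apart). */
u32 const binsz2[] =
/*bin  0    1     2     3     4    end */
      {24, 512, 1024, 1526, 1534, (u32)-1};

/* Illustrative lookup: return the bin number for chunk size csz. */
int bin_index(u32 const binsz[], u32 csz)
{
    int i = 0;
    while (binsz[i + 1] != (u32)-1 && csz >= binsz[i + 1])
        i++;
    return i;   /* last bin (the top bin) catches all larger sizes */
}
```

With `binsz2`, a 504-byte chunk lands in bin 0, a 1526-byte chunk in the single-size bin 3, and anything from 1534 up in the top bin 4.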

Large Bin Sorting

Generally, embedded systems have significant idle time in order to be able to handle peak loads. Large bin sorting can be done during idle times. If a large bin is sorted, the first big-enough chunk is also the best-fit chunk in the bin. Assuming the rule is to always take the first big-enough chunk, sorting reduces unnecessary chunk splitting and saves bigger chunks for larger requests. (Note: Finding the best-fit chunk in an unsorted bin usually takes too much time.)
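Idle-time sorted insertion might be sketched as follows, assuming a singly linked bin list for brevity (real bins are doubly linked, and `bin_insert_sorted` and `bin_find` are illustrative names):

```c
#include <assert.h>
#include <stddef.h>

/* Free chunk with a size and a bin link (illustrative). */
typedef struct fccb {
    size_t sz;
    struct fccb *next;
} FCCB;

/* Keep a large bin sorted by ascending size, so a later
   first-fit search also yields the best fit in the bin. */
void bin_insert_sorted(FCCB **head, FCCB *c)
{
    while (*head && (*head)->sz < c->sz)
        head = &(*head)->next;
    c->next = *head;
    *head = c;
}

/* First-fit search: in a sorted bin, the first big-enough
   chunk is also the best-fit chunk. */
FCCB *bin_find(FCCB *head, size_t sz)
{
    while (head && head->sz < sz)
        head = head->next;
    return head;
}
```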

Merge Control

When a chunk is freed it may be merged with adjacent free chunks. dlmalloc has no merge control – freed chunks are always merged with adjacent free chunks. This tends to defeat the purpose of bins – it results in removing a free chunk from a lower bin, merging it with the freed chunk, and placing the resulting chunk into a higher bin. The next time this chunk size is needed, its bin may be empty and it may be necessary to get a larger chunk and split it. Hence two unnecessary operations have occurred: merge and split. This hurts performance.

A better policy for embedded systems is not to merge until a certain upper threshold is reached, such as HEAP_USE_MAX, and then to merge all freed blocks until a lower threshold is reached, such as HEAP_USE_MIN. This operates like a thermostat. Other criteria could be used, such as the number of free blocks, total free block bytes, etc. In defense of dlmalloc, research has shown (see Ref. 2) that there is no universal solution to avoiding allocation failures due to fragmentation; hence, merge control may be risky and may need to be turned off. However, embedded systems, and particularly individual partitions, tend to have very regular behaviors, so merge control probably will not cause allocation failures for them.
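The thermostat-like behavior described above can be sketched as a simple hysteresis check. The threshold names come from the text; the percentages, the `Heap` struct, and `should_merge` are illustrative assumptions.

```c
#include <assert.h>

/* Illustrative usage thresholds, in percent of heap in use. */
enum { HEAP_USE_MAX = 75, HEAP_USE_MIN = 50 };

typedef struct {
    int use_pct;   /* current heap usage, percent */
    int merging;   /* 1 while merge-on-free is enabled */
} Heap;

/* Thermostat: start merging freed chunks when usage reaches
   HEAP_USE_MAX; keep merging until it drops to HEAP_USE_MIN. */
int should_merge(Heap *h)
{
    if (h->use_pct >= HEAP_USE_MAX)
        h->merging = 1;
    else if (h->use_pct <= HEAP_USE_MIN)
        h->merging = 0;
    return h->merging;
}
```

The key point is the hysteresis: between the two thresholds, merging stays in whatever state it last entered, avoiding rapid on/off cycling.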

Heap Recovery

In the event of an allocation failure, a heap recovery function is automatically called. It traces through the heap to find enough adjacent free blocks to satisfy the request, merges them, and returns control to the allocation function. Thus the heap hiccups but does not fail.
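A recovery pass of this kind might be sketched as follows, using a minimal chunk with only a size, an inuse flag, and a forward link (`heap_recover` is an illustrative name, and a real implementation would also update bin membership and backward links):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal chunk for illustration. */
typedef struct chunk {
    size_t sz;
    int inuse;
    struct chunk *fl;   /* forward link */
} Chunk;

/* On allocation failure, trace the heap and merge runs of adjacent
   free chunks until one satisfies the request. */
Chunk *heap_recover(Chunk *start, size_t need)
{
    for (Chunk *c = start; c != NULL; c = c->fl) {
        if (c->inuse)
            continue;
        while (c->fl && !c->fl->inuse) {   /* merge the free run */
            c->sz += c->fl->sz;
            c->fl = c->fl->fl;
        }
        if (c->sz >= need)
            return c;   /* big enough: a hiccup, not a failure */
    }
    return NULL;
}
```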

Aligned Blocks

Dynamically allocated protected blocks and protected messages must be aligned on power-of-two boundaries in order to be used as MPU regions. An efficient process for doing this is as follows:

  1. Find the first large-enough free chunk for the desired block size, sz.
  2. Find the first alignment boundary, 2^n, inside the chunk’s data block.
  3. Test if the remainder of the chunk is >= sz.
  4. If not, go on to the next large-enough chunk.
  5. When an acceptable chunk is found, put its Inuse Chunk Control Block (ICCB) below the boundary – i.e. just below the aligned data block.
  6. The resulting space below the ICCB is called free space, and it is handled as follows:
    a. If the preceding chunk is free, combine the free space with it.
    b. Else, if the free space is large enough, make it into a free chunk.
    c. Else, combine the free space with free space at the end of the preceding inuse chunk, and if the result is big enough, make it into a new free chunk.
  7. Split off space after the block, if large enough for a free chunk, else make it free space.

This process is illustrated in Figure 1. In this figure, ICCB = Inuse CCB, FCCB = Free CCB. In this case, option 6b has been taken and a small free chunk has been formed below the new data block. Some spare space is left at the top of this chunk because the space is not large enough to form a free chunk. Over a period of time, the heap will start to be organized into chunks having aligned data blocks, and aligned allocations will become faster.
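The boundary search in steps 2–4 reduces to simple address arithmetic, sketched below. The names `align_up` and `chunk_fits_aligned` are illustrative, and bare addresses stand in for real chunk pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round an address up to the next multiple of align (a power of two). */
uintptr_t align_up(uintptr_t p, uintptr_t align)
{
    return (p + align - 1) & ~(align - 1);
}

/* Steps 2-3: find the first 2^n boundary inside a chunk's data block
   (start = data, length = len) and test whether sz bytes still fit
   after it. On success, the aligned block address is returned via
   *blk; on failure, move on to the next large-enough chunk (step 4). */
int chunk_fits_aligned(uintptr_t data, size_t len, size_t sz,
                       uintptr_t align, uintptr_t *blk)
{
    uintptr_t b = align_up(data, align);
    if (b + sz <= data + len) {
        *blk = b;
        return 1;
    }
    return 0;
}
```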

Figure 1: Aligned Allocation. (Source: Author)

Region Blocks

The foregoing is adequate for v8M MPU regions when specifying alignment and size as multiples of 32, but v7M MPUs require additional steps:

  1. Determine the region size as the next larger power of two. For example, if sz = 630, then region size = 1024.
  2. Determine the subregion size and the number of contiguous subregions needed. In the example, subregion size = 128, and 5*128 = 640 > 630, so N = 5.
  3. Do aligned search steps 1 – 3 with alignment = 128 and size = 640.
  4. After step 3: Verify that all N subregions are in the same region – i.e. find the next region boundary (e.g. multiple of 1024) and verify that the last subregion ends before it.
  5. Do steps 5–7.

This results in a subregion-aligned v7M region block that is contained in contiguous subregions within a region, as shown in Figure 2:

Figure 2: Region Allocation. (Source: Author)

In this example, subregions 1-5 will be enabled, and subregions 0, 6, and 7 will be disabled. Note how the disables protect the surrounding heap CCBs and spare space. Because the region block is only subregion-aligned, it is much easier to find and causes less heap disruption.
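The arithmetic in steps 1, 2, and 4 can be sketched as follows, using the sz = 630 example from the text (function names are illustrative; a real v7M MPU driver would also compute the SRD disable bits):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

/* Step 1: region size is the next power of two >= sz (v7M minimum 32). */
u32 region_size(u32 sz)
{
    u32 r = 32;
    while (r < sz)
        r <<= 1;
    return r;
}

/* Step 2: subregions are region/8; N contiguous subregions cover sz. */
u32 num_subregions(u32 sz, u32 subsz)
{
    return (sz + subsz - 1) / subsz;
}

/* Step 4: verify all N subregions end before the next region boundary
   following the subregion-aligned start address. */
int subregions_fit(u32 start, u32 region, u32 subsz, u32 n)
{
    u32 next = (start / region + 1) * region;
    return start + n * subsz <= next;
}
```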

Chunk Types

Figure 3 shows three types of chunks, each with a different Chunk Control Block (CCB), shown in orange. All three CCBs have a forward link to the next CCB and a backward link to the previous CCB. In addition, the free CCB has the chunk size, forward and backward bin links, and the bin number. This requires 24 bytes, so that is the minimum chunk size. The inuse CCB requires 8 bytes for its links, so the smallest data block is 24 – 8 = 16 bytes.
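The inuse and free CCB layouts described above might look as follows. Field names are illustrative, and the links are shown as 32-bit values to match MCU pointer size, so the stated sizes hold regardless of the host compiling this sketch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

/* Inuse CCB: forward and backward heap links only -> 8 bytes. */
typedef struct {
    u32 fl;   /* forward link */
    u32 bl;   /* backward link */
} ICCB;

/* Free CCB adds chunk size, bin links, and bin number -> 24 bytes,
   which is therefore the minimum chunk size. */
typedef struct {
    u32 fl, bl;     /* heap links */
    u32 sz;         /* chunk size */
    u32 ffl, fbl;   /* forward and backward bin links */
    u32 binno;      /* bin number */
} FCCB;
```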

The third type of chunk is a debug chunk. The debug CCB is an inuse CCB with chunk size, time of allocation, owner, and a fence added. In addition the data block has N fences above and below it. These things are useful during debugging to find memory leaks, block overflows, and other problems. Whether an inuse chunk or a debug chunk is generated by an allocation depends upon whether the heap’s debug mode is off or on, respectively. This permits limiting debug chunks to code of interest, which is useful since they can be much larger than inuse chunks.

Figure 3: Chunk Types. (Source: Author)

Integrated Block Pools

16 bytes is rather large for many C++ objects. To alleviate this problem, smaller block pools can be integrated with the heap. Then allocations of less than 16 bytes are taken from block pools. This is very fast and commensurate with the needs of object-oriented code. A useful by-product of integrating block pools into a heap is that if a block pool runs out of blocks, the heap is used instead. This may result in slower performance, but the system does not break. When freed, blocks go back to their source.
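The pool-with-heap-fallback behavior might be sketched as follows. The pool geometry, `small_alloc`, and `small_free` are illustrative assumptions, and `malloc`/`free` stand in for the heap's own allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_BLK 8   /* block size, bytes (illustrative) */
#define POOL_N   4   /* blocks in the pool (illustrative) */

typedef union blk {
    union blk *next;              /* link while on the free list */
    unsigned char data[POOL_BLK];
} Blk;

typedef struct {
    Blk mem[POOL_N];
    Blk *free_list;
} Pool;

void pool_init(Pool *p)
{
    p->free_list = NULL;
    for (int i = 0; i < POOL_N; i++) {
        p->mem[i].next = p->free_list;
        p->free_list = &p->mem[i];
    }
}

/* Small allocations come from the pool; if the pool is empty (or the
   request is too big), fall back to the heap - slower, but no failure. */
void *small_alloc(Pool *p, size_t sz)
{
    if (sz <= POOL_BLK && p->free_list) {
        Blk *b = p->free_list;
        p->free_list = b->next;
        return b;
    }
    return malloc(sz);
}

/* When freed, blocks go back to their source: the block's address
   tells us whether it came from the pool or from the heap. */
void small_free(Pool *p, void *v)
{
    Blk *b = (Blk *)v;
    if (b >= p->mem && b < p->mem + POOL_N) {
        b->next = p->free_list;
        p->free_list = b;
    } else {
        free(v);
    }
}
```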

It is likely in modern embedded systems that some partitions (especially third party software) will be written exclusively in C++. A heap with integrated block pools may result in much better performance for the partition and may use less memory.

Need for Mutexes

All RTOSs use some mechanism to protect critical sections of system service routines. Whatever method is used, the net result is that no other task can run during these critical sections. This approach is not workable for multiple heaps that share the same heap code. Instead, a mutex per heap must be used, so that only one task at a time can access a given heap while higher-priority tasks can still preempt it and access other heaps.
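The per-heap mutex structure might be sketched as follows. The busy-wait `Mutex` below is only a host-testable stand-in for a real RTOS mutex, which would block the task and preferably use priority inheritance; `heap_malloc` delegates to `malloc` in place of the real allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for an RTOS mutex (illustrative only). */
typedef struct { volatile int locked; } Mutex;

void mutex_lock(Mutex *m)
{
    while (m->locked)
        ;   /* a real RTOS would block the task here */
    m->locked = 1;
}

void mutex_unlock(Mutex *m)
{
    m->locked = 0;
}

/* One mutex per heap: a task inside heap A blocks only other users
   of heap A; a higher-priority task can still preempt and use heap B. */
typedef struct {
    Mutex lock;
    /* bins, chunk list, etc. would follow */
} Heap;

void *heap_malloc(Heap *h, size_t sz)
{
    mutex_lock(&h->lock);
    void *p = malloc(sz);   /* stand-in for the real allocator */
    mutex_unlock(&h->lock);
    return p;
}
```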


As can be seen from the foregoing, the heap choice is an important part of partitioning a system. The right heap can improve performance and reduce code rewriting. Providing dynamic regions is also an important heap requirement. In the next article in this series, we look into partition portals to see how they work and how they are used.


  1. Doug Lea, A Memory Allocator, December 1996.
  2. Paul R. Wilson, et al. Dynamic Storage Allocation: A Survey and Critical Review, September 1995.


[1] This discussion is not applicable to dlmalloc, which uses trees; trees are much more complicated than large bins.

[2] Keep in mind that block sizes are 8 bytes less than chunk sizes, and that chunk sizes are spaced 8 bytes apart.

[3] A heap starts out as a donor chunk for the SBA and a top chunk for upper bin chunk sizes. If there is no SBA, there is no donor chunk and the whole heap starts out as a single top chunk.

Ralph Moore is a graduate of Caltech. He and a partner started Micro Digital Inc. in 1975 as one of the first microprocessor design services. In 1989 Ralph decided to get into the RTOS business and he architected the smx RTOS kernel. After 20 years of selling Micro Digital products and managing the business, he went back into product development. Currently he does the whole job from product definition, architecture, design, coding, debug, documentation, patenting, to promotion. Recent products include eheap, SecureSMX, and FRPort. Ralph has three children and six grandchildren and lives in Southern California.
