Dealing with memory access ordering in complex embedded designs

Things used to be so simple in the embedded world. For most of us, the systems for which we develop these days are orders of magnitude more complex than the ones we were using even five years ago.

As embedded systems chase ever higher performance, processor designers reach deeper and deeper into the toolbox for microarchitectural innovations. Many of these, mercifully, are transparent to the programmer. The challenge for us is that many are not transparent; we need to be aware of what is going on and write our software in different ways. In some cases, we are merely missing out on improved performance, but in many cases, existing software techniques simply won't work properly unless we take into account some of the new ways in which modern embedded systems function.

The area I address in this article is memory accesses, specifically the order in which they happen. The simple act of loading, storing, and transferring data between processor and memory is much more complex than it used to be. Consider the following simple example:

     LDR r0, [r1]     ; read UART data input register
     STR r4, [r1, #4] ; write control register
     STR r0, [r1, #8] ; write to UART data output register

In some imaginary peripheral, we are reading input data, carrying out a control operation, and then writing the data out again through a different port. Presumably, the data sheet tells us that the control operation must be complete before the data is written out. On a simple system, we can guarantee that the memory operations will complete in the order they appear in the program and that each will complete before the next is started.

On today’s systems, that isn’t true any more.

Caches could mean that the LDR is satisfied by a cache access and never accesses the real hardware. Write buffers could mean that the STRs don't happen in the order in which they are written. Both effects, coupled with the compiler re-ordering instructions, could mean that the first STR is executed before the LDR or after the second.

Clearly, we wouldn’t be so dumb as to access memory-mapped peripheral registers via a cache (at least, I hope so!) but write buffers can catch us unawares.

Figure 1 shows what a simple system might look like.

Figure 1: In a single processor system connected to a number of memory devices, none are able to act independently on the system bus, nor talk directly to each other.

A single processor connects via a single system bus to a variety of memory devices. None of these devices are capable of acting independently on the bus (i.e. they do not initiate bus accesses of their own), nor do they talk directly to each other.

Such a system is relatively simple to manage. In the absence of effects within the processor, memory coherency is easy to manage, memory accesses occur in order, and so on. It obeys the “Sequential Execution Model” (SEM). In short, things happen in the order in which you write them.

A more complex system, on the other hand, might look like what is shown in Figure 2.

Figure 2: In a complex system with multiple CPUs linked to multiple components over a multilayer bus matrix, several are able to act autonomously and independently, take control of the bus, and talk to each other.

Here we have multiple processors talking to multiple system components via a multi-layer bus matrix. Amongst the system components are several which are capable of acting autonomously and independently of the processor. They can access data independently and take control of the bus without the intervention of the processor. They can also talk directly to each other.

There are many opportunities in such a system for the SEM to break down. There are multiple software threads executing on multiple processors, and there are caches, buffers, and autonomous memory access units such as DMA controllers.

Analyzing everything going on in such a system is beyond the scope of this article, but we will look at the most common behaviors and what you need to do about them in software.

There are two distinct effects which we need to consider:

  • Compiler behavior
  • System behavior

We need to write our software so that the compiler produces what we expect. But we also need to know how the system behaves so we can ensure that output code actually has the desired effect.

Compiler behavior
Compilers are bound by a strict and specific Sequential Execution Model of their own, which applies at the level of high-level language statements and is propagated through to output machine instructions, executing on a tightly-defined virtual model of an idealized target machine. But this model breaks down when the Sequential Execution Model doesn't apply to the real machine on which the program is executed. There are lots of reasons why this might be the case.

The use of the "volatile" keyword is nothing more than an indication to the compiler that a particular value, held in memory, may change when it's not looking. The classic example would be a memory-mapped peripheral register representing a FIFO buffer. Every time you read it, you get the next input data item.

     volatile int *fifo;   // points at the memory-mapped FIFO register
     int input;

     while (1)
        input = *fifo;     // volatile forces a real read every time

Without the volatile keyword in the declaration of the FIFO item, the compiler would be free to cache the value after reading it once and simply reuse the value without ever reading memory again.

The volatile declaration tells the compiler that it can't cache the value and so has to physically read memory every time. (We are assuming here that there isn't a cache in between the processor and this particular memory location.)

But we also need to apply "volatile" to any global variable which may be changed by an interrupt handler. Likewise, such variables may change when the compiler isn't looking.

        status = 1;
        while (status == 1)
           //do stuff …

        // ... elsewhere, in the interrupt handler:
        status = 0;

If status is not declared as volatile, this program won't necessarily work, as the compiler is free to assume that it never has to read it in the main loop after setting it in the first line.


System behavior
We can divide this into the behavior of the processor itself (pipeline and execution units), the memory system (write buffers, caches, external memory systems), and, finally, multiprocessing systems.

Modern processors employ increasingly complex pipelines to maximize instruction throughput. The latest ARM cores, for instance, incorporate superscalar, out-of-order pipelines with multiple execution units. This means that, regardless of what the compiler produces, the processor itself can execute instructions in a different order to that in which they appear in the program.

The processor has to be able to satisfy itself that there are no data, address, or resource dependencies between the instructions but, if it can do that, it's pretty much free to do what it wants. The key restriction is that the Sequential Execution Model has to appear to apply, but the processor can only ensure it applies to what it can see, which is itself and some closely-coupled components. It knows nothing about the rest, so it cannot ensure that the model applies when extended outwards into the system. Consider the following sequence of instructions:

     add r0, r0, #4
     mul r2, r2, r3
     str r2, [r0]
     ldr r4, [r1]
     sub r1, r4, r2
     bx lr

If we execute this on a simple in-order processor, we might see something like this:

     Cycle  Instruction
     0      add r0, r0, #4
     1      mul r2, r2, r3
     2      *stall*
     3      str r2, [r0]
     4      ldr r4, [r1]
     5      *stall*
     6      sub r1, r4, r2
     7      bx lr

On an out-of-order processor, we might see this:

     Cycle  Instruction
     0      mul r2, r2, r3
     1      ldr r4, [r1]
     2      str r2, [r0]
     3      sub r1, r4, r2
     4      bx lr

The processor has re-ordered the execution sequence to allow the LDR to start execution prior to the STR. This gives it more time to complete and reduces the overall latency of the sequence. It can do this provided it is satisfied that there is no dependency between the two instructions. Clearly, in this case, that's true.

Or is it?

What if the STR was accessing a peripheral register and that access needed to be complete before the LDR is executed? In that case, the program would not function correctly. We need to fix this by inserting a memory barrier between the two instructions. A barrier is an instruction which tells the processor that it needs to ensure, with some degree of strictness, that outstanding memory accesses are complete before it continues. So, we would need to write this:

     add r0, r0, #4
     mul r2, r2, r3
     str r2, [r0]
     dmb             ; ensure the STR completes before the LDR
     ldr r4, [r1]
     sub r1, r4, r2
     bx lr

The DMB (data memory barrier) instructs the processor to ensure that no more memory accesses take place before all prior accesses have completed. This includes things like flushing write buffers and so on.

ARM memory barrier instructions
Particular memory barrier instructions implemented in current ARM systems:

DMB – Data memory barrier: ensures that no more memory accesses occur until all outstanding accesses have completed. It does not stop the processor continuing to execute instructions, as long as they don't cause memory accesses.

DSB – Data synchronization barrier: causes the processor to stall, without executing any further instructions, until all outstanding memory accesses have completed.

ISB – Instruction synchronization barrier: causes the instruction pipeline and any prefetch buffers/queues to be flushed, so that subsequent instructions are refetched.

Actually, some of this can be fixed more easily on ARM systems by making use of the architectural memory types. Architecture ARMv6 onwards supports something called a "weakly-ordered memory model". This means that the processor and memory system are free to use all kinds of tricks to hide memory latency. This includes:

  • Speculative accesses: These are used heavily by the instruction fetcher to fetch ahead of the current execution point and to speculatively fetch multiple possible instruction sequences following a conditional branch. They can also be generated by the data-side memory interface to speculatively load data into the cache based on observations of repeating access patterns at runtime. Note also that cache line fills are speculative memory accesses in the sense that the loaded data may never be used.
  • Merging memory accesses: Many write buffers automatically merge multiple accesses to consecutive or overlapping addresses into single or burst transactions.
  • Re-ordering memory accesses: Where there are no data dependencies among a group of transactions, the memory system is free to carry out these accesses in the most efficient order.
  • Repeating memory accesses: In some circumstances, the system will repeat accesses. In ARM systems, this can occur if an LDM or STM instruction is interrupted (in low-latency interrupt mode). When this happens on an ARMv7-A/R processor, the access is restarted on return from the exception handler, which means that some of the accesses may be repeated.
  • Changing memory access size and number: If it is more efficient for the memory system to carry out an access of a different size than that specified by the program and then to extract the necessary portion of the loaded value before returning it to the processor, it is free to do so. It is also free to split large accesses into multiple smaller ones, if this is more efficient for the memory system.

Clearly, systems are only allowed to do this when the effects are not observable to the executing program. And this is generally only true when memory accesses have no side effects. For most memory accesses, this is true. It is not true for accesses to memory-mapped peripherals. To make this distinction, the ARM architecture defines different types of memory.

ARM memory types
Memory types as defined in ARMv6 and ARMv7 architectures:

Normal memory is "weakly-ordered". Instruction memory is always Normal (the architecture actually requires this) and it is also used for the vast majority of program data. Normal memory regions may be cached and may use a write buffer.

Device memory obeys a much more strictly ordered memory model. In particular, memory access size, number, and order must be preserved, and accesses may not be repeated. Speculative accesses are not permitted. Device memory is generally used for memory-mapped peripherals and any other addresses where accesses have side-effects. It may not be cached but, since write buffers are permitted, a Device access may "complete" before it reaches the addressed device.

Strongly-ordered memory is even stricter than Device memory and is used only to support legacy systems where memory ordering is a particular problem. Write buffers are not permitted, and a Strongly-ordered access "completes" only when it reaches the addressed device.

On devices with an MMU (Cortex-A), the memory region definitions are contained within the page descriptors in the MMU page tables. On those with an MPU (all Cortex-R and some Cortex-M), they are contained in the MPU region attributes. Cortex-M devices that do not have an MPU have a fixed memory map in which the type of each region is fixed.

From the descriptions of these memory types, you can see that defining memory-mapped peripheral regions as Device memory solves many of the problems we have been talking about. Indeed, when working on ARM systems, it is pretty much mandatory that you do this!

A common situation is the need to reset a hardware device before reading its status. You might write something like this:

     volatile uint32 control; // write register to reset device
     volatile uint32 status;  // read register to access status
     uint32 x;

     control = 1; // reset device

     // some code

     x = status; // read status
     while ((x & 1) != 1)
          x = status;

You might think all is well. At least the programmer has declared the memory-mapped peripheral registers using the volatile keyword. But presumably, the write to the control register should complete before we read the status register. Otherwise the device will not be reset properly before we access it. This code does not guarantee that.

For reasons we have seen above, the compiler may promote the LDR from the status port above the STR to the control port because load latency is longer than store latency. Similarly, a multi-issue, out-of-order execution unit in the processor might issue them in a different order. We need some way of ensuring that they happen in the order in which they are written. In C, we can use an intrinsic function to insert a suitable memory barrier.

     volatile uint32 control; // write register to reset device
     volatile uint32 status;  // read register to access status
     uint32 x;

     control = 1; // reset device
     __dmb(0xF);  // barrier intrinsic (ACLE DMB SY): complete the write first

     // some code

     x = status; // read status
     while ((x & 1) != 1)
          x = status;

There are other cases where you have to be careful. For instance, the system only ensures that memory accesses are complete from the point of view of the processor. It may be that certain memory accesses either have side effects that are beyond the knowledge of the processor or that they take time to occur. Let's look at some instances of this:

When finishing up an interrupt handler, one key operation is to clear down the peripheral that is signaling the interrupt. Clearly you need to do this before returning, otherwise you'll get re-interrupted immediately. You might use code like this:

str r1, [r0] ; write interrupt clear register
dsb          ; ensure operation complete
rfe sp!

Without the DSB instruction, the processor would execute the RFE immediately after the STR. Since write buffering is permitted for Device memory accesses, the interrupt handler could exit before the interrupt is actually cleared. The DSB instruction ensures that the processor stalls until the memory write is complete.

It is fairly common on ARM systems to reconfigure the memory system address map at runtime. In the past, we might have done this via a memory-mapped peripheral that modified the behavior of the address decode logic, like this:

     ldr r0, =REMAP   ; address of remap control register
     str r1, [r0]     ; reconfigure address map
     dsb              ; ensure write has completed
     isb              ; refetch subsequent instructions in the new context
     b NewCode

The DSB is required to ensure that the STR operation is complete (to the memory system and not just the processor) before the program continues. The ISB ensures that subsequent instructions are fetched in the new context. You might think that putting the REMAP port in Device memory would solve this problem, but it doesn't. Crucially, the ordering rules for Device memory accesses only apply with respect to other Device memory accesses. Normal memory accesses (including instruction fetches) can be moved around among them without hindrance. So, the DSB is important to ensure that subsequent instruction fetches do not take place until the operation is complete.

These days, we do it more often by reprogramming the MMU. That requires some TLB maintenance too.

     str r11, [r1]  ; update page table
     dsb            ; ensure write has completed
     tlbimva r10    ; invalidate affected TLB addresses
     bpiall         ; flush branch predictor
     dsb            ; ensure completion of both operations
     isb            ; synchronize instruction context

The DSB instructions are required to ensure that the side effects of the context-changing operations (updating the page tables and invalidating the TLB) are complete before the program continues. The ISB instruction is required to ensure that all subsequent instructions are loaded in the new context rather than the old.

Moving from single-core, single-threaded systems to multi-threaded, multicore systems opens another can of worms!

It is obviously important that multiple processors that share memory have a consistent and coherent view of the contents of that memory. Suppose one processor in a system updates two memory locations, X and Y, in that order. We might assume that other processors reading Y then X would read either:

  • New values for both X and Y
  • The old value for Y and the new value for X
  • The old values for both X and Y

We should also be able to assume that they will NEVER see the new value of Y and the old value of X. We used to be able to make that assumption, when the SEM held across whole systems, but we cannot make that assumption any more.

In multicore/multi-processing systems, the processors almost certainly share regions of memory through which they communicate, share data, and pass messages. This could be a common heap, an operating system message pool, or a frame buffer shared between an application processor and a GPU.

In the following sequence, in which order are A and B loaded from memory?

     LDR r0, [A]
     LDR r1, [B]
     ADD r2, r0, r1
     STR r2, [C]

If B is cached and A is not, then B may actually be loaded before A (in the sense that the LDR for B will complete before the LDR for A). This may not matter, but it may do so if either variable is being updated by an external agent. If the values are in some way correlated in time, then this will cause problems.

In order to provide controlled access to critical sections, we might implement some kind of lock or guard code that might look like this:

     LDR r0, [S]        ; load shared variable
     ADD r2, r0, #1     ; update it
     STR r2, [S]        ; write it back
     MOV r3, #0
     STR r3, [LOCK]     ; release the lock

This assumes that S is updated in memory before LOCK. That may not be true if S is cached and LOCK is not, or if a write buffer chooses to re-order the writes. You might be tempted to fix this problem by placing the lock variable in Device memory (or possibly Shareable Device memory) but placing all shared memory in Device regions is going to have unacceptable effects on performance.

The solution is either to use memory barriers or to employ a more robust form of locking. ARM provides exclusive access functionality via the LDREX/STREX pair of instructions. When used together, these allow a programmer to implement robust lock constructs as used by most popular operating systems.

Load and store exclusive
ARM’s LDREX and STREX exclusive access instructions:

LDREX – The load exclusive instruction carries out a load from an addressed memory location and also flags that location as reserved for exclusive access. The flag will be cleared by a subsequent store to that location.

STREX – The store exclusive instruction stores from a register to an addressed memory location and returns a value indicating whether the addressed location was reserved for exclusive access. If it wasn't, the store doesn't take place and memory is unchanged. The exclusive reservation is cleared regardless of whether the store succeeds or not.

CLREX – Clear exclusive is intended for use in context switches; the CLREX instruction clears any exclusive access reservations in the memory system.

Since an exclusive reservation is cleared by any subsequent store, exclusive or not, these instructions can be used by a lock construct, such as a mutex, to set a new value for a lock variable only if no other program has done so since this particular program checked its value.

A lock routine might look like this:

     ; void lock(lock_t *addr)
     lock
        LDREX r1, [r0]     ; check current lock value
        CMP   r1, #LOCKED
        BEQ   lock         ; locked by someone else: try again

        MOV   r1, #LOCKED
        STREX r2, r1, [r0] ; attempt to claim the lock
        CMP   r2, #0       ; if the store failed, try again
        BNE   lock

        DMB                ; ensure lock is held before the critical section
        BX    lr

And the corresponding unlock function:

     ; void unlock (lock_t *addr)
        DMB               ; ensure accesses have completed
        MOV r1, #UNLOCKED
        STR r1, [r0]
        BX lr

Note that an STREX is not required to clear the lock, only when testing and setting it.


Finally, consider this pair of functions, where one thread executes somefunc() while another executes otherfunc():

     int flag = BUSY;
     int data = 0;

     int somefunc(void)
     {
        while (flag != DONE)
           ;
        return data;
     }

     void otherfunc(void)
     {
        data = 42;
        flag = DONE;
     }

What would you expect somefunc() to return? 42? Well, that’s possible. In a multi-threaded system, so is 0!

Chris Shore is Training Manager at ARM headquarters in Cambridge, UK. In that role for the past ten years he has been responsible for the worldwide customer technical training team, coordinating the delivery of over 80 courses per year to licensees all over the globe. He is passionate about ARM technology and, as well as teaching customers, regularly presents papers and workshops at engineering conferences. Starting out as a software engineer in 1986, his career has included software project management, consultancy, engineering management, and marketing. Chris holds an MA in Computer Science from Cambridge University. This article was presented at the Embedded Systems Conference as part of a class that Chris Shore taught on "Memory Access Ordering in Complex Embedded Systems" (ESC-231).
