Hardware (and software) implications of Endianness in SoC design - Embedded.com

Hardware (and software) implications of Endianness in SoC design

Embedded software programmers are familiar with the endianness characteristic of a computing processor. They are generally aware of how, depending on endianness, different data types are stored in memory and of the consequences of accessing individual byte locations of a multi-byte data element in memory.

In this article, we will review the concept of endianness from a software standpoint and will then discuss the implications of endianness for designers of hardware IP blocks and developers of device drivers when they work with a complex system such as today’s SOC. 

Today’s SOCs integrate many hardware IP blocks and designers need to be aware of the order of bytes on the byte lanes of connecting buses when transferring data. In a system with several discrete hardware components such as a host processor and external devices connected to it via, say, a PCI bus, the hardware components may support different endianness modes and device driver developers need to make the data transfers among these hardware components endianness-proof.

Endianness effect in software
From a software standpoint, the “endianness effect” comes into play when writing and reading data from memory. Here, we digress a little to explain the use of data types in embedded software. All high-level programming languages support several data types. For example, the C language supports data types such as char, int, long, float and so on. The data types differ from one another in their meaning (the float data type represents floating point numbers, while the int data type represents integers) as well as in the number of bytes of memory required to store their elements.

The lengths of C data types are partly fixed by the standard (such as ANSI C) and partly dependent on the compiler implementation. For example, one compiler may implement the int data type as two bytes long and the long data type as four bytes long. Another compiler may implement int as four bytes long and long as eight bytes long. (The char data type is always one byte long by definition in C.) To avoid confusion on lengths and to ensure portability across C compilers, embedded software programmers often define their own data types that explicitly give the number of bytes for the data type.

For example, the following may be defined:

typedef unsigned long uint32_t; /* unsigned type that is 32-bits long */
typedef long int32_t;           /* signed type that is 32-bits long */
typedef unsigned int uint16_t;  /* unsigned type that is 16-bits long */
typedef int int16_t;            /* signed type that is 16-bits long */
typedef unsigned char uint8_t;  /* unsigned type that is 8-bits long */
typedef signed char int8_t;     /* signed type that is 8-bits long; plain char may be unsigned */

The above definitions are specific to a C compiler/processor that implements long data type as 4 bytes long, int data type as 2 bytes long and char data type as 1 byte long. The software then uses these user defined types rather than C standard types. To make their software portable across different C compilers/processors, embedded programmers isolate these type definitions into a “port” file. 

When the software needs to work on a new compiler/processor, only the port file needs to be redone, while the software written using the user defined data types remains unchanged. For example, for a compiler that implements int data type as 4 bytes long, short data type as 2 bytes long and char data type as 1 byte long, the “port” file will have the following type definitions:

typedef unsigned int uint32_t;   /* unsigned type that is 32-bits long */
typedef int int32_t;             /* signed type that is 32-bits long */
typedef unsigned short uint16_t; /* unsigned type that is 16-bits long */
typedef short int16_t;           /* signed type that is 16-bits long */
typedef unsigned char uint8_t;   /* unsigned type that is 8-bits long */
typedef signed char int8_t;      /* signed type that is 8-bits long; plain char may be unsigned */
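
A port file can also check its own assumptions at compile time. The following is a minimal sketch of that idea, not something taken from a particular toolchain. The names port_u8, port_u16, port_u32 and PORT_STATIC_CHECK are hypothetical, and the typedefs assume a compiler where int is 4 bytes long and short is 2 bytes long. If an assumption does not hold, the negative array size makes the build fail.

```c
#include <limits.h>

/* Hypothetical port-file types; assumes int = 4 bytes and short = 2 bytes. */
typedef unsigned char  port_u8;
typedef unsigned short port_u16;
typedef unsigned int   port_u32;

/* Declares an array type of size -1 (a compile error) when cond is false. */
#define PORT_STATIC_CHECK(name, cond) typedef char name[(cond) ? 1 : -1]

PORT_STATIC_CHECK(port_u8_is_8_bits,   sizeof(port_u8)  * CHAR_BIT == 8);
PORT_STATIC_CHECK(port_u16_is_16_bits, sizeof(port_u16) * CHAR_BIT == 16);
PORT_STATIC_CHECK(port_u32_is_32_bits, sizeof(port_u32) * CHAR_BIT == 32);
```

Compilers that support C99 also provide fixed-width types such as uint32_t in the standard header stdint.h, which can replace a hand-written port file where it is available.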

To illustrate “endianness effect”, consider the following section of code:

{
  uint32_t * ptr32;
  uint8_t  var8;
  ptr32 = (uint32_t  *)malloc(sizeof(uint32_t));

/* Bookmark A */
  *ptr32 = 0x44332211;
/* Bookmark B */

/* Bookmark C */
  var8 = *(uint8_t  *)ptr32;
/* Bookmark D */

}

Guess what the value of var8 would be at Bookmark D? If you said “it depends on whether the system is little endian or big endian”, you can jump ahead to the next section of this article. If you guessed 0x44 or any other value, read on.

var8 takes the value 0x44 in a big endian system, while it takes the value 0x11 in a little endian system.

Let us say in the above code, ptr32 takes the address value 0x80000000 from the malloc. The content of byte addressable locations starting at address 0x80000000 would look as follows at Bookmark B:

Figure 1 

That is, in a little endian system, when an element of multi-byte length data type is written to memory, the least significant byte is stored in the lowest address offset of memory. In a big endian system, the most significant byte is stored in the lowest address offset of memory. (My rule of thumb for remembering the little endian/big endian difference is LLL – Little endian, Least significant byte, Lowest address.)

From the above layout in memory, it is clear that var8 gets the value 0x11 in a little endian system and value 0x44 in a big endian system.
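
The memory layout described above can be inspected on any system with a few lines of C. The sketch below is only an illustration. The function names bytes_in_memory and is_little_endian are hypothetical, and the fixed-width types are assumed to come from the C99 header stdint.h rather than from a port file.

```c
#include <stdint.h>
#include <string.h>

/* Copies the bytes of value as they sit in memory, lowest address first. */
void bytes_in_memory(uint32_t value, uint8_t out[4])
{
    memcpy(out, &value, sizeof value);
}

/* Returns 1 on a little endian system and 0 on a big endian system. */
int is_little_endian(void)
{
    uint8_t b[4];
    bytes_in_memory(0x44332211u, b);
    return b[0] == 0x11;  /* LLL: least significant byte at lowest address */
}
```

On a little endian system bytes_in_memory(0x44332211, b) fills b with 0x11, 0x22, 0x33, 0x44, and on a big endian system with 0x44, 0x33, 0x22, 0x11.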

Now consider the following section of code:

  uint32_t * ptr32;
  uint16_t  var16;
  ptr32 = (uint32_t  *)malloc(sizeof(uint32_t));

/* Bookmark E */
  *ptr32 = 0x44332211;
/* Bookmark F */

/* Bookmark G */
  var16 = *(uint16_t  *)ptr32;
/* Bookmark H */

At Bookmark H, var16 will have value 0x2211 in a little endian system and 0x4433 in a big endian system. This may seem a little tricky at first, but is consistent with the definition of little endian and big endian memory layout. When var16 gets its value from address 0x80000000 using a 16-bit (2 byte) access, the two bytes starting at location 0x80000000 are read out. These turn out to be 0x11 and 0x22 in a little endian system and 0x44 and 0x33 in a big endian system.

Further, in a little endian system, of the two bytes that are read out, the byte stored at the lower address is interpreted as the least significant byte. That is, 0x11 is interpreted as the least significant byte and 0x22 is interpreted as the most significant byte, resulting in var16 getting a value of 0x2211. Likewise, in a big endian system, the byte stored at the lower address is interpreted as the most significant byte, so 0x44 and 0x33 give var16 a value of 0x4433.
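
The two interpretations can be written out explicitly. The sketch below is an illustration only, with hypothetical function names read16_le and read16_be. It shows how two bytes at consecutive addresses are combined into a 16-bit value under each convention, independent of the endianness of the machine running the code.

```c
#include <stdint.h>

/* Little endian interpretation: byte at the lower address is least significant. */
uint16_t read16_le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Big endian interpretation: byte at the lower address is most significant. */
uint16_t read16_be(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

Applied to the little endian memory image of 0x44332211 (bytes 0x11, 0x22, ...), read16_le gives 0x2211. Applied to the big endian image (bytes 0x44, 0x33, ...), read16_be gives 0x4433.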

Dealing with endianness 
How do programmers ensure that they do not get surprises when accessing selected bytes of multi-byte length elements? Do they have to write a variant of code for each endianness mode and use a compile time flag with #ifdefs to select the proper variant? That would be cumbersome.

The way around it is to keep in mind that software sees the endianness effect only when mixing data types – specifically, when storing a certain byte-length element into memory and reading the same memory as a different byte-length element. So, the solution is that if a 32-bit element was stored at a memory address, the content at that memory address needs to be read out as 32-bit element only. Once it is read out of memory and is in a CPU register, the required bytes can be extracted from it.

The endianness-proof software variant of the above code section is:

{
  uint32_t * ptr32;
  uint32_t  var32;
  uint16_t  var16;      

  ptr32 = (uint32_t  *)malloc(sizeof(uint32_t));

/* Bookmark E */
  *ptr32 = 0x44332211;
/* Bookmark F */

/* Bookmark G */
  var32 = *ptr32; /* Read from memory as same byte length as what was used during writing */
                  /* var32 would now have 0x44332211 in both endianness modes */
  var16 = (var32 &  0xFFFFu); /* var16 would get 0x2211 */
/* Bookmark H */

}  
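
The same approach extends to any byte or half of a 32-bit value once it is in a register. The helpers below are a small sketch with hypothetical names (byte_of, low_half, high_half). They use only shifts and masks on the value, so the result does not depend on how the value was laid out in memory.

```c
#include <stdint.h>

/* Byte n of value, where n = 0 is the least significant byte (n must be 0..3). */
uint8_t byte_of(uint32_t value, unsigned n)
{
    return (uint8_t)((value >> (8u * n)) & 0xFFu);
}

/* Least significant 16 bits of value. */
uint16_t low_half(uint32_t value)
{
    return (uint16_t)(value & 0xFFFFu);
}

/* Most significant 16 bits of value. */
uint16_t high_half(uint32_t value)
{
    return (uint16_t)(value >> 16);
}
```

For 0x44332211, byte_of(value, 0) is 0x11, low_half(value) is 0x2211 and high_half(value) is 0x4433 in both endianness modes.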

Hardware implications of endianness
Today's complex SOCs have many hardware IP blocks integrated with the CPU core. The IP blocks communicate with one another and with the CPU core via buses connected to them. Each of the hardware IP blocks will have one or more buses connected to it.

Each bus may serve a specific purpose for the hardware IP block such as obtaining its configuration parameters, obtaining input data for processing or giving out the output data after processing. A bus may be designed in different widths such as 32 bit lines, 64 bit lines or 128 bit lines depending on the transfer bandwidth requirements of its hardware end points. Data transfer is done over a bus between the end points in units called transactions. 

A transaction encompasses actual data transferred, and also the address of data and any clocking/control signaling for synchronizing transmission and reception. A transaction can be a read type or a write type. The size of data transferred in one transaction is called transaction size. 

A hardware IP block can use different transaction sizes for different purposes on a single bus. The sizes are chosen at design time according to the different uses of the bus. For example, a hardware IP block may do number crunching operations on 32-bit data units and may obtain four such units in one 128-bit transaction. Apart from data, it may need a periodic but less frequent update of a 32-bit parameter value, which it may obtain using a 32-bit transaction. A transaction size can be at most the width of the bus.

For example, on a 128-bit wide bus, transaction sizes of 32-bit, 64-bit or 128-bit could be designed. Finally, by design, one of the hardware end points of a bus is a master (initiator) of transactions on the bus while the other end point is a slave (target). If a hardware IP block is the master on a bus, then it initiates the transactions on the bus to read or write the data to a target. If it is slave, then it just responds to transactions from a master. The above is generally true even with regard to discrete hardware components in a non-SOC system.

What are the endianness considerations for the hardware IP block designers and device driver developers when transferring data over the buses? Let us assume data is being transferred from memory to an IP block in transactions of size 64 bits over a 64-bit wide bus. The endianness of the bus determines how the individual bytes of data starting from a specific source address are placed on the physical byte lanes of the bus.

On a little endian bus, the data byte from the lowest source address is placed on the lowest numbered physical byte lane (bit lines 0-7):

Figure 2

On a big endian bus, the data byte from the highest source address is placed on the lowest numbered physical byte lane (bit lines 0-7):

Figure 3 

Let us say, a hardware IP block does some number crunching operations one byte at a time. Say, its input data is in memory at address 0x80000000 as follows:

Figure 4 

When the 8 bytes of data are transferred from source address 0x80000000 in a transaction of size 64 bits, the data is placed differently on little endian and big endian buses:

Figure 5 

Since the hardware IP block needs to number crunch the bytes in order (that is, 0x11 followed by 0x22 followed by 0x33 and so on), it needs to be aware of the endianness of the bus and extract the bytes from appropriate byte lanes for processing. On a little endian bus, it will need to extract the first byte from byte lane 0, the second byte from byte lane 1, the third byte from byte lane 2 and so on. On a big endian bus, it will need to extract the first byte from byte lane 7, the second byte from byte lane 6, the third byte from byte lane 5 and so on.
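
The lane selection can be modeled in software. The sketch below is a simplified model, not a description of any particular bus. It assumes one 64-bit transaction, source bytes 0x11 through 0x88 at ascending addresses, and hypothetical function names place_on_bus and byte_from_bus.

```c
#include <stdint.h>

#define BUS_BYTES 8  /* 64-bit bus: byte lanes 0..7 */

/* Models one 64-bit transaction. mem[i] is the byte at source address + i,
 * lanes[n] is the byte on byte lane n (bit lines 8n..8n+7). */
void place_on_bus(const uint8_t mem[BUS_BYTES], int big_endian_bus,
                  uint8_t lanes[BUS_BYTES])
{
    for (int i = 0; i < BUS_BYTES; i++) {
        int lane = big_endian_bus ? (BUS_BYTES - 1 - i) : i;
        lanes[lane] = mem[i];
    }
}

/* Returns the k-th byte in source-address order (k = 0 is the lowest address). */
uint8_t byte_from_bus(const uint8_t lanes[BUS_BYTES], int big_endian_bus, int k)
{
    return lanes[big_endian_bus ? (BUS_BYTES - 1 - k) : k];
}
```

With this model, the byte 0x11 from the lowest address lands on lane 0 of a little endian bus and on lane 7 of a big endian bus, and byte_from_bus returns the bytes in processing order in both cases.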


Now, instead of processing one byte at a time, let us say that the hardware IP block was designed to number crunch 32 bits as a unit. So, it needs to obtain the first 32-bit data unit to process first, followed by the second 32-bit data unit. In this case, it is reasonable to assume that input data was also prepared in 32-bit data units (by, say, software) and written into memory as 32-bit data units. Let us say software writes input data 0x44332211 and 0x88776655 in that order into memory at address 0x80000000. The memory snapshot looks as follows for little endian and big endian systems:

Figure 6 

When the 8 bytes of data are placed on the bus as part of a 64-bit transaction, the data appears as follows for little endian and big endian buses:

Figures 7 and 8

When the hardware IP block extracts the two 32-bit data units on a little endian bus, it will need to extract the first data unit (0x44332211) from the lower four byte lanes (0-3) and the second data unit (0x88776655) from the higher four byte lanes (4-7). On a big endian bus, it is the other way around. Note that because software stored input data in memory in 32-bit data units, we did not have to worry about the byte order within each 32-bit data unit. Between the two endianness modes, the byte order within a 32-bit data unit is reversed when storing in memory and is reversed again when placing on the bus.
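
The extraction of 32-bit units from the byte lanes can be sketched as follows. This is a simplified software model with hypothetical function names (unit32_from_lanes, unit32_from_bus). It assumes that within a group of four adjacent lanes, the lowest numbered lane carries the least significant byte of the unit.

```c
#include <stdint.h>

/* Value carried on four adjacent byte lanes; first_lane holds bits 7..0. */
uint32_t unit32_from_lanes(const uint8_t lanes[8], int first_lane)
{
    return  (uint32_t)lanes[first_lane]
         | ((uint32_t)lanes[first_lane + 1] << 8)
         | ((uint32_t)lanes[first_lane + 2] << 16)
         | ((uint32_t)lanes[first_lane + 3] << 24);
}

/* k = 0 is the unit that software wrote first, k = 1 the second (k must be 0 or 1). */
uint32_t unit32_from_bus(const uint8_t lanes[8], int big_endian_bus, int k)
{
    /* Little endian bus: first unit on lanes 0-3. Big endian bus: on lanes 4-7. */
    int first_lane = big_endian_bus ? (4 - 4 * k) : (4 * k);
    return unit32_from_lanes(lanes, first_lane);
}
```

Only the choice of lane group depends on the bus endianness. The bytes inside each unit need no further reordering, which matches the double reversal described above.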

Thus, when a hardware block obtains input data or gives out output data on a bus, there are several considerations that the hardware IP block designers need to be aware of:

  • What is the unit size of data being processed in the hardware IP block?
  • In case of input data, is the source of data another hardware IP block or memory? In what order and what unit sizes is the input data prepared? Similar considerations apply in case of output data.
  • What is the bus transfer transaction size?
  • What is the endianness of the bus?


Let us switch gears and suppose for a moment that a hardware IP block was designed for a specific endianness, irrespective of the endianness of the system. Examples of this situation are discrete hardware devices connected on a PCI bus. The PCI bus is defined by standard to be little endian and devices connected to it therefore expect data to appear little endian on the bus – that is, the byte destined to the lowest address to be in the lowest numbered byte lane.

The host processor to which the PCI device is connected may be big endian or little endian. In this case, it is the responsibility of some software component (device driver) or some hardware component (an intermediate bridge between the host and the PCI device) to take into account the endianness of the host processor and use a specific byte sequence on the bus (reverse the sequence of bytes if required) while writing or reading the data from the PCI device.
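
When the software component takes on this responsibility, one common technique is to build the little endian byte sequence explicitly with shifts, so that the same driver code works on either kind of host. The sketch below illustrates the idea on a plain byte buffer. The helper names store_le32 and load_le32 are hypothetical, and access to real device memory or registers (which is platform specific) is not shown.

```c
#include <stdint.h>

/* Stores value so that its least significant byte is at the lowest address. */
void store_le32(uint8_t *dst, uint32_t value)
{
    dst[0] = (uint8_t)(value & 0xFFu);
    dst[1] = (uint8_t)((value >> 8) & 0xFFu);
    dst[2] = (uint8_t)((value >> 16) & 0xFFu);
    dst[3] = (uint8_t)((value >> 24) & 0xFFu);
}

/* Reads a 32-bit value whose least significant byte is at the lowest address. */
uint32_t load_le32(const uint8_t *src)
{
    return  (uint32_t)src[0]
         | ((uint32_t)src[1] << 8)
         | ((uint32_t)src[2] << 16)
         | ((uint32_t)src[3] << 24);
}
```

Because the helpers never reinterpret memory as a wider type, they produce the same byte sequence on big endian and little endian hosts.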

This situation is similar to an SOC whose endianness (that of CPU core and bus infrastructure) is configurable to be either little endian or big endian at boot time, but a hardware IP block on the SOC is implemented to follow a fixed endianness. Let us say the hardware IP block is designed assuming the bus is little endian. In case the SOC is configured as little endian, the sequence of bytes in CPU’s memory will be preserved when transferred to the IP block’s memory.

In case the SOC is configured as big endian, for each transaction, the sequence of bytes in IP block’s memory will be reversed relative to the byte sequence in CPU’s memory. To circumvent the reversal, the sequence of bytes needs to be reversed for each transaction over the bus.

This may be done by software when preparing the data in CPU’s memory or by an intermediate hardware bridge between memory and the IP block. In the following example, we illustrate how software can organize the data in CPU’s memory before the (little endian) IP block reads it. We assume the hardware IP block processes input data in 8-bit units and obtains the input data in 64-bit transactions.

Figure 9 

The byte sequence would now be the same for each transaction on the bus regardless of the endianness of the bus:

Figure 10 

Figure 11 

The same sequence of byte data units on the physical byte lanes of the bus serves the hardware IP block well to read out the bytes in the same fashion regardless of SOC’s boot time endianness configuration.
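
The software side of this reorganization, for the big endian configuration, can be sketched as a per-transaction byte reversal. The function name reverse_bytes_per_transaction is hypothetical, and the sketch assumes the buffer length is a multiple of the transaction size (8 bytes for a 64-bit transaction).

```c
#include <stddef.h>
#include <stdint.h>

/* Reverses the byte order inside each txn_bytes-sized chunk of buf.
 * Assumes len is a multiple of txn_bytes; a trailing partial chunk is untouched. */
void reverse_bytes_per_transaction(uint8_t *buf, size_t len, size_t txn_bytes)
{
    if (txn_bytes == 0) {
        return;
    }
    for (size_t base = 0; base + txn_bytes <= len; base += txn_bytes) {
        for (size_t i = 0, j = txn_bytes - 1; i < j; i++, j--) {
            uint8_t tmp = buf[base + i];
            buf[base + i] = buf[base + j];
            buf[base + j] = tmp;
        }
    }
}
```

After the reversal, the byte that the IP block must process first sits at the highest address of its transaction, which a big endian bus places on byte lane 0, where the little endian IP block expects it.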

Note again that if the hardware IP block processes multi-byte data units (and if software accordingly prepares data in memory in the same multi-byte units), then the reversing of sequence must happen for each data unit rather than each byte. For example, if the hardware IP block processes 32-bit data units and obtains the data in 64-bit transactions, the data units must be organized in CPU’s memory (or an intermediate bridge must create such a sequence of data units for each transaction) as follows:

Figure 12  

It can be seen that when the above data units are placed on the bus and obtained by the hardware IP block from the bus in each transaction, the sequence of data units will be the same regardless of the bus endianness.
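
For 32-bit data units and 64-bit transactions, the reorganization amounts to swapping the two units of each transaction while leaving the bytes inside each unit alone. The sketch below assumes the data was written to memory as uint32_t values, and the function name swap_units_per_transaction is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Swaps the two 32-bit units of every 64-bit transaction.
 * count is the number of 32-bit units; a trailing odd unit is left in place. */
void swap_units_per_transaction(uint32_t *units, size_t count)
{
    for (size_t i = 0; i + 1 < count; i += 2) {
        uint32_t tmp = units[i];
        units[i] = units[i + 1];
        units[i + 1] = tmp;
    }
}
```

The units are swapped as whole values, so no byte-level swapping is involved. This is the per-unit counterpart of the per-byte reversal used for 8-bit data units.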

Thus, the swapping operation for data during transfer between hardware components of different endianness modes needs to take into account the units of data processed. This situation can get complex when transferring data units of different lengths such as data structures containing fields of different data types such as uint8_t (8-bit unit), uint16_t (16-bit unit) and uint32_t (32-bit unit). For example, consider the following code section:

{
  typedef struct 
  {
    uint8_t elem8_1;
    uint8_t elem8_2;
    uint8_t elem8_3;
    uint8_t elem8_4;
    uint16_t  elem16_1;
    uint16_t  elem16_2;
    uint32_t elem32;
  } structDef;
  structDef *  structVar;
  structVar =  (structDef *)malloc(sizeof(structDef));
  /* Bookmark A */
  structVar->elem8_1  = 0x11;
  structVar->elem8_2  = 0x22;
  structVar->elem8_3  = 0x33;
  structVar->elem8_4  = 0x44;
  structVar->elem16_1  = 0x6655;
  structVar->elem16_2  = 0x8877;
  structVar->elem32  = 0xCCBBAA99;
  /* Bookmark B */
  …
}

At Bookmark A, let us say structVar gets allocated at address 0x80000000 in memory. At Bookmark B, the content of memory at address 0x80000000 would look as follows in big endian and little endian systems:

Figure 13 

Note that the byte sequence for multi-byte fields is reversed between big endian and little endian systems.

When structVar memory is prepared on a big endian hardware component and transferred to a little endian hardware component or vice versa, the goal of underlying transfer logic must be to ensure that structVar content is organized in destination’s memory conforming to destination’s endianness as shown in the above picture. If the destination is little endian, individual bytes of structVar content must be in the sequence shown above for a little endian memory regardless of the sequence in which they are organized in source’s memory.

To support this, an intimate knowledge of hardware (transaction size) as well as application context (data structure) is required. One option is for the transfer logic to separate the raw byte sequence from the application context. An intermediate hardware bridge or a low level device driver may be used to ensure that the byte sequence in source and destination memories is the same. To interpret multi-byte data fields of a data structure correctly, software (application logic) can do any byte swapping within the data fields.
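
The application-level swapping for the structure shown earlier can be sketched as follows. The sketch assumes that the transfer logic has already made the raw byte sequence identical in source and destination memories and that both sides lay out the structure identically. The helper names swap16, swap32 and structDef_swap_fields are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint8_t  elem8_1, elem8_2, elem8_3, elem8_4;
    uint16_t elem16_1, elem16_2;
    uint32_t elem32;
} structDef;

/* Reverses the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Reverses the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return ((v >> 24) & 0x000000FFu) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8)  & 0x00FF0000u) | (v << 24);
}

/* Swaps bytes within each multi-byte field; 8-bit fields need no change. */
void structDef_swap_fields(structDef *s)
{
    s->elem16_1 = swap16(s->elem16_1);
    s->elem16_2 = swap16(s->elem16_2);
    s->elem32   = swap32(s->elem32);
}
```

The swap is done per field, using each field's own width. This is why the application context (the data structure definition) is needed in addition to the raw byte sequence.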

Acknowledgement: I would like to thank Eldad Falik, a hardware IP design manager at Texas Instruments Inc, for his many clarifications on the topic.

Sandeep Yaddula is an applications engineer in the communications infrastructure business of Texas Instruments. Prior to his current role, Sandeep was a software design engineer in Voice over Packet technology. He holds a Master's Degree in computer science and engineering from the Indian Institute of Technology in Mumbai. As a hobby, Sandeep pursues certain topics of mathematics to obtain deeper insights and recently started a personal blog.

Further reading:
On holy wars and a plea for peace (Little Endian vs. Big Endian) 
PCI Local Bus Specification (Revision 2.2 – Dec 18, 1998) 
Endianness and Addressing 
Discussion 1 at pcisig.com 
Discussion 2 at pcisig.com
