Software-Friendly Hardware

Christopher Leddy

August 31, 2001

Embedded programmers often get stuck coding around an awkward hardware design. These tips for hardware designers promise hope, and more efficient systems to boot.

Many projects use ASICs or FPGAs controlled by a processor to implement a desired system. Logic designers often do a poor job interfacing their hardware to the native processor, creating severe problems during software development. Many of these problems can be eliminated with a proper understanding of the interface restrictions with respect to the underlying processor and high level language development environment. This article presents hardware interface concepts that accelerate the overall system design process by reducing the complexity of software design at the expense of a moderate increase in the use of hardware assets.


Embedded projects are usually completed in two phases: hardware design followed by software development. These two design tasks are usually completed by different teams, with minimal overlap. Projects developed in this manner often suffer because the software team has little involvement with the initial hardware design. A system designed using this methodology leads to a poor interface between the hardware and software development environments and results in increased system development time, increased development cost, and delayed product entry into the marketplace.

Involving the software team in the hardware design phase is the most logical solution, but is often impractical due to scheduling, funding, and staffing concerns. A reasonable alternative is to generate a set of hardware interface guidelines that will produce hardware designed to accelerate the software development process. Understanding the optimal hardware interface from the software developer's point of view prevents the creation of unnecessary process.

General model of embedded system architecture

An embedded system can be viewed, at the system level, as a collection of interfaces between the various system elements, listed here as resources, and the system processor. The processor interface can be divided into two conceptual interfaces, labeled as the native and hardware bus. Note that the "buses" in this article are defined solely by the type of access the processor makes when using a resource, and need not correspond to separate discrete hardware connections.

The native bus is defined as the bus that interfaces resources to the processor in a manner that allows unrestricted, contiguous access. Unrestricted access means that the processor can access all elements of a resource using its native data types (such as bytes, words, and double words); contiguous means that all elements exist within the resource address space without holes. Examples of items usually interfaced to the native bus are RAM and EPROM. The hardware bus provides connections to resources with access restrictions such as size, location, addressing, address space, or relocation. Examples of hardware bus interfaces are an I/O port decoded to only accept word writes or a peripheral chip on the PCI bus that must be mapped before use. Hardware bus connections to resources require the programmer to access a resource in a limited manner and can be a source of code complexity and errors during software design, development, and integration.

Proper interface design of the hardware bus expedites software design and implementation, and often accelerates the hardware verification process. This article focuses on the design and implementation of the hardware bus interface to programmable logic resources.

Sample system definition

Consider a system with two different hardware implementations. The system is a three-axis processor-controlled servo. System design is limited to positional feedback control, to allow us to remain focused on the hardware interface implementation. Both implementations of the system interface the system processor to a custom ASIC (or FPGA), which provides both drive and feedback signals for a three-axis servo. The ASIC in each system must interface the system processor to three sets of drive/feedback resources using the processor's 32-bit data bus. Each resource consists of a 10-bit signed drive register, an 8-bit signed position register, and a 3-bit fault status register (any set bit is an error condition and should cause the axis drive to shut down).

Figures 1 and 2 present two reasonable implementations of the desired system's register interface, labeled simply as System Implementation A and System Implementation B. For simplicity, the remainder of the article refers to the two implementations as System A and System B.

Figure 1: System A register map

Figure 2: System B register map

Both hardware interfaces are of approximately the same design complexity when implemented in VHDL (or another high-level hardware design methodology). System A is slightly more efficient because the register address decoding is less complex and uses fewer hardware assets than System B. As such, most hardware designers would choose the System A implementation in order to reduce the number of logic elements used in the programmable device interfacing to the processor.

Listing 1 presents pseudo-code for an axis drive routine to be used with either of the example systems. The pseudo-code is designed for implementation on a modern processor, running under a real-time operating system, implementing axis control as three independent copies (or task instances) of a common axis control routine. The lines in the pseudo-code marked with an asterisk are only required when using the interface defined for System A, as discussed later.

Listing 1: Pseudo-code for the example system

if (Axis Status shows ERROR)
    Interrupt and Task Block (*)
    Set Axis Drive to Zero
    Unblock Interrupt and Task (*)
else
    Get Axis Position
    Sign Extend Position (*)
    Compute Proportional Term
    Interrupt and Task Block (*)
    Get Axis Drive
    Sign Extend Drive (*)
    Compute and Set New Axis Drive
    Unblock Interrupt and Task (*)

It is clear, even at the code prototyping stage, that System B will require much less code than System A. The slightly more complex hardware design in System B has already reduced the software development burden. Later paragraphs will refer back to these two example systems and the pseudo-code, so keep them in mind as you read the remainder of the article.

Hardware designers reading this article may be asking themselves, "Why is the first design considered less efficient than the second?" Both implementations have a similar set of parameters that control operation of the axes, and the first choice definitely requires fewer programmable hardware assets than the second. To properly answer this question, the designer must view the design from a system perspective instead of a hardware engineer's typical "collection of logic gates" viewpoint. The next section presents some general concepts to consider as the hardware designer develops the hardware interface for a system. Later we will dig a little deeper into the specifics and review the results of applying these concepts to our sample system design.

Before we progress further, I must caution the reader that the ideas presented in this article are biased toward improving the software development process. It is assumed that the hardware designer is already well versed in hardware design and wishes to understand how the hardware interface influences the software development process. Optimization of the overall system architecture requires compromise between both hardware and software implementations in order to meet project requirements; no practical project can (or should) meet all of the ideal software interface principles presented herein. Knowledge of the ideal allows the hardware designer to identify and remove unintentional impediments to software design.

Design principles

1. Use standard bus access
The general guiding principle of efficient embedded hardware interface design is to design the hardware so that access to the resource is as transparent as possible to the software developer. Transparent access is achieved when the processor can use all standard read and write instructions without regard to previous access content or timing.

Schemes such as paged register sets, write data encoded on address lines, and register read and write access with settling times different than native processor cycle times all create difficult situations for code development and usually require the development of driver routines to convert to and from standard access to the required special access. The use of special buses is often unavoidable, but the choice of using special access spaces should be reviewed carefully with respect to difficulties it will create when the system software is designed. System A uses some write-only registers, which require the system software to provide "shadow" memory to hold the values written to the resource. System B removes this restriction by allowing all registers to be read as well as written.
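To make the cost of nonstandard access concrete, here is a minimal host-side sketch of a paged register set next to a flat one. The page-select register, the page sizes, and every function name are hypothetical; the hardware is modeled with ordinary variables so the sketch can run anywhere.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_COUNT     2u
#define REGS_PER_PAGE  4u

/* Host-side model of a hypothetical paged register set: one page-select
 * register plus a window that exposes only the selected page. */
static uint32_t page_select;
static uint32_t page_store[PAGE_COUNT][REGS_PER_PAGE];

/* Paged access: the read depends on an earlier write to page_select, so
 * the two steps need a driver and must not be split by another task. */
uint32_t paged_read(unsigned page, unsigned index)
{
    assert(page < PAGE_COUNT && index < REGS_PER_PAGE);
    page_select = page;                      /* step 1: select the page */
    return page_store[page_select][index];   /* step 2: read the window */
}

/* Transparent access: the same registers decoded at consecutive
 * addresses are read with one ordinary load and no driver state. */
uint32_t flat_read(const volatile uint32_t *regs, unsigned index)
{
    assert(index < PAGE_COUNT * REGS_PER_PAGE);
    return regs[index];
}
```

The paged version carries hidden state (the current page) that every caller must agree on; the flat version has none, which is the sense of "transparent" used here.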

2. Develop a processor-based view of resource interface design
Hardware designers tend to view resource interfacing from the bottom up, viewing the resource's connection to a system bus. A better view of a resource is to follow processor access to the resource through the system.

The "processor to resource" interface is almost always the interface of primary importance and its efficiency should be a major priority during the hardware design cycle. Identification of resource access through the entire system is crucial to understanding the access limitations imposed by hardware design choices. Modern systems include memory controllers and remappable buses, which modify the types of access the processor makes when interfacing to a resource. Often, a poor hardware interface design does not become apparent until the software team attempts to interface to the resource and is then forced to make do with what exists. Neither example system illustrates this concept, but it is still important to remember when designing hardware interfaces.

3. Create and maintain a system memory map
A memory map of all resources (native, hardware, and even required holes) is vital to good system design. As I mentioned above, the memory map should be developed with respect to the processor and not simply state what address lines a resource decodes. If register-configurable resources such as the PCI bus are used, the hardware designer should locate all the configuration registers associated with that resource in the memory map and provide initial configuration register values to create a static map for hardware verification.

The hardware designer must also carefully consider the merits of dynamic reconfiguration. A system with no addition (or subtraction) of resources on a reconfigurable bus can easily be degenerated into a static map by forcing the configuration registers to the same value after system reset. This "static" system map provides a stable consistent architecture for both hardware integration and software development and also eliminates the use of error-prone pointer operations (such as the dreaded "pointer to an array of pointers") in the deliverable system code.

Finally, the memory map must also be maintained as the design matures, and updated throughout both the hardware and software development efforts.
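One way to keep such a map honest is to hold it in a single table that both teams build against and to check it mechanically. The sketch below assumes a static map; the resource names, addresses, and sizes are invented placeholders, not values from the example systems.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry in a processor-view memory map. All names and addresses
 * here are hypothetical placeholders. */
struct map_entry {
    const char *name;
    uint32_t    base;   /* first address as seen by the processor */
    uint32_t    size;   /* size in bytes, including required holes */
};

static const struct map_entry system_map[] = {
    { "RAM",        0x00000000u, 0x00100000u },
    { "AXIS_ASIC",  0x40000000u, 0x00000040u },
    { "PCI_WINDOW", 0x80000000u, 0x01000000u },
    { "EPROM",      0xFFF00000u, 0x00100000u },
};

#define SYSTEM_MAP_COUNT (sizeof system_map / sizeof system_map[0])

/* Return 1 if no two regions in the map overlap, 0 otherwise. */
int map_is_consistent(const struct map_entry *map, size_t count)
{
    assert(map != NULL);
    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            /* 64-bit sums avoid wrap-around at the top of memory. */
            uint64_t end_i = (uint64_t)map[i].base + map[i].size;
            uint64_t end_j = (uint64_t)map[j].base + map[j].size;
            if (map[i].base < end_j && map[j].base < end_i) {
                return 0;
            }
        }
    }
    return 1;
}
```

Running a check like this whenever the table changes is a cheap way to keep the map maintained as the design matures.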

4. Enforce consistent access methods
Modern embedded systems are so complex that they are often designed by more than one person. The design of each hardware component must be coordinated with the whole so that a consistent "look and feel" to resource access can be developed. Differences in access from one functional block to another create the potential for access restriction errors during software development and may require special software drivers for each subsystem in the design. Inconsistent access restrictions across different logic blocks also make hardware integration and verification very difficult.

Debugging tools used during software development and hardware integration are notoriously difficult to configure for multiple types of restricted access (just because you edit four hex digits on the debugger does not guarantee that it uses a 16-bit read/write cycle!). As an aside, it is useful to evaluate candidate emulators on how well they can handle multiple restricted access address spaces, especially in processor architectures that trigger a bus fault on "out of restriction" access methods.

Register design

Now that we have turned the focus of the hardware designer away from logic gates and buses and towards system design concepts, we will review the design of the most ubiquitous interface component in any processor-based system: the register. Register interfaces allow high speed access to a resource, and the efficiency of that access has a tremendous impact on system performance.

Register set organization and access

Hardware register size should be chosen to be the most efficient hardware access method used by the processor, almost always the native integer access method. Registers should be decoded as a contiguous group (without address "holes") in order to facilitate register access with pointers or array indices.

Any register that can be written should also be readable (in exactly the same format), to prevent the need to buffer the values of these registers in local memory. Registers that control a subsystem should be grouped together in similar organizations, to make them accessible to software through the use of common driver routines. This becomes especially important when multiple subsystems of the same type are required in the design.

To avoid contention between software tasks that can be coded as independent processes, independent subsystems should not share writable registers within a native processor access. These "independent" software processes tend to step on each other as they access the shared registers unless noninterruptible read/modify/write drivers are used in the system code. Shared registers among multiple processes may even incur the overhead of a function call for each access, depending on the operating system. Misuse of shared registers by accessing them without locking out other processes is also a common software design defect and results in intermittent system failures, slowing the process of integrating and testing the system software.

System A violates a number of the above concepts by using write-only registers, shared control and status registers, and by not presenting a common register map for each axis. Special driver routines must be used to buffer the written outputs, shift and mask axis drive and position information, and protect the axis drive register contents from being corrupted by the code written for each axis task. System B corrects these deficiencies by separating and regrouping the registers associated with each axis.
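A register group organized along these lines maps naturally onto a C structure. The sketch below is an assumed layout in the style of System B (the figures define the real one); the structure, its member names, and the driver functions are illustrative only, and the hardware is simulated by an array in ordinary memory.

```c
#include <assert.h>
#include <stdint.h>

#define AXIS_COUNT 3

/* Hypothetical per-axis register group: one field per 32-bit register,
 * every register readable and writable, identical layout for each axis. */
struct axis_regs {
    volatile int32_t  drive;     /* 10-bit signed drive value   */
    volatile int32_t  position;  /* 8-bit signed position value */
    volatile uint32_t status;    /* 3-bit fault status          */
};

/* Simulated register block; real hardware would sit at a fixed address. */
static struct axis_regs axis_block[AXIS_COUNT];

/* One common driver serves every axis; only the index differs, and no
 * axis task ever writes a register owned by another axis. */
void axis_stop(struct axis_regs *regs, int axis)
{
    assert(axis >= 0 && axis < AXIS_COUNT);
    regs[axis].drive = 0;
}

int axis_has_fault(const struct axis_regs *regs, int axis)
{
    assert(axis >= 0 && axis < AXIS_COUNT);
    return regs[axis].status != 0;
}
```

Because the group is contiguous and repeated, the three axis tasks can share this code without any locking between them.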

Register reset contents

It is vital that the hardware designer carefully consider the reset state of the system. Poor hardware designs require the boot code to grab control of the system on power-up and initialize the system to a safe state. This approach first fails when the system is powered up in the lab without working boot code, and again when the software design team tries to use emulators and debuggers that don't execute code at reset until triggered externally by the emulator control software.

System reset should place the hardware into a known safe state and the hardware should remain "safe" until the software initializes the system. Code should also be able to reset the hardware under software control to assist in debugging, self test, and initial code development.

System A does not control the reset contents of the drive registers and requires the code to rush in and set the drive to zero for all three axes. This configuration creates a major system design problem because the processor is usually held in reset until well after the FPGA and ASICs have powered up and configured. Additional problems arise in System A during integration if the developer uses an emulator: processors controlled by an emulator may take many seconds to initialize and operate after power is applied. In both cases, axes are driven randomly until the software takes control.

System B sets all axis drive registers to zero on power-up and is not dependent on boot time to control axis drive settings. The addition of a software reset register was not considered necessary in this design because there are no hidden state machines.

Field design

Most resource interfaces include data items that do not fit exactly within one register. In such cases, the hardware designer is forced to break a register into fields. Proper field organization is critical to system performance, often having as great an impact as the register interface design. The rules for effective field interface design are similar to those for registers, but the designer must also be concerned with the order and placement of fields and the treatment of unused bits remaining in a register.


A field is defined as a subset of bits within a register that is used to report or control a functional element of a resource. The most common field types used in hardware design are boolean (true or false values, usually one bit), multi-bit status and control (multiple bits that report or control inter-related functions), enumerated status and control (a collection of bits taken together with each bit pattern representing a different hardware state), and numeric (bits that are taken together to represent the value of a quantity).

From a software perspective, the most effective organization of fields is to use only one field per register. This ideal software organization may result in an inefficient hardware implementation; good system design may require compromise by placing multiple fields in each register.

The rest of this article will treat multiple field organization of registers as an assumed necessity. The reader should still consider using single field registers on a case by case basis, when efficient access to a particular parameter of a resource is critical to system software performance.


The organization concepts previously presented for registers also apply to fields within a register. A register should only contain fields that pertain to one functional element of the design, and all writable fields in that register should also be readable.

Registers that contain fields from multiple functional elements again create the need for special drivers (as discussed previously), to allow multiple processes to safely access each distinct field. Fields configured as "write only" require shadow memory locations to contain the previous state of the fields in the register. What the hardware designer originally envisioned as access by a simple "mask/write" operation becomes an awkward multi-step function call that has to lock out interrupts and task switching, read from local memory, mask in and out values, perform a hardware register write, and then release interrupts and re-enable multitasking. The above scenario can be avoided if all the fields in a register are arranged so that they are accessed by only one software task.

System A requires the use of special driver routines because it combines fields for unrelated functions into a single register. System B provides an extreme form of the "fields organized by task within a register" philosophy, placing each field in its own distinct register, providing highly efficient access to each axis parameter in the resource.
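The multi-step driver described above might look roughly like the following. The lock and unlock functions are stand-ins for whatever the operating system provides, the register is simulated by a variable, and the field position passed in is an assumption rather than a layout taken from System A.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the RTOS locking calls. Here they only
 * track nesting so that the sketch can run on a host. */
static int lock_depth;
static void lock_interrupts_and_tasks(void)   { lock_depth++; }
static void unlock_interrupts_and_tasks(void) { lock_depth--; }

/* Simulated write-only register shared by several tasks, plus the
 * shadow copy that software must keep because it cannot read back. */
static volatile uint32_t shared_reg;
static uint32_t shared_shadow;

/* Write one field of the shared register without disturbing the rest. */
void shared_field_write(unsigned shift, unsigned width, uint32_t value)
{
    assert(width >= 1u && width < 32u && shift + width <= 32u);
    uint32_t mask = ((1u << width) - 1u) << shift;

    lock_interrupts_and_tasks();                        /* lock out other users */
    uint32_t next = shared_shadow;                      /* read the local copy  */
    next = (next & ~mask) | ((value << shift) & mask);  /* mask out, mask in    */
    shared_reg = next;                                  /* hardware write       */
    shared_shadow = next;                               /* keep shadow in step  */
    unlock_interrupts_and_tasks();                      /* release the locks    */
}
```

Every step other than the hardware write exists only because the register is shared and write-only.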


The hardware designer should also be aware of alignment restrictions enforced by the processor and the software development environment. Making a field cross a word boundary at the wrong address can require the software designer to access a field in pieces, slowing down and unnecessarily complicating access. When debugging, it is also very useful to have fields padded with zero bits such that the least significant bit of each field is aligned with a hex digit (4-bit) boundary: hex digit alignment assists with visual extraction of field values when registers are displayed on a logic analyzer, debugger, or emulator.

System A shows no alignment of fields within registers. It is very difficult to extract the field values from the raw hex data that would be presented to the user by a debugger or emulator reading a given register. Masking in test inputs is also very difficult when troubleshooting, due to the unaligned nature of the control fields. System B has all fields aligned on even hex digits; the state of each field is easy to determine from a register read, and setting a field to a given value is trivial.

Placement and order

The placement of fields within a register can also have a significant impact on efficient software implementation. Boolean and multibit fields are usually position-independent, but enumerated and numeric fields are accessed most effectively when they are placed in the least significant bits (LSBs) of a register (the actual bit numbers for the LSBs depend on the bit-numbering convention used in the processor documentation; PowerPC manuals, for example, number the most significant bit as bit 0, so never trust that bit 0 is the LSB). Placement of a field in the LSBs of a register eliminates the need to shift the contents of the field after masking, and makes identification of the field value easier when the register is accessed by test equipment or visually inspected with a debugger.

Placement of the fields for axis 2 and 3 in System A requires the software to mask and shift before using the values. System B has all numeric fields placed in the LSBs of a register, leading to much more efficient access. System B is also much more integration-friendly; a hex dump of a resource register can be visually broken into the proper field values.
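The difference in access cost can be sketched in a few lines of C. The packed layout below (an 8-bit field eight bits above the least significant bit) is assumed purely for illustration; the function names are likewise hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: an 8-bit field stored 8 bits above the LSB. */
#define POS_SHIFT 8u
#define POS_MASK  0xFFu

/* Field in the middle of a register: shift and mask on every read,
 * mask, shift, and merge on every write. */
uint32_t packed_get(uint32_t reg)
{
    return (reg >> POS_SHIFT) & POS_MASK;
}

uint32_t packed_set(uint32_t reg, uint32_t raw)
{
    assert(raw <= POS_MASK);
    return (reg & ~(POS_MASK << POS_SHIFT)) | (raw << POS_SHIFT);
}

/* Field in the LSBs: the shift disappears. If the unused bits also
 * read as zero, the register value is the field value. */
uint32_t lsb_get(uint32_t reg)
{
    return reg & POS_MASK;
}

uint32_t lsb_set(uint32_t reg, uint32_t raw)
{
    assert(raw <= POS_MASK);
    return (reg & ~POS_MASK) | raw;
}
```

The packed register value 0x00123400 hides the field value 0x34 in its middle digits, while the LSB-placed register shows 0x00000034 directly in a hex dump.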

Unused bits

Unused bits within a register can also influence software implementation efficiency. All unused bits should return zero and should be "don't cares" when written, to prevent the need for unnecessary masking and clearing operations. The only exception to this rule is a register that contains a numeric field represented as a two's complement number and the remainder of the most significant bits (MSB) in the register are unused. In this case, it is very useful to have the hardware implementation sign extend the MSB of the field into the unused bits. Numeric fields extended in this manner can be accessed directly by the processor as a signed value without the need for software sign extension. Combining this type of field with the "single field per register" recommendation becomes very useful when access speed to a particular numeric field variable is critical to overall system performance. The field can then be accessed directly as a native data access with no need for masking or sign extension. The implementation in System A requires the software to sign-extend the value of each numeric field when extracting the field value from the register. System B allows direct access to the field's value by native integer access to the register.
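The sign extension that software must otherwise perform can be sketched as below. The helper names are hypothetical; the first function is a generic two's complement extension for a field whose unused upper bits read as zero, and the second shows what remains when the hardware has already done the work.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low "bits" bits of a raw field to a 32-bit integer.
 * Needed when hardware leaves the unused upper bits at zero. */
int32_t sign_extend(uint32_t raw, unsigned bits)
{
    assert(bits >= 1u && bits < 32u);
    uint32_t sign  = 1u << (bits - 1u);         /* the field's sign bit  */
    uint32_t value = raw & ((sign << 1) - 1u);  /* keep only the field   */
    return (int32_t)(value ^ sign) - (int32_t)sign;
}

/* When hardware copies the field's sign bit into the unused upper
 * bits, a plain signed read returns the finished value. */
int32_t read_hw_extended(const volatile int32_t *reg)
{
    return *reg;
}
```

For the 10-bit drive field, a raw value of 0x3FF extends to -1 and 0x200 extends to -512; with hardware extension the register would already read as 0xFFFFFFFF and 0xFFFFFE00 respectively.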

Type selection

Field type selection can also greatly improve software implementation efficiency. Boolean fields are most efficient when switching independent resource functions on or off. Note that single bit fields are easy to code only if the register provided is read/write. If the hardware register provides restricted access to the field, a special buffer (and possibly a special driver) will be required to hold the current contents. Restricted access can also limit the use of some programming constructs (such as bitfields) that make the system code more readable and help reduce programming errors.

Numeric fields are useful for data that occupies a range of values when representing the state of a resource. Signed representations usually require more work in software and should be used only when a field can hold both positive and negative values. Avoid encoding other data in a numeric field (such as using the sign of the field to represent an unrelated resource state).

Multibit fields are usually more efficient from a hardware-implementation standpoint with dependent hardware resources, but can lead to increased code complexity when the system code is written. Enumerated types often better reflect the actual availability of a dependent function within a resource, and prevent the selection of conflicting functions (such as switching banks of memory onto a local bus). Enumerated types should also provide a selection that does nothing to allow a "parking zone" between switches, to allow for "break before make" switching code in the system software.
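As a rough illustration of the "parking zone" idea, the sketch below models an enumerated bank-select field with a do-nothing value. The enumeration, the register, and the write log are all hypothetical; the log exists only so the order of writes can be observed on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical enumerated bank-select field. BANK_NONE selects nothing
 * and serves as the parking zone between two active selections. */
enum bank_select { BANK_NONE = 0, BANK_A = 1, BANK_B = 2 };

/* Simulated read/write select register and a short log of writes. */
static volatile uint32_t bank_reg;
static uint32_t write_log[8];
static unsigned write_count;

static void bank_reg_write(uint32_t value)
{
    bank_reg = value;
    if (write_count < 8u) {
        write_log[write_count++] = value;
    }
}

/* Break-before-make: park on BANK_NONE, then select the new bank. */
void bank_switch(enum bank_select next)
{
    assert(next == BANK_NONE || next == BANK_A || next == BANK_B);
    if (bank_reg == (uint32_t)next) {
        return;                          /* already selected */
    }
    bank_reg_write(BANK_NONE);           /* break: disconnect current bank */
    if (next != BANK_NONE) {
        bank_reg_write((uint32_t)next);  /* make: connect requested bank   */
    }
}
```

With independent enable bits instead of an enumerated field, nothing in the interface would stop software from enabling both banks at once.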

The "write only" access to the axis drive fields in System A creates an interface with inefficient software access to the required fields. RAM buffers must be used to hold the previous contents of the axes that are not being modified at the time of the write. System B resolves this problem by providing one register per field and allowing read/write operations.

Performance evaluation of sample implementations

I converted the pseudo-code presented in Listing 1 into proper C code for both implementations in order to evaluate the performance of the resulting system software. The hardware interface for each system was simulated using a structure in native memory. The code avoided using bitfields, because C leaves the layout and access width of bitfields to the compiler, so they cannot be relied on in restricted-access address spaces. The system code simulation was run on a PowerPC and compiled using Green Hills MultiC. The target operating system was VxWorks and the compiler was set to mid-level optimization (to aid in debugging and allow the author to associate each assembly instruction with an individual line of C code).
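For readers who want a feel for what such a conversion looks like, here is a minimal sketch of the Listing 1 routine against a System B style interface simulated by a structure in memory. It is not the code that was measured: the structure layout, the target and gain parameters, the additive drive update, and the clamp to the 10-bit range are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed System B style register group, simulated in ordinary memory. */
struct sim_axis {
    volatile int32_t  drive;     /* signed drive, sign-extended by hardware    */
    volatile int32_t  position;  /* signed position, sign-extended by hardware */
    volatile uint32_t status;    /* fault bits; any set bit is an error        */
};

#define DRIVE_MAX   511     /* limits of a 10-bit signed value */
#define DRIVE_MIN (-512)

/* One pass of the Listing 1 routine for a single axis. No task block,
 * shadow copy, masking, or sign extension is needed. */
void axis_update(struct sim_axis *axis, int32_t target, int32_t gain)
{
    assert(axis != NULL);
    if (axis->status != 0) {
        axis->drive = 0;                             /* Set Axis Drive to Zero    */
    } else {
        int32_t position = axis->position;           /* Get Axis Position         */
        int32_t term = gain * (target - position);   /* Compute Proportional Term */
        int32_t drive = axis->drive + term;          /* Get Axis Drive (assumed additive update) */
        if (drive > DRIVE_MAX) drive = DRIVE_MAX;    /* assumed clamp to 10 bits  */
        if (drive < DRIVE_MIN) drive = DRIVE_MIN;
        axis->drive = drive;                         /* Set New Axis Drive        */
    }
}
```

Each pseudo-code line becomes one or two ordinary statements, which is consistent with the small instruction counts reported for System B.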

Table 1 lists each line of the pseudo-code and shows the number of assembly instructions and function calls made for each system implementation. The code for both implementations was also measured for execution speed. The subroutine updating the axes on System B ran 5.3 times faster than that on System A, mostly due to the removal of the task block and unblock function calls. Note that the speedup may be less extreme in a real system because true hardware access time would probably become a significant contributor to overall execution time.

Table 1: System implementation function call and assembly line count

                                System A                System B
Pseudo-code line                Asm Lines  Func Calls   Asm Lines  Func Calls
if (Axis Status shows ERROR)    14         0            4          0
Interrupt and Task Block        1          1            0          0
Set Axis Drive to Zero          15         0            2          0
Unblock Interrupt and Task      1          1            0          0
else                            0          0            0          0
Get Axis Position               11         0            2          0
Sign Extend Position            6          0            0          0
Compute Proportional Term       2          0            2          0
Interrupt and Task Block        1          1            0          0
Get Axis Drive                  10         0            1          0
Sign Extend Drive               6          0            0          0
Compute and Set New Axis Drive  30         0            3          0
Unblock Interrupt and Task      1          1            0          0

I experimented with increasing the compiler optimization level on both implementations; the increased optimization had no effect on System B and produced only a minor reduction in code size and execution time for System A. Such results indicate the hardware interface presented by System B is very close to "native" efficiency with respect to resource access of the axes' fields.

I also wanted to get an estimate of the hardware assets used by both implementations, so I coded the hardware interfaces in VHDL, and then used Xilinx Webpack software to synthesize and map the designs into Xilinx Virtex (high-complexity FPGA) and 9500 (low-cost CPLD) series devices. With the Virtex series, System A consumed 56 slices as opposed to 85 slices for System B. The particular device chosen (V300E-PQ240) had 3072 total slices available, with System A consuming 1.8% of available resources vs. 2.8% for System B. The 9500 series is much more limited in its internal resources, and on an XC95288XL-PQ208 device, System A consumed 18% of the available product terms vs. 30% for System B.

Reviewing the two designs in detail led me to the conclusion that the major driver in additional asset usage for System B was the combined axis addressing scheme. To verify this, I reorganized the register map such that each axis was considered as a separate resource and the individual axis maps were aligned on address bit boundaries. This alternate implementation preserves all the software interface advantages of System B, while decreasing overall hardware asset utilization. This alternate system architecture reduced the slice usage in the Virtex series to 2.3% and the product term usage in the 9500 to 22%.
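One plausible reading of "aligned on address bit boundaries" is that each axis block is padded to a power-of-two size, so the axis number and register number fall out of fixed address bits. The sketch below assumes a 16-byte block per axis; the layout, the names, and the decode functions are illustrative and are not taken from the alternate implementation itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AXIS_COUNT 3u

/* Hypothetical per-axis block padded from three registers to four, so
 * that every axis starts on a 16-byte boundary. */
struct axis_block {
    volatile int32_t  drive;
    volatile int32_t  position;
    volatile uint32_t status;
    uint32_t          reserved;  /* pad word; reads as zero, write ignored */
};

_Static_assert(sizeof(struct axis_block) == 16, "axis block must be 16 bytes");

/* Split a byte offset the way a simple hardware decoder could: two
 * address bits pick the axis, two pick the register within it. */
unsigned offset_to_axis(uint32_t offset)
{
    assert(offset < AXIS_COUNT * sizeof(struct axis_block));
    return (offset >> 4) & 0x3u;
}

unsigned offset_to_register(uint32_t offset)
{
    assert(offset < AXIS_COUNT * sizeof(struct axis_block));
    return (offset >> 2) & 0x3u;
}
```

Software still sees an array of identical axis structures; the pad word costs a few addresses but replaces comparisons against arbitrary addresses with simple bit selection.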

Understand the impact

Hardware design practices can greatly influence the complexity and quality of a system's software implementation. A good hardware design requires the designer to make decisions based on the complexity of both the hardware implementation and the resultant software design environment. Understanding the impact of hardware interface design on the software development process can dramatically improve system quality, performance, and reliability, while reducing the time and cost of the system development cycle.

Christopher Leddy has been programming computers for over 25 years, and specializes in embedded systems hardware and software design. He holds an MSEE from the University of Southern California and a BSEE from the State University of New York. Christopher is currently a senior principal systems engineer at Raytheon. His e-mail address is

Return to September 2001 Table of Contents
