
Tips for doing effective hardware/firmware codesign: Part 2

Editor’s Note: In the second part of a series excerpted from Hardware/Firmware Interface Design, author Gary Stringham provides best-practice techniques for performance, power-on interactions, communication, and control.

A key success factor for an embedded systems product is its performance. Is it fast enough to meet the customer’s requirements and expectations? But is it also cheap enough that the customer will buy it?

Putting a V-8 engine on a lawn mower will definitely provide sufficient performance; however, it will be too expensive for the consumer. Performance must be weighed against cost.

Previously discussed in Part 1 were tradeoffs between polling a status bit and waiting for an interrupt. Interrupts allow firmware to work on something else until the event occurs, and then be notified immediately when it does occur.

But judicious use of interrupts is required to avoid bogging down the system with interrupts occurring too frequently. Likewise, other aspects of the hardware/firmware interaction require judicious designs to ensure optimal performance.

This section discusses a few techniques to maximize the performance at the hardware/firmware interface without incurring too much cost in the platform.

Increasing the Buffer. Increasing the buffer size for I/O data allows more data to be transferred with fewer interrupts. But the question is: how big should the buffers be? It depends on the application. Table 1 contains some guidelines.

Table 1: Guidelines on Buffer Sizes

These and other system requirements may compete against each other in driving the buffer sizes and will require striking a proper balance. Increasing the buffer size does require more space on the chip, which must be taken into consideration; the silicon area of the buffer as a percentage of the whole chip is a factor. Doubling a buffer from 8 bytes to 16 bytes has a small impact on the chip, but doubling it from 8 Kbytes to 16 Kbytes will noticeably impact the chip’s space requirements.
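
To get a feel for the tradeoff, a quick back-of-the-envelope calculation helps. The sketch below (plain C, using an assumed 1-Mbyte/s data stream and illustrative buffer sizes, not figures from the book) shows how the interrupt rate drops as the buffer grows.

/* Rough estimate of interrupt rate versus buffer size.
 * The data rate and buffer sizes are illustrative values only. */
#include <stdio.h>

int main(void)
{
    const double bytes_per_second = 1000000.0;      /* assumed 1-Mbyte/s stream */
    const unsigned sizes[] = { 8, 64, 512, 4096 };  /* candidate buffer sizes   */

    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        /* One interrupt each time the buffer fills. */
        double irqs_per_second = bytes_per_second / sizes[i];
        printf("%5u-byte buffer -> %8.0f interrupts/s\n", sizes[i], irqs_per_second);
    }
    return 0;
}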

Best Practice Tip: Size receive and transmit buffers appropriately for efficient communication between hardware and firmware.

Working Ahead. It may not be just a matter of making the buffer size bigger. Once the buffer fills up, firmware has to deal with it.

But firmware will do so whenever it is allowed, given its priority compared to other tasks that need to run. After the block fills up one buffer, must the block wait around until firmware empties that buffer? Or is there another buffer that the block can start filling up?

If the block can work ahead, it can keep busy. The size and number of buffers dictate how far ahead the block can work. The same applies in the other direction. If firmware can queue up a bunch of work for the block, firmware can forget about it for a little while.

DMA controllers can help by providing chaining capabilities, allowing immediate continuation from one chunk of memory to the next. Buffers and chaining allow for continuous processing of data from one chunk to the next assuming that associated settings in configuration registers stay the same.
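
A common way to express chaining is a linked list of descriptors that the DMA controller walks on its own. The sketch below shows a hypothetical descriptor layout in C; the field names and the end-of-chain convention are illustrative, not any particular controller’s format.

/* Hypothetical chained DMA descriptor; a real controller defines its own layout. */
#include <stddef.h>
#include <stdint.h>

struct dma_descriptor {
    uint32_t src_addr;              /* physical address of this chunk's data */
    uint32_t length;                /* bytes to transfer                     */
    struct dma_descriptor *next;    /* next chunk, or NULL to stop           */
};

static struct dma_descriptor chunk[3];

/* Link three chunks so the controller moves from one to the next without
 * firmware intervention, provided the configuration registers stay the
 * same for every chunk. */
void chain_chunks(void)
{
    chunk[0].next = &chunk[1];
    chunk[1].next = &chunk[2];
    chunk[2].next = NULL;           /* end of chain */
}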

However, if configuration settings need to change between chunks, the block needs to finish with one chunk and stop, and then allow firmware to change the settings and start the next chunk. Stopping between chunks and requiring firmware interaction will not work if the system requires moving from one chunk to the next in a timely fashion.

An example of this type of system is a LaserJet printer, where the raster image for one page is maintained in several chunks. Once the mechanical gear train starts moving paper and scanning the laser, it cannot stop. So the block has to move from one chunk to the next within 50 ns. Firmware cannot step in, set up, and start up the next chunk during that time. The solution is to double-buffer the necessary configuration registers: essentially two sets of registers, the working set and the hold set.

When hardware is working on one chunk with its configuration settings, firmware can load up the hold register set with the settings for the next chunk. When hardware is done with the one chunk, it can then transfer settings from the hold set to the working set and continue with no delay. When the block transfers from one chunk to the next, it interrupts firmware, notifying it that the hold register set is empty and can be filled with the settings for the next chunk.
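
From the driver’s point of view, the hold set behaves like a one-deep queue. The sketch below, with hypothetical register names and addresses, shows the routine a driver might call from the “hold set empty” interrupt to queue the next chunk’s settings while the working set keeps the block running.

#include <stdint.h>

#define STATUS_HOLD_EMPTY  (1u << 0)    /* hypothetical status bit */

volatile uint32_t *const REG_STATUS   = (volatile uint32_t *)0x40001000;  /* assumed addresses */
volatile uint32_t *const REG_HOLD_CFG = (volatile uint32_t *)0x40001004;
volatile uint32_t *const REG_HOLD_LEN = (volatile uint32_t *)0x40001008;

/* Called from the "hold set empty" interrupt: load the hold set with the
 * next chunk's settings while hardware keeps running on the working set. */
void queue_next_chunk(uint32_t next_cfg, uint32_t next_len)
{
    if (*REG_STATUS & STATUS_HOLD_EMPTY) {  /* hold set is free to fill          */
        *REG_HOLD_CFG = next_cfg;           /* hardware copies these into the    */
        *REG_HOLD_LEN = next_len;           /* working set at the chunk boundary */
    }
}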

Best Practice Tips: Provide data buffering, queuing, and chaining to maximize data throughput. Provide double-buffered registers so that firmware can queue the next task while the block is still running the current task.

Tuning. When a system is being designed, a lot of study is put into optimizing performance vs. power consumption. Educated guesses are used to figure out bus priorities, memory bandwidth, and clock speed.

Simulations are used to refine those numbers. But even after a thorough study, numbers may not be right because of incorrect assumptions or unanticipated use cases. This causes problems for fabricated chips if those numbers are hard-coded and need to change.

Some performance aspects in the chip can be designed to allow firmware to tune those numbers if the default values need to be changed. This could include the ability to adjust bus priorities, DMA transfer sizes, and clock speeds of different buses or blocks.

In some cases dynamic changes in the tuning numbers may be desired to alter the behavior based on conditions. Many battery-powered products do this by switching between being optimized for battery life or for performance.
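
As a sketch of what such tuning hooks might look like to firmware, the example below switches between a performance profile and a battery profile. The register names, addresses, and values are hypothetical.

#include <stdint.h>

volatile uint32_t *const REG_BUS_PRIORITY = (volatile uint32_t *)0x40002000;  /* assumed addresses */
volatile uint32_t *const REG_DMA_BURST    = (volatile uint32_t *)0x40002004;
volatile uint32_t *const REG_CLK_DIV      = (volatile uint32_t *)0x40002008;

void tune_for_performance(void)
{
    *REG_BUS_PRIORITY = 0x3;    /* give the I/O block higher bus arbitration priority */
    *REG_DMA_BURST    = 64;     /* larger bursts, fewer bus grants                    */
    *REG_CLK_DIV      = 1;      /* run the block clock at full speed                  */
}

void tune_for_battery(void)
{
    *REG_BUS_PRIORITY = 0x1;    /* yield the bus more readily           */
    *REG_DMA_BURST    = 16;     /* smaller bursts                       */
    *REG_CLK_DIV      = 4;      /* run the block clock at quarter speed */
}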

Best Practice Tip: Make the chip tunable so that firmware can adjust performance characteristics such as bus priorities and clock speeds.

Margins. Consumers always want the new model of an electronic device to be faster than the previous model. To satisfy that demand, it is not uncommon for existing chips to be considered for producing faster products.

However, it is difficult to know if there will be enough bandwidth. Existing products are known to work okay, but how much performance leeway is there? Is it running at 60% or 95%? If it is 95%, then it cannot go much faster.

If the chip is designed with a 10 or 20% performance margin, it increases its chances of being usable in the next, faster product, saving several months and millions of dollars over designing a new chip just to make it slightly faster.

Best Practice Tip: Maximize performance margins to increase the potential for chip reuse in faster products.

Handling hardware power-on
Hardware typically comes out of power-on reset very quickly. Firmware typically does not. Firmware takes a long time to boot; it tests memory, loads the OS, loads the device drivers, and launches the applications.

Each device driver and each application has to be launched serially. Each one starts, initializes its data structures, and opens any connection ports to other firmware components.

It is not until the device drivers are loaded (one at a time) that firmware starts to communicate with hardware. Each device driver initializes its respective block by configuring registers and enabling interrupts.

Before one firmware component can interact with another, both need to be up and running. This requires some agreement or protocol between the two.

Typically if one block needs to interact with another block, they wait until their respective device drivers are up and running to coordinate the interaction. If, however, blocks need to interact at power-on, they must be able to do so without firmware assistance.

Any results (such as success, failure, status) from that power-on interaction can be reflected in registers that their respective device drivers can read when they start up.
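
For example, a device driver’s initialization routine might check such a results register before doing anything else. The register layout and bit names below are hypothetical.

#include <stdbool.h>
#include <stdint.h>

#define POWERON_DONE    (1u << 0)   /* hypothetical result bits */
#define POWERON_FAILED  (1u << 1)

volatile uint32_t *const REG_POWERON_STATUS = (volatile uint32_t *)0x40003000;  /* assumed address */

bool driver_init(void)
{
    uint32_t status = *REG_POWERON_STATUS;

    if (!(status & POWERON_DONE))
        return false;           /* block never finished its power-on interaction   */
    if (status & POWERON_FAILED)
        return false;           /* interaction failed; run the error-handling path */
    return true;                /* block is ready; continue normal setup           */
}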

Best Practice Tips: Design the block such that it does not require firmware interaction immediately at power-on. Design each block such that it still works at power-on even if its collaborative blocks are not ready yet. (Power-on initialization requires extra consideration because different hardware and firmware components are turning on in different states and initializing at different rates.)

If the two blocks are on different chips on different devices, it is very likely that power will not be applied to both devices at the same time. An example is a printer that is already on when the computer turns on.

The blocks handling the power-on protocol for the interface must be able to complete that protocol even if one side has already had power for some time. Besides running the power-on protocol when the remote block powers up after the local block has had power for a while, the local block must also safely handle powering up while the remote block is off.

With the remote block off, any incoming signals from it must be assumed to be unstable and incorrect. The local block must handle that case gracefully and convey the condition to its device driver when it starts up.

Tale from the Trenches. An engine interface block in an ASIC on the printer formatter board communicates with a print engine via a proprietary protocol that includes a power-on handshaking sequence. Most of the time, both the formatter and the engine are on the same power supply. But during early development in the lab, they are often on separate power supplies. When turned on at different times, the power-on handshaking sequence would fail and the block would cease further communications. When the engine interface device driver boots, it reads the status and discovers the communication error. Fortunately, the proprietary protocol contains a communication reset capability that the device driver invokes, allowing both sides to try again to get in sync.
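
A driver-side recovery path along those lines might look like the sketch below. The register and bit names are hypothetical, standing in for whatever the proprietary protocol actually defines.

#include <stdint.h>

#define LINK_HANDSHAKE_FAILED  (1u << 0)    /* hypothetical status bit  */
#define CTRL_COMM_RESET        (1u << 0)    /* hypothetical control bit */

volatile uint32_t *const REG_LINK_STATUS = (volatile uint32_t *)0x40004000;  /* assumed addresses */
volatile uint32_t *const REG_LINK_CTRL   = (volatile uint32_t *)0x40004004;

/* At boot, check whether the power-on handshake failed and, if so,
 * invoke the protocol's communication reset so both sides resync. */
void engine_if_driver_boot(void)
{
    if (*REG_LINK_STATUS & LINK_HANDSHAKE_FAILED)
        *REG_LINK_CTRL = CTRL_COMM_RESET;
}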

Power-On State of I/O Lines. I/O lines must power on in a safe or off state to avoid problems and danger until firmware can boot up and start safely coordinating the various activities.

The final usage of GPIO lines is typically undefined when the chip is designed; it is unknown if they will be used as input or output. In order to avoid contention with multiple devices on the lines, the GPIO lines should default to input. Once firmware is up and running, it can change appropriate pins to output as needed.
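
The firmware side of that sequence is small: leave every pin as an input until the board’s pin usage is known, preload a safe output value, and only then flip the direction bit. The register names and pin number below are hypothetical.

#include <stdint.h>

volatile uint32_t *const REG_GPIO_DIR = (volatile uint32_t *)0x40005000;  /* 1 = output, 0 = input (assumed) */
volatile uint32_t *const REG_GPIO_OUT = (volatile uint32_t *)0x40005004;

void gpio_late_init(void)
{
    *REG_GPIO_OUT |= (1u << 3);     /* preload a safe output level first */
    *REG_GPIO_DIR |= (1u << 3);     /* then turn pin 3 into an output    */
}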

Besides GPIO lines, any other lines controlling external motors, switches, and so on should wake up in a safe, off state. The hardware then waits until firmware is up and running enough to put everything in the desired state.

Block-Level Power Control. As devices become more sophisticated, as batteries and power sources become smaller, and as government regulations become more stringent, power is an ever-increasing issue. There are many efforts in the industry to make faster and denser chips run with less power.

One area is the ability to power down individual blocks. This may be done by actually removing power or by stopping the block’s system clock. Powering down individual blocks is desired when the system is in a power-save or quiescent mode, or if the block will not be used at all in this system, or if the block is not being used at the moment. When firmware determines that conditions exist to do so, it can remove power from the block.

When power is reapplied to a block, it will go through its power-on sequence, even though the rest of the chip and any of its collaborative blocks will have already been alive and operating. This requires the blocks and firmware to handle “re-power-on” cases.
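
A sketch of such a control, assuming a simple per-block clock-enable register, is shown below; the names and addresses are hypothetical. The key point is the re-power-on step: after restoring the clock, the driver must re-apply its configuration because the block comes back up in its power-on state.

#include <stdint.h>

#define BLOCK_UART  (1u << 2)       /* hypothetical clock-enable bit */

volatile uint32_t *const REG_CLK_ENABLE = (volatile uint32_t *)0x40006000;  /* assumed address */

static void uart_reinit_registers(void)
{
    /* Re-apply the driver's configuration registers (details omitted). */
}

void uart_power_down(void)
{
    *REG_CLK_ENABLE &= ~BLOCK_UART;     /* stop the block's clock to save power */
}

void uart_power_up(void)
{
    /* Clock restored: the block repeats its power-on sequence, so the
     * driver must handle the "re-power-on" case and reconfigure it. */
    *REG_CLK_ENABLE |= BLOCK_UART;
    uart_reinit_registers();
}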

Best Practice Tip: Provide firmware-accessible power controls for each block.

Communication and Control
An efficient hardware/firmware interface requires access to necessary information and flexibility in operation control. Ample, if not copious, information about error situations, causes, and results should be provided to help diagnose and resolve error conditions.

Error Information. In addition to copious information in the documentation, the block should also provide copious information when an error occurs. This includes the current values of addresses, counters, external and internal signals, and state machines.

Present as much relevant and even marginally relevant information to firmware as possible; it is unknown what little tidbit of information will give a clue into the problem. Copious information allows the device driver to make intelligent decisions in its error handling procedure.

Tale from the Trenches. In the engine interface block controlling a proprietary communication protocol with the print engine, a state machine controls the protocol and watches for various error conditions during the communication. When an error is detected, it transitions to the error state to generate an error interrupt, and then returns to idle. Firmware is interrupted and told that an error occurred. But there was no indication of what the error was. We enhanced the block and added some extra status bits to an existing register to indicate which state detected the error. Then when an error occurs, firmware queries the status bits for the additional information, providing useful error data.
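
In the same spirit, an interrupt handler can capture the extended error fields the moment the interrupt fires, so the error-handling code has them available later. The register layout and field positions below are hypothetical.

#include <stdint.h>

#define INT_COMM_ERROR   (1u << 0)  /* error interrupt flag (hypothetical)          */
#define ERR_STATE_SHIFT  8          /* field holding the state machine state        */
#define ERR_STATE_MASK   0xFu       /* that detected the error (hypothetical field) */

volatile uint32_t *const REG_INT_STATUS = (volatile uint32_t *)0x40007000;  /* assumed address */

static unsigned last_error_state;   /* saved for the error-handling code */

void comm_error_isr(void)
{
    uint32_t status = *REG_INT_STATUS;

    if (status & INT_COMM_ERROR) {
        /* Record which protocol state detected the error. */
        last_error_state = (status >> ERR_STATE_SHIFT) & ERR_STATE_MASK;
    }
}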

DMA Features. A distinct advantage of hardware is the ability to do things in parallel. While a DMA controller is transferring data in and out of memory, it can perform some basic tasks on that data without impacting the data throughput. One that has proved useful is a byte-swapping ability. This can help in a few different ways:

  • The block can work with big- or little-endian processors.
  • It can facilitate data exchanges between blocks or processors of different endianness.
  • It can handle data downloaded to the device in either endianness.
  • It can minimize the time firmware has to spend on byte swapping, a very tedious firmware task.
  • It can work around endianness problems within the chip.

Tale from the Trenches. The incoming DMA of a block was incorrectly wired to the bus with the wrong byte order. Since that DMA had a byte-swapping feature, firmware was able to configure it to swap the bytes back before the data went on into the block. This feature averted an expensive chip respin.
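
A sketch of that kind of workaround is shown below, with a hypothetical control register and swap bit. The software swap routine is included only to show what the DMA feature saves firmware from doing in a loop over every word.

#include <stdint.h>

#define DMA_CTRL_BYTE_SWAP  (1u << 4)   /* swap bytes within each 32-bit word (hypothetical bit) */

volatile uint32_t *const REG_DMA_CTRL = (volatile uint32_t *)0x40008000;  /* assumed address */

void dma_enable_byte_swap(void)
{
    *REG_DMA_CTRL |= DMA_CTRL_BYTE_SWAP;
}

/* Equivalent software swap, for reference; doing it in the DMA controller
 * keeps firmware out of this tedious per-word work. */
static inline uint32_t swap32(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000FF00u) |
           ((w << 8) & 0x00FF0000u) | (w << 24);
}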

A common problem with embedded systems is memory stomps and corrupted data. Building a CRC and/or checksum generator into the DMA controller provides a data signature that serves as a sanity check on the data. Comparing the signature of the data when it is written to memory by one block against the signature when it is read by another block will catch data corruption that occurs while the data is in memory.

Since data corruption is typically not noticed until the end of the pipeline, looking at the DMA controller CRC and/or checksum signatures at the various steps within the pipeline may give clues to where the corruption occurred. Note that the signature may not be the same throughout the data pipeline.

A block processing the data is likely to be modifying the data. So the DMA controller signature when the block reads the data may be different than the signature when it writes the data.

Adding the CRC and/or checksum generator in the DMA controller module that is instantiated throughout the chip will ensure that the same algorithm is used in all locations.
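
Used that way, the check itself is trivial for firmware. The sketch below assumes each DMA controller instance latches the signature of its last transfer into a readable register; the register names and addresses are hypothetical.

#include <stdbool.h>
#include <stdint.h>

volatile uint32_t *const REG_DMA_WR_CRC = (volatile uint32_t *)0x40009000;  /* signature when block A wrote the buffer */
volatile uint32_t *const REG_DMA_RD_CRC = (volatile uint32_t *)0x40009100;  /* signature when block B read the buffer  */

/* For a buffer handed unmodified from one block to the next, a mismatch
 * between the write-side and read-side signatures points to corruption
 * while the data sat in memory. */
bool buffer_intact(void)
{
    return *REG_DMA_WR_CRC == *REG_DMA_RD_CRC;
}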

Best Practice Tips: Include a byte-swapping capability in the DMA controller module instantiated throughout the chip. Include a CRC and/or a checksum generator in the DMA controller module instantiated throughout the chip.

Sharing I/O Pins. Given that pins on a package are expensive, it is not uncommon for more than one block to share pins. Output signals from more than one block to the same output pin must be muxed since only one block can be allowed to drive the pin.

Input pins that fan to more than one block should also be switched such that only one block will get the actual signals. This will prevent inadvertent interrupts and responses from blocks that are supposedly not active.

The input signals not currently configured to be connected to the pin still need to be tied to an appropriate asserted or deasserted level, such as deasserted to indicate that the block is not ready for transmission.

Figure 2 illustrates this pin sharing between three blocks: A, B, and C. Block A is currently selected to be connected to the pins. The signal coming out of block A is routed through the mux to the output pin.

The output signals of blocks B and C are not connected and therefore ignored. The incoming signal is routed through the mux to block A. The input signal of block B is tied high while not connected to the pin, and the input signal of block C is tied low.

Figure 2: Three blocks using the same I/O pins, but only one at a time.
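
To firmware, the whole arrangement can reduce to a single select register. The sketch below is hypothetical in its register name and encoding, but it shows the idea: one write determines which block owns the shared pins, and hardware ties the deselected blocks’ inputs to safe levels.

#include <stdint.h>

enum pin_owner { OWNER_BLOCK_A = 0, OWNER_BLOCK_B = 1, OWNER_BLOCK_C = 2 };

volatile uint32_t *const REG_PINMUX_SEL = (volatile uint32_t *)0x4000A000;  /* assumed address */

/* Route the shared pin's input and output through the mux to one block;
 * the deselected blocks see only their tied-off idle levels. */
void select_pin_owner(enum pin_owner owner)
{
    *REG_PINMUX_SEL = (uint32_t)owner;
}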

Best Practice Tips: For each chip output pin connected to multiple blocks on the chip, multiplex the block output lines to select which block controls the signal on the chip output pin at any given time. For each input pin connected to multiple blocks on the chip, multiplex the input line to select which block (or blocks) receives the signal at any given time.

Hiding Implementation Details. The main emphasis of this book is on what hardware looks like to firmware. To firmware, a register is a storage location that holds bits. But it makes no difference to firmware how that register is implemented, whether with a JK-, SR-, T-, or D-type flip-flop. No matter which type is used, it looks the same to firmware: firmware can write to it and firmware can read from it.

This gives hardware engineers the flexibility to implement designs as desired, especially since the technologies and resources available vary widely across chip platforms. The fundamental building blocks and available resources for circuits differ among the various FPGA and ASIC technologies.

As an example, consider countup and countdown counters. Each has its uses, depending on the real-world application. Table 2 illustrates their uses.

Table 2: Uses of Countup vs. Countdown Counters

A countdown counter can be implemented with a countup counter. Counting from 20 (0x14) down to 0 can be treated as counting from -20 (0xEC) up to 0. Translating 20 (0x14) to -20 (0xEC) is easily done by taking the two’s complement. Reading 0xF7 (-9) from the counter and taking the two’s complement yields 0x09 (9), telling firmware where the count stands.
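
The arithmetic is easy to check in a few lines of C; the 8-bit values below are the ones from the example above.

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint8_t start     = 20;                 /* 0x14: desired countdown start      */
    uint8_t written   = (uint8_t)(-start);  /* two's complement -> 0xEC (-20)     */

    uint8_t raw       = 0xF7;               /* counter has counted up to -9       */
    uint8_t remaining = (uint8_t)(-raw);    /* two's complement -> 0x09 (9 to go) */

    assert(written == 0xEC);
    assert(remaining == 0x09);
    return 0;
}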

Hardware engineers might prefer to use countup counters because they may be readily available on the FPGA they are using. Other factors, such as bit width, flip-flop type, fan-in, LUTs, and XOR gates, affect whether a countup or countdown counter is preferable.

Firmware should not have to care about the details of the implementation. If a countdown counter was implemented with a countup counter, it is possible for firmware to handle it by taking a two’s complement of the value before writing to the register.

However, firmware engineers must then remember to take a two’s complement anywhere in the code that writes to that register. In addition, a two’s complement must be taken anywhere in the firmware code for any reads from that counter register. Other companion registers often accompany a counter register, such as a reload register containing a value that is used to load the counter upon some event.

Firmware engineers must also take a two’s complement anytime the companion register is written to or read from. This exposes firmware to errors if the two’s complement is applied where it should not be, or is not applied where it should be.

This is further complicated if the block was implemented as a countup counter on an FPGA but as a countdown counter on an ASIC. Then the device driver has to first determine which implementation is used and then switch everywhere applicable depending on the implementation. Again, this is another potential source of bugs.

Firmware should not be required to accommodate a hardware-specific implementation, because doing so leaves it exposed to an incorrect firmware/hardware pairing. Hardware should accommodate its own hardware-specific implementation, because those accommodations will always be correctly paired with the implementation. Therefore, the block should contain the two’s complement translator, thus hiding the implementation detail from firmware.

In other words, hardware should provide a black box to firmware, removing the need for firmware to know how it was implemented inside. Figure 3 illustrates how one two’s complement translator can be implemented in the chip to provide that translation for any countup counter in the chip being used as a countdown counter.

Figure 3: The countup counter is being accessed through the two’s complement translator for countdown behavior.

Another example in the counter area is the use of Gray counters instead of binary counters. Gray counters consume less power and produce less noise. Again, a translator in front of the Gray counter would handle the appropriate Gray-to-binary translation.
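
For reference, the translation such a front-end translator performs is a short XOR cascade; the sketch below shows the standard Gray-to-binary conversion for a 32-bit value.

#include <stdint.h>

/* Convert a Gray-coded value to plain binary: each binary bit is the XOR
 * of that Gray bit and all higher-order Gray bits. */
uint32_t gray_to_binary(uint32_t gray)
{
    uint32_t binary = gray;
    for (uint32_t shift = 1; shift < 32; shift <<= 1)
        binary ^= binary >> shift;
    return binary;
}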

Part 1: Event Notification.

This article is an excerpt from Hardware/Firmware Interface Design by Gary Stringham, copyright 2010, used by permission from Newnes, an imprint of Elsevier Publishing.

Gary Stringham is the founder and president of Gary Stringham & Associates, LLC. He has engineering experience in R&D and manufacturing with a proven track record of cost savings and innovation in the design, implementation, and testing of firmware, hardware, and software solutions. He also has extensive expertise in diagnosing and resolving a broad range of engineering problems. Gary worked for Hewlett-Packard Company for over 21 years, working in Fort Collins, Colorado; Exeter, New Hampshire; Böblingen, Germany; and Boise, Idaho. He can be contacted by writing to .
