
Taking the delay out of your multicore design’s intra-chip interconnections

This "Product How-To" article focuses on how to use a certain product in an embedded system and is written by a company representative.

Today's SOC designers readily accept the idea of using multiple processor cores in their complex systems to achieve design goals. Unfortunately, a 40-year history of processor-based system design has made the main processor bus the sole data highway into and out of most processor cores. The widespread use of processor cores in SOC designs, combined with the heavy reliance on the processors' main buses for primary on-chip interconnect, produces SOC architectures based on bus hierarchies like the one shown in Figure 1 below.

Figure 1: SoCs with multiple processors often employ bus hierarchies

Because processors interact with other types of bus masters, including other processors and DMA controllers, main processor buses feature sophisticated transaction protocols and arbitration mechanisms to manage that complexity.

These protocols and arbitration mechanisms usually require multi-cycle bus transactions that can slow system performance. As more processors are designed into a chip to perform more processing, the hierarchy-of-buses architecture shown in Figure 1 becomes increasingly inefficient, because more processors are using inefficient bus arbitration and transaction protocols to gain access to, and to use, relatively limited bus interconnect resources.

For example, the Xtensa LX2 processor's main bus, called the PIF, uses read transactions that require at least six cycles and write transactions that require at least one cycle, depending on the speed of the target device. Using these transaction timings, we can calculate the minimum number of cycles needed to perform a simple flow-through computation: load two numbers from memory, add them, and store the result back into memory. The assembly code to perform this computation might look like this:

L32I reg_A, Addr_A        ; Load the first operand
L32I reg_B, Addr_B        ; Load the second operand
ADD  reg_C, reg_A, reg_B  ; Add the two operands
S32I reg_C, Addr_C        ; Store the result

The minimum cycle count required to perform the flow-through computation is:

L32I reg_A, Addr_A        ; 6 cycles
L32I reg_B, Addr_B        ; 6 cycles
ADD  reg_C, reg_A, reg_B  ; 1 cycle
S32I reg_C, Addr_C        ; 1 cycle

Total cycle count: 14 cycles (neglecting processor pipeline effects)

The large number of required cycles for processor-based flow-through operations often becomes a major factor that favors the design of a purpose-built block of RTL hardware to perform the flow-through task, because a conventional processor core communicating over its main bus would be too slow.

One frequently used way to solve this problem is to use a faster bus. For example, like many RISC processor cores, some Xtensa processor configurations have a local-memory bus interface, called the XLMI, that implements a simpler transaction protocol than the processor's PIF. XLMI transaction protocols are simpler than PIF protocols because the XLMI bus is not designed to support multimaster protocols; load and store operations can occur in one cycle.

Conducting loads and stores over the processor's XLMI bus instead of the PIF results in the following timing:

L32I reg_A, Addr_A        ; 1 cycle
L32I reg_B, Addr_B        ; 1 cycle
ADD  reg_C, reg_A, reg_B  ; 1 cycle
S32I reg_C, Addr_C        ; 1 cycle

Total cycle count: 4 cycles (with the same caveat regarding the processor pipeline)

This timing represents a 3.5x improvement in the function's cycle count, and that improvement can mean the difference between acceptable and unacceptable performance, yet it requires no increase in clock rate. However, the XLMI bus still conducts only one transaction at a time; loads and stores still occur sequentially, which is still too slow for many processing tasks.

Consequently, processor core vendors offer much faster alternatives for on-chip, block-to-block communications. For example, Tensilica has boosted the I/O bandwidth of the Xtensa LX2 processor core with two different features called TIE ports and queue interfaces. (Note: TIE is Tensilica's Instruction Extension language, used to customise Xtensa processors for specific on-chip roles.) These two features can easily boost I/O transaction speeds by as much as three orders of magnitude with no clock-speed increase.

Ports and queue interfaces are simple, direct communication structures. Transactions conducted over ports and queue interfaces are not conducted by the processor's load/store unit using explicit memory addresses. Instead, customised processor instructions initiate port and queue transactions. Ports and queues reside outside of the processor's memory space, and the port or queue to be used is implicitly specified by the custom instruction. One designer-defined port or queue instruction can initiate transactions on several ports and queues at the same time, which further boosts the processor's I/O bandwidth.

Using this interconnect, it's possible to create queue interfaces that are especially efficient for the simple flow-through problem discussed above (load two operands, add them, output the result). Three queue interfaces are needed to minimise the time required for this task: two input queues for the input operands and one output queue for the result. With these three queue interfaces defined, it's possible to define a customised instruction that implicitly draws input operands A and B from their respective input queues, adds A and B together, and outputs the result of the addition (C) on the output queue.

The problem becomes more interesting if we make the three operands and the associated addition operation 256 bits wide. An off-the-shelf, 32bit processor core would need to process the 256bit operands in eight 32bit chunks using 16 loads, eight adds, and eight stores, as in the sketch below. A customised processor core can perform the entire task as one operation.
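Using the same simplified assembly notation as the earlier examples, the conventional 32bit version might look like the following sketch (the chunk registers and byte offsets here are illustrative, not from the original text):

L32I reg_A0, Addr_A+0          ; load first 32bit chunk of operand A
L32I reg_B0, Addr_B+0          ; load first 32bit chunk of operand B
ADD  reg_C0, reg_A0, reg_B0    ; add the first chunks
S32I reg_C0, Addr_C+0          ; store first 32bit chunk of the result
L32I reg_A1, Addr_A+4          ; second chunk
L32I reg_B1, Addr_B+4
ADD  reg_C1, reg_A1, reg_B1
S32I reg_C1, Addr_C+4
; ...and so on for the remaining six 32bit chunks

At the PIF timing computed earlier (14 cycles per 32bit chunk), the eight chunks consume 112 cycles. The TIE code needed to create the three 256bit queue interfaces is: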

queue InQ_A 256 in
queue InQ_B 256 in
queue OutQ_C 256 out

The first two statements declare 256bit input queues named InQ_A and InQ_B. The third statement declares a 256bit output queue named OutQ_C. Each TIE queue statement adds a parallel I/O port along with the handshake lines needed to connect to an external FIFO memory.
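For example, an input queue such as InQ_A appears at the processor boundary as, roughly, a wide data bus plus FIFO handshake signals. The signal names below are illustrative sketches, not the exact names the tools generate:

InQ_A[255:0]   ; data bus driven by the external FIFO
InQ_A_Empty    ; empty flag from the FIFO; a queue read stalls while asserted
InQ_A_PopReq   ; pop request asserted when an instruction reads the queue

An output queue such as OutQ_C gets the mirror-image set: a 256bit data bus, a full flag, and a push request.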

The following TIE code describes an instruction, ADD_XFER, that reads 256bit operands from each of the input queues defined above, adds the values together, and writes the result to the 256bit output queue:

operation ADD_XFER {} {in InQ_A, in InQ_B, out OutQ_C} {
    assign OutQ_C = InQ_A + InQ_B;
}

With this new instruction, the target task reduces to one instruction:

ADD_XFER

The hardware added to the processor to perform the ADD_XFER operation appears in Figure 2 below.

Figure 2: Creating a dedicated instruction requires very few additional gates, but delivers impressive results

Very few additional gates are required to add this ability to a processor core, yet the performance increase is immense. The ADD_XFER instruction takes five cycles to run through the processor's 5-stage pipeline, but the instruction has a flow-through latency of only one clock cycle because of processor pipelining.

By placing the ADD_XFER instruction within a zero-overhead loop, the processor delivers an effective throughput of one ADD_XFER instruction per clock cycle, which is 112 times faster than performing the same operation over the processor's PIF using 32bit load, store, and add instructions (14 cycles per 32 bits of I/O and computation equals 112 clock cycles per 256 bits).
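A sketch of such a loop, assuming the processor is configured with Xtensa's zero-overhead loop option (the block count in reg_N and the label are illustrative):

MOVI reg_N, 64     ; hypothetical count: process 64 256bit blocks
LOOP reg_N, .Lend  ; zero-overhead loop: no per-iteration branch cost
ADD_XFER           ; one 256bit add per cycle; operands pop from
                   ; InQ_A and InQ_B, the result pushes to OutQ_C
.Lend:

If an input queue runs empty or the output queue fills, the FIFO handshake stalls the instruction until data can move, so no polling code is needed in the loop.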

Steve Leibson is with Tensilica Inc. This article is adapted from the author's book, Designing SOCs with Configured Cores, published by Morgan Kaufmann.
