Achieving 200-400GE network buffer speeds with a serial-memory coprocessor architecture
Editor’s Note: In this Product How-To, Michael Miller of MoSys describes the challenges faced on the wired Internet backbone as increased network line and packet rates cause throughput bottlenecks at the processor/external DDR memory interface. He then shows how a new serial chip-to-chip protocol the company has developed, called the GigaChip Interface (GCI), with 200-400 GE data rates and 4.5 billion read/write transactions, can be used to eliminate such bottlenecks.
As network line rates and packet rates are increasing, the need for high efficiency, reduced latency, fine granularity interfaces to memory and coprocessors has become critical. Buffer traffic at 400GE will require approximately 900 I/O pins at 3.2 Gbps to DDR4 memory. Any additional off-chip memory operations for header processing would require as many more pins again.
Many designers will try to integrate all the memory on chip, but this limits how much computing resource can be included on the same die in the face of requirements to improve computation by 4x. Even with advanced packaging, these pin counts are not achievable once the line interface and power pins are included. I/O pins exact a cost not only in larger packages and die area but also in power. Being I/O efficient is therefore an important aspect of today’s architectures, and protocols play a large role in the efficiency of information transfer.
However, currently available device-to-device serial interfaces suffer from several shortcomings: some provide only channelized one-way transport, while others target specific applications, such as memory. These interfaces may also be optimized for large data packets, which makes them inefficient for the transactions of an ASIC or FPGA. The inefficiency arises because load/store transactions to and from memory occur in small synchronous transfers of data, such as 72 bits (64 bits of data plus 8 bits of error detection and correction). Such inefficiency incurs costs in the form of extra memory, additional traces, and therefore increased board real estate.
To meet these challenges, MoSys has developed a reliable serial chip-to-chip transport protocol that operates over OIF standard CEI SerDes and achieves 90% efficiency. The protocol, called the GigaChip Interface (GCI), can be scaled to 1, 2, 4 or 8 SerDes lanes as well as multiples of 8. It targets computational and memory solutions with serial interfaces for networking equipment, such as the Bandwidth Engine. Operating on existing devices with 16 lanes at 15 Gbps, the GCI supports 4.5 billion read/write transactions and provides sufficient bandwidth to buffer full duplex 200GE. Doubling the pins or doubling the line rate (to 30 Gbps) achieves full duplex 400GE.
After briefly describing the two most common categories of transmission protocol, packets and data word transactions, this article will provide details of the GCI protocol and its various layers and show how it can be used to achieve performance improvements in a typical system.
Channelized packets versus data words
The most common transmission protocols serve one of two categories: Packets or Data Word transactions. Packets are transported through multiple devices, one of which is often a switching device. Due to multiple end points and potential congestion in switches, packets face the reasonable possibility that they will be dropped. To address this problem, any serial interface protocol requires more complex error checking and flow control mechanisms. As shown in Figure 1 below, the protocol must provide the means to communicate multiple fields, including individual packet ID, priority levels, packet types, and end-to-end port ID fields.
Packet transmissions through the core electronics of networking equipment typically exhibit the following characteristics:
- Data rates of n x 1/10/40/100 Gbps
- Variable length packets from 64B to 1.5KB
- Packet arrival rate varies based on Data Rate and Packet Length
- Asynchronous transfer mode
- Reach of 8-30 inches possibly through connectors
- ASSP/ASIC/FPGA to and from network PHY or back plane
By contrast, data word transactions take place between two end points which could be either two peer devices such as ASSP, ASIC or FPGA or from a host device to a memory or co-processor. These rely on look-aside data word protocols and eliminate switch related issues. Data word transactions exhibit these transfer characteristics:
- Data is in fixed length frames of a predetermined size (e.g., 32b, 64b, 72b, …)
- Rate of n x the packet arrival rate (typically 4 < n < 24), often greater than 1 billion transactions per second
- Synchronous transfer mode
- Reach of less than 8 inches with no connectors
Integrity of the transferred data is the only concern, and the error rate is typically low (BER of 10^-15) because signal integrity issues are the only source of information loss. Data loss can be managed by design best practices and error checking protocols.
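The relationship between line rate, packet length, and look-aside transaction rate described above can be worked through numerically. The sketch below assumes standard Ethernet framing, where each packet occupies an extra 20 bytes on the wire (8 bytes of preamble and start-of-frame delimiter plus a 12-byte inter-packet gap). The function names are illustrative and are not part of any GCI specification.

```python
# Ethernet wire overhead per packet: 8 B preamble/SFD + 12 B inter-packet gap.
WIRE_OVERHEAD_BYTES = 20


def packet_rate(line_rate_bps: float, packet_bytes: int) -> float:
    """Packets per second at full line rate for a given packet size."""
    bits_on_wire = (packet_bytes + WIRE_OVERHEAD_BYTES) * 8
    return line_rate_bps / bits_on_wire


def lookaside_rate(line_rate_bps: float, packet_bytes: int, n: int) -> float:
    """Data word transactions per second when each packet triggers n accesses."""
    return n * packet_rate(line_rate_bps, packet_bytes)


if __name__ == "__main__":
    # Worst case for 100 Gbps: minimum-size 64 B packets.
    pps = packet_rate(100e9, 64)
    print(f"100GE, 64 B packets: {pps / 1e6:.1f} Mpps")
    for n in (4, 8, 24):
        tps = lookaside_rate(100e9, 64, n)
        print(f"  n = {n:2d}: {tps / 1e9:.2f} billion transactions/s")
```

With minimum-size packets at 100 Gbps, even a modest number of memory or coprocessor accesses per packet pushes the look-aside path past 1 billion transactions per second.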
Until GCI was developed, no commonly available protocol using SerDes had been optimized for synchronous fixed length transfers in the look-aside path. As options, designers have utilized channelized packet-oriented protocols over SerDes that result in higher overhead in resources and latency. The GCI protocol specifically streamlines device-to-device data transmissions and overcomes the inefficiencies of existing protocols for the look-aside application.
GCI’s three layer structure
The GCI specification defines three layers, the Physical Medium Attachment (PMA) layer, the Physical Coding Sublayer (PCS), and the Data Link layer.
The PMA layer in the GCI transfers 10-bit characters over a serial lane from one device to another. The PMA performs functions similar to those of the Physical Medium Dependent (PMD) and PMA layers in the Ethernet standards. These include electrical and timing functions, equalization, clock and data recovery (CDR), and serialization/deserialization.
Electrical and timing specifications are based on the Common Electrical I/O (CEI) 11G-SR standard. However, the GCI protocol stands alone: it does not itself define electrical and timing requirements or equalization, so it can be implemented with other electrical standards. Other characteristics and operations that take place in the PMA layer include:
- Both devices on a connection must use the same reference clock source; i.e., the clocking is mesochronous. Because the devices operate at exactly the same frequency, the GCI does not introduce skip symbols to compensate for clock rate differences.
- Each lane’s transmitter serializes 10-bit characters onto the lane. The least significant bit of the character is sent first.
- The receiver deserializes single bits received at line rate and reassembles them into 10-bit groups. The receiver includes a clock and data recovery (CDR) circuit to align the incoming bit stream with the internal bit clock. The CDR circuit continuously tracks the location of the eye in the received signal.
- The training sequence provided by the PCS sublayer includes a pseudo-random bit sequence (PRBS), which can be used for training by the CDR circuit and a decision feedback equalizer (if present).
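The serialization order described in the list above (10-bit characters, least significant bit first) can be illustrated with a short sketch. It assumes character boundaries are already known, since alignment is a PCS function, and the function names are hypothetical.

```python
CHAR_BITS = 10


def serialize(chars):
    """Turn 10-bit characters into a bit list, least significant bit first."""
    bits = []
    for c in chars:
        for i in range(CHAR_BITS):
            bits.append((c >> i) & 1)
    return bits


def deserialize(bits):
    """Regroup an aligned bit list into 10-bit characters (LSB arrives first)."""
    chars = []
    usable = len(bits) - len(bits) % CHAR_BITS  # ignore a trailing partial group
    for start in range(0, usable, CHAR_BITS):
        c = 0
        for i, b in enumerate(bits[start:start + CHAR_BITS]):
            c |= b << i
        chars.append(c)
    return chars


if __name__ == "__main__":
    tx = [0b0000000001, 0b1111100000]
    line = serialize(tx)
    print(line)
    print(deserialize(line) == tx)
```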
How the PCS sublayer works
As shown in Figure 2, the PCS encodes and transfers 80-bit frames over one or more serial lanes. To transmit, the PCS receives 80-bit frames from the Data Link layer and transforms them into 10-bit characters for the PMA layer. The PCS sublayer has the following characteristics:
- Before scrambling begins, the Tx and Rx exchange the initial 48-bit state of the linear feedback shift registers (LFSR).
- Each lane is then scrambled with a PRBS generated by the LFSR for transmission. This provides sufficient transition density and DC balance for reliable, high-speed serial communications without the overhead of 8b/10b coding. Except in pathological cases, the scrambled stream gives the receiver enough transitions for clock recovery, as well as DC balance.
- In Tx, scrambling takes place after striping. In Rx, descrambling takes place after character alignment and deskewing and before lane reordering.
- The number of lanes in each direction in a GigaChip Interface connection does not need to be the same. For example, a memory device designed for write-mostly applications needs fewer transmit lanes than receive lanes.
- Striping breaks each 80-bit frame into 10-bit characters and distributes them over the available lanes. The PCS layer stripes the frames from the Data Link layer over 1, 2, 4, or 8 serial lanes. The Data Link layer does not need to know how many lanes are used in the lower layers. On the Rx side, the PCS identifies character boundaries in each lane by means of a synchronization pattern of ten 0s followed by ten 1s with each training sequence. To destripe, the receiver for each lane independently searches for the synchronization pattern and reassembles 10-bit characters from the logical lanes back into 80-bit frames.
- Following character alignment, the PCS compensates for character-scale (10 UI) skew between lanes.
- Deskewing amounts are established during a training period before any frames are transmitted. Deskewing is a similar function to that which is performed in the PCIe and XAUI protocols as shown in Figure 3 below.
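The Tx ordering in the list above (stripe first, then scramble each lane) and its Rx inverse can be sketched as follows. Several details here are assumptions made for illustration only: the round-robin character order, the least-significant-character-first split of the frame, and the LFSR taps (48, 47, 21, 20), which are a commonly tabulated maximal-length choice and not taken from the GCI specification. Character alignment, deskew, and lane reordering are not modeled.

```python
CHAR_BITS = 10
CHARS_PER_FRAME = 8  # 80-bit frame = eight 10-bit characters


def frame_to_chars(frame):
    """Split an 80-bit frame into 10-bit characters (assumed LS character first)."""
    return [(frame >> (CHAR_BITS * i)) & 0x3FF for i in range(CHARS_PER_FRAME)]


def chars_to_frame(chars):
    frame = 0
    for i, c in enumerate(chars):
        frame |= c << (CHAR_BITS * i)
    return frame


def stripe(frames, lanes):
    """Distribute the characters of each frame round-robin over the lanes."""
    if lanes not in (1, 2, 4, 8):
        raise ValueError("this sketch supports 1, 2, 4, or 8 lanes")
    out = [[] for _ in range(lanes)]
    n = 0
    for frame in frames:
        for c in frame_to_chars(frame):
            out[n % lanes].append(c)
            n += 1
    return out


def destripe(lane_chars):
    """Inverse of stripe(): interleave lanes and rebuild 80-bit frames."""
    lanes = len(lane_chars)
    total = sum(len(lane) for lane in lane_chars)
    chars = [lane_chars[n % lanes][n // lanes] for n in range(total)]
    return [chars_to_frame(chars[i:i + CHARS_PER_FRAME])
            for i in range(0, total, CHARS_PER_FRAME)]


class Lfsr48:
    """48-bit Fibonacci LFSR. The taps are illustrative, not GCI's polynomial."""
    TAPS = (48, 47, 21, 20)

    def __init__(self, seed):
        if not 0 < seed < (1 << 48):
            raise ValueError("seed must be a nonzero 48-bit value")
        self.state = seed

    def next_char(self):
        """Produce the next 10 PRBS bits packed into one character."""
        c = 0
        for i in range(CHAR_BITS):
            fb = 0
            for t in self.TAPS:
                fb ^= (self.state >> (t - 1)) & 1
            self.state = ((self.state << 1) | fb) & ((1 << 48) - 1)
            c |= fb << i
        return c


def scramble_lane(chars, seed):
    """XOR a lane with the PRBS. Applying it twice with the same seed descrambles."""
    lfsr = Lfsr48(seed)
    return [c ^ lfsr.next_char() for c in chars]
```

Because the scrambler is a plain XOR with a sequence that both ends can regenerate from the exchanged 48-bit state, the receiver recovers the data by running the same operation, which is why no 8b/10b-style coding overhead is needed.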
The GCI data link layer
The Data Link Layer reliably transfers 72-bit payloads on behalf of the Transaction layer. This layer applies the CRC coding and the acknowledgment/replay protocol that detects and recovers from errors. It appends error detection and control information to the payload, creating an 80-bit frame.
To do so, the transmitter on the link computes a 6-bit CRC code over the non-CRC fields (74 bits) of all outgoing frames, whether or not the frames need acknowledgement. The CRC code is generated from the following polynomial (1):
G(x) = x^6 + x + 1 (1)
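As a minimal sketch of how the polynomial in equation (1) can be applied, the following computes a 6-bit CRC over a 74-bit value by bitwise polynomial division. The bit ordering (most significant bit first), the zero initial value, and the treatment of the 74 non-CRC bits as a single integer are assumptions for illustration; the article does not give the exact field layout, and the one-frame CRC delay is not modeled.

```python
CRC_POLY = 0b1000011  # G(x) = x^6 + x + 1
CRC_BITS = 6
DATA_BITS = 74        # non-CRC fields of an 80-bit frame


def crc6(data, nbits=DATA_BITS):
    """Remainder of data(x) * x^6 divided by G(x), processed MSB first."""
    reg = data << CRC_BITS
    for bit in range(nbits + CRC_BITS - 1, CRC_BITS - 1, -1):
        if reg & (1 << bit):
            # Align the top term of G(x) with the current bit and subtract (XOR).
            reg ^= CRC_POLY << (bit - CRC_BITS)
    return reg & 0x3F


def frame_ok(data, received_crc):
    """Receiver-side check: recompute the CRC and compare."""
    return crc6(data) == received_crc


if __name__ == "__main__":
    payload = 0x2AB_CDEF_0123_4567_89A  # arbitrary 74-bit example value
    crc = crc6(payload)
    print(f"crc = {crc:06b}")
    print(frame_ok(payload, crc))              # clean frame
    print(frame_ok(payload ^ (1 << 40), crc))  # single-bit error
```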
The receiver detects errors in incoming frames by recomputing the CRC bits. When the receiver section of a port receives good frames, it issues an ACK. It also reports bad frames by piggybacking that information in the ACK bit of all outgoing frames, regardless of whether the outgoing frame itself is acknowledgeable. When a port detects a CRC error, that port transmits a replay request. The source port then replays all frames after the last good frame. In this way, the GCI interconnect recovers from errors that are detectable by the per-frame CRC code. Figure 3 diagrams the round trip pathway for CRC and ACK.
The transmitter and receiver on a link maintain a synchronized 8-bit frame identifier (FID) counter for frames that need to be acknowledged. The transmitter section of a port includes a replay buffer to hold a copy of outgoing acknowledgeable frames. It can delete a given frame from the head of the buffer after the receiver section of the port receives an acknowledgment FID for that frame or for a later frame. When a link is carrying frames, the transmitter increments its FID counter by 1 for every acknowledgeable frame that it sends, as does the receiver. The process of serializing the FID over ACK bits is seen in Figure 4.
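The transmitter-side bookkeeping described above can be sketched with a small replay buffer. This is an illustrative model only: the class and method names are hypothetical, frames are opaque Python objects, and the serialization of the FID over ACK bits is not modeled.

```python
from collections import deque

FID_MODULUS = 256  # 8-bit frame identifier


class ReplayTransmitter:
    """Keeps a copy of each acknowledgeable frame until it is acknowledged."""

    def __init__(self):
        self.next_fid = 0
        self.replay_buffer = deque()  # (fid, frame) pairs, oldest first

    def send(self, frame):
        """Record an outgoing acknowledgeable frame and return its FID."""
        if len(self.replay_buffer) >= FID_MODULUS - 1:
            raise RuntimeError("too many unacknowledged frames for an 8-bit FID")
        fid = self.next_fid
        self.replay_buffer.append((fid, frame))
        self.next_fid = (fid + 1) % FID_MODULUS
        return fid

    def acknowledge(self, acked_fid):
        """Release every buffered frame up to and including acked_fid."""
        if all(fid != acked_fid for fid, _ in self.replay_buffer):
            return 0  # stale or unknown FID: nothing to release
        released = 0
        while self.replay_buffer:
            fid, _ = self.replay_buffer.popleft()
            released += 1
            if fid == acked_fid:
                break
        return released

    def replay(self):
        """Frames to resend on a replay request: all after the last good frame."""
        return [frame for _, frame in self.replay_buffer]


if __name__ == "__main__":
    tx = ReplayTransmitter()
    for i in range(5):
        tx.send(f"frame{i}")
    tx.acknowledge(2)   # an ACK for FID 2 also covers FIDs 0 and 1
    print(tx.replay())  # frames that would be resent on a replay request
```

Acknowledging a later FID implicitly acknowledges all earlier ones, which is what lets the ACK information be trickled out over ordinary outgoing frames.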
The CRC code for a payload arrives one frame interval after the payload, and an error in the payload cannot be detected until the CRC code arrives. If performing an action based on the payload would have irreversible effects, the Transaction layer postpones action until after the Data Link layer has verified the CRC code.
The purpose of the CRC is to detect errors and recover with a replay. Without CRC, a potential catastrophic failure occurs when an error goes undetected and state in the memory is corrupted with an erroneous write. The probability of this is a product of the line rate, the bit error rate (BER), and the “coverage” of the CRC. For an 80-bit frame, the 6-bit CRC in the GCI detects all single bit errors, all but 17 of the 2-bit errors, and all but 1233 of the 3-bit errors. For SerDes implementations of chip-to-chip links, all that is required is a continuous-time linear equalizer (CTLE) and potentially a lightly weighted decision feedback equalizer (DFE). Because such a receiver adds little error propagation, bit errors stay close to independent, which reduces the probability of a 2-bit error to near BER^2 and of a 3-bit error to near BER^3.
Doing the math outlined in Koopman [1] and assuming a very conservative BER of no more than 1 x 10^-15 per lane (1 x 10^-18 is much more likely), the undetected error probability (UEP) is 1.7 x 10^-29 per frame. For 10.3125 Gbps signaling over 8 lanes, that works out to about 1.6 x 10^16 hours between undetected errors, a failure rate far below 1 Failure in Time (FIT), which is one failure per 10^9 hours (roughly 114,000 years). In most systems, many other things will fail before an undetected error will occur. Future revisions of GCI may have the option for CRC-12 over two frames for the very conservative or for much higher rate SerDes.
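The coverage figures quoted above can be checked by brute force. The sketch below treats the 80-bit frame as a single CRC codeword under x^6 + x + 1, which is an assumption, because the article notes that the CRC actually trails its payload by one frame. An error pattern goes undetected when the CRC residues of its flipped bit positions XOR to zero.

```python
from itertools import combinations

CRC_POLY = 0b1000011  # x^6 + x + 1
FRAME_BITS = 80


def syndromes(nbits=FRAME_BITS):
    """syndromes[i] = x^i mod G(x): the residue left by an error in bit i."""
    out, s = [], 1
    for _ in range(nbits):
        out.append(s)
        s <<= 1
        if s & 0b1000000:  # degree reached 6: reduce by G(x)
            s ^= CRC_POLY
    return out


def count_undetected(weight, nbits=FRAME_BITS):
    """Count error patterns with `weight` flipped bits that the CRC misses."""
    syn = syndromes(nbits)
    count = 0
    for positions in combinations(range(nbits), weight):
        residue = 0
        for p in positions:
            residue ^= syn[p]
        if residue == 0:
            count += 1
    return count


if __name__ == "__main__":
    for w in (1, 2, 3):
        print(f"undetected {w}-bit error patterns: {count_undetected(w)}")
```

The 2-bit result follows from the structure of the polynomial: x^6 + x + 1 is primitive with period 63, so only bit pairs exactly 63 positions apart cancel, and an 80-bit frame contains 80 - 63 = 17 such pairs.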
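The figures in this paragraph appear to follow from the coverage numbers given earlier. The sketch below is a reconstruction under that assumption, not the author's published calculation: it weights the 17 undetectable 2-bit patterns by BER^2 and the 1233 undetectable 3-bit patterns by BER^3, then converts the per-frame probability into hours between undetected errors at the stated line rate.

```python
BER = 1e-15                  # conservative per-lane bit error ratio from the text
UNDETECTED_2BIT = 17         # undetectable 2-bit patterns per 80-bit frame
UNDETECTED_3BIT = 1233       # undetectable 3-bit patterns per 80-bit frame
LANES = 8
LANE_RATE_BPS = 10.3125e9
FRAME_BITS = 80
SECONDS_PER_HOUR = 3600


def undetected_error_probability(ber=BER):
    """Per-frame probability that an error pattern slips past the CRC."""
    return UNDETECTED_2BIT * ber**2 + UNDETECTED_3BIT * ber**3


def hours_between_undetected_errors(ber=BER):
    frames_per_second = LANES * LANE_RATE_BPS / FRAME_BITS
    errors_per_hour = (undetected_error_probability(ber)
                       * frames_per_second * SECONDS_PER_HOUR)
    return 1.0 / errors_per_hour


def fit_rate(ber=BER):
    """Failures per 10^9 device-hours."""
    return 1e9 / hours_between_undetected_errors(ber)


if __name__ == "__main__":
    print(f"UEP per frame:        {undetected_error_probability():.2e}")
    print(f"Hours between errors: {hours_between_undetected_errors():.2e}")
    print(f"FIT:                  {fit_rate():.2e}")
```

The 2-bit term dominates: at this BER the 3-bit contribution is many orders of magnitude smaller, so the result is essentially 17 x BER^2.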
The GCI Transaction layer
The GCI does not define the transaction layer payload. The format and meaning of Transaction layer payloads is application-dependent. For example, a network processor and a memory device might use a master-slave protocol to communicate read and write commands and memory data. Alternatively, two ASICs might use a peer-to-peer protocol to stream long packets to each other. Figure 5 shows a transaction layer example.
How GCI deals with transmission errors
Either residual PLL jitter or quantum effects cause random bit errors. For the GCI, the BER is guaranteed to be below 10^-12, and in practical chip-to-chip implementations it is closer to 10^-18. By comparison, the BER for parallel interconnects is 10^-19.
The short per-frame CRC provides ongoing bit error mitigation and keeps the undetected failure rate below 1 Failure in Time (FIT). In terms of system impact, parallel memory interfaces usually protect only the data, leaving addresses and commands potentially vulnerable. The GCI protocol, by contrast, protects all aspects of the interconnect, including commands, addresses, and data. Since the GCI specifies error counts as part of its specification, error rates can be monitored periodically. In the very unlikely event that the error count exceeds a prudent level, the system may choose to retrain the links. Figure 6 shows the schematic of CRC error handling with positive ACK.
How GCI performs in a system
Because the GCI is payload-agnostic, it can carry a variety of commands of varied lengths mapped into 72b frames. The data return is over a 72b bus. Figure 7 shows that commands that cross frame boundaries are immune to the effects of errors. GCI can carry short transactions suitable for header processing as well as packet buffering applications. The protocol is efficient for short 9B (72b) transfers when compared to the Interlaken Look-Aside (ILA) or Hybrid Memory Cube (HMC) interfaces.
Using GCI in a packet processor
GCI has been used in two generations of the Bandwidth Engine family for the past 4 years. Multiple supporters and users have also adopted the GCI in ASIC and FPGA based systems, including Altera, Xilinx, Tabula, LSI, and Avago. Although the GCI is compatible with the standard electrical layer as defined by OIF CEI 11G-SR, it can also be used with other electrical standards. Figure 7 shows the relationship of the GCI protocol in a packet processor including an intelligent serial memory co-processor.
When implemented, the GCI supports high bandwidth density with differential serial links and is lightweight, requiring only 100,000 ASIC gates for 8 lanes. In current design implementations that run 8 links, each at 10 Gbps, performance reaches in excess of 1 Billion 72b transfers per second with 90% efficiency. The protocol incorporates PRBS scrambled encoding, fixed 80b frames, and provides for 8 lanes @ 10Gbps with 1 ns latency.
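The throughput and efficiency figures follow directly from the frame format: every 80-bit frame carries a 72-bit payload. A minimal sketch of the arithmetic, with a hypothetical function name:

```python
FRAME_BITS = 80
PAYLOAD_BITS = 72


def gci_throughput(lanes, lane_rate_bps):
    """Return (frames per second, payload bits per second, efficiency)."""
    raw_bps = lanes * lane_rate_bps
    frames_per_second = raw_bps / FRAME_BITS
    payload_bps = frames_per_second * PAYLOAD_BITS
    return frames_per_second, payload_bps, payload_bps / raw_bps


if __name__ == "__main__":
    for lanes, rate in ((8, 10e9), (16, 15e9)):
        fps, payload, eff = gci_throughput(lanes, rate)
        print(f"{lanes} lanes @ {rate / 1e9:g} Gbps: "
              f"{fps / 1e9:.2f} G frames/s, "
              f"{payload / 1e9:.0f} Gbps payload, {eff:.0%} efficient")
```

Eight lanes at 10 Gbps give 1 billion 72b transfers per second in each direction, and 16 lanes at 15 Gbps give 216 Gbps of payload per direction, which is consistent with buffering full duplex 200GE.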
The GCI is highly scalable in 1, 2, 4, 8 and 16 lane configurations and scales with the OIF roadmap. By implementing a 6b CRC per frame with a built-in frame replay mechanism, it achieves less than 1 undetected error in 10^25 frames transferred. Stated another way, the GCI achieves reliability of less than 1 FIT, or 1 undetected error in 1 billion hours (roughly 114,000 years of operation).
Reliability is critical when transferring commands to co-processors and control memory. Any corruption in the data could result in lost state. The ideal interface uses a reliable chip-to-chip transport protocol that supports end-to-end data protection. The protocol must be agnostic with reference to the payload and precisely recoverable should a bit failure occur.
This is important to support transactions with higher levels of abstraction. In the future, higher levels of functionality are envisioned to support Longest Prefix matching and indirect data structure accesses. The protocol must also be efficient at transferring small transactions such as structure pointer values as well as large transactions for packet buffering. The protocol must provide low latency because this has a direct impact on how much system latency is incurred and how much speculative work is done before a transaction can be committed. The GCI protocol meets all these criteria.
Michael Miller is vice president of technology innovation and systems design at MoSys. Previously he was Chief Technical Officer, Systems Architecture for Integrated Device Technology, Inc. He has also held engineering management positions in software, applications and product definition for networking memory, RISC processors and communications ICs, serving IDT for more than 20 years. He has also managed software teams within two systems companies and filled logic design and application functions at Advanced Micro Devices. He has a bachelor of science degree in computer science from the California Polytechnic State University at San Luis Obispo and has been awarded 30 patents to date.
References
1. P. Koopman and T. Chakravarty, “Cyclic Redundancy Code (CRC) Polynomial Selection for Embedded Networks,” International Conference on Dependable Systems and Networks (DSN), Florence, Italy, June-July 2004.
2. D. Watanabe, M. Suda, and T. Okayasu (Advantest Corp., Gunma, Japan), “34.1Gbps low jitter, low BER high-speed parallel CMOS interface for interconnections in high-speed memory test system,” Proceedings of the International Test Conference (ITC 2004), October 26-28, 2004.