Choosing the right memory for high performance FPGA platforms

High-performance computing is critical for many applications, and developers can often find solutions to their own embedded systems design problems by studying the most competitive of these applications. One example is high-frequency trading (HFT), a form of algorithmic trading that today accounts for more than 75% of US equity trading volume. HFT systems use machine-learning algorithms to process market data, implement strategy, and execute orders within microseconds. High-frequency traders move in and out of short-term positions at high volumes, aiming to capture sometimes only a fraction of a cent in profit on each trade. Because HFT is a very short-term strategy, firms do not tie up significant amounts of capital, accumulate large positions, or hold their portfolios overnight; instead, their systems constantly monitor price fluctuations to exploit short-lived opportunities.

At the turn of the 21st century, HFT was focused on superior algorithms and trading strategies; the advantage lay in strategy rather than speed, and the most popular systems had latencies on the order of seconds. By 2010, algorithmic improvements alone were no longer sufficient to gain an edge, and participants began competing on tick-to-trade latency instead, driving trade times down to microseconds.

Driven by sub-millisecond buy and sell orders, HFT platforms are engaged in a highly competitive race to cut market-data round-trip latency down to a few microseconds. Since a difference of even a few nanoseconds can create a significant competitive advantage in the form of latency arbitrage (sometimes referred to as 'front running'), trading firms are constantly on the lookout for faster trading servers.

Traditionally, HFT has been implemented in software running on high-performance computing systems that execute complex trading strategies (Figure 1). The OS kernel on these systems controls access to CPU and memory resources, the application stack handles the trading strategies, and a Network Interface Card (NIC) connects the system to the stock exchange.


Figure 1. Order processing in a software based approach (Source: Cypress)

However, this configuration suffers from drawbacks with respect to tick-to-trade latency:

  • Standard NICs are not optimized to handle TCP/IP and proprietary trade exchange protocols, and cannot handle market feeds onboard.
  • There’s an added delay of a few microseconds on the PCI Express bus between the host system and Ethernet cards.
  • The interrupt-based approach of the OS kernel inherently causes long delays.
  • These solutions are based on multi-core processors sharing memory resources. Shared memory access is not well suited to deterministic latency, which is critical when handling feeds from a stock exchange.

Recent advances in algorithmic trading have introduced some lower-latency solutions, the most promising of which is custom hardware built using Field Programmable Gate Arrays (FPGAs). These devices are a bridge between the extreme performance of hard-coded ASICs and the flexibility of CPUs. FPGAs provide a vast array of concurrent resources that can be configured to drastically reduce round trip trade latency compared to software based solutions (Figure 2).


Figure 2. Order processing in an FPGA based approach (Source: Cypress)

Besides being flexible, FPGAs can be programmed to handle critical tasks such as data acquisition, risk matching, and order processing entirely on-chip. This self-sufficiency makes them faster and more reliable than software implementations. The key factor that allows FPGA-based solutions to offer such massive performance improvements in electronic trading is that processes traditionally handled by software run directly in FPGA logic.

These advantages over software-based approaches come from offloading the following functions to the FPGA itself:

  1. Handling of TCP/IP messages
  2. Decoding FAST or similar exchange-specific protocols and stripping out the relevant data
  3. Making trading decisions without incurring any kernel-based interrupt delay
  4. Mitigating risk by managing order books and trade logging within the FPGA

Due to these differences, FPGA-based solutions provide ultra-low latency feed handling as well as faster order execution and risk assessment. They also deliver high performance per watt, minimizing energy and thermal requirements. Another advantage of FPGA solutions is the ability to scale out into "FPGA farm" deployments.


A key part of the FPGA-based approach is the combination of QDR memories, which allow deterministic memory access rates, with properly optimized VHDL code. The two most critical data sets that need to be maintained in the FPGA's memory are stock information for maintaining order books and data-and-timestamp logging for risk analysis. Each places different requirements on the cache memory. Data-and-timestamp logging of packets keeps an accurate record of trade decisions so that past events can be reconstructed and learned from. The granularity needed for these records is in the tens of nanoseconds, which makes memory latency (i.e., the time lag between presenting the memory with an address and getting the data out on the data bus) highly critical.
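As a rough illustration of what such a log entry might look like, the sketch below defines a hypothetical record in C. The field names and widths are assumptions for illustration only and are not taken from any particular trading platform; the nanosecond-resolution timestamp is what reflects the tens-of-nanoseconds granularity requirement described above.

```c
#include <stdint.h>

/* Hypothetical trade-log record; field names and widths are illustrative
 * assumptions, not a real exchange or platform format. */
typedef struct {
    uint64_t timestamp_ns;   /* capture time, nanosecond resolution      */
    uint32_t symbol_id;      /* numeric ID of the traded instrument      */
    uint32_t order_id;       /* order identifier for event reconstruction */
    int32_t  price_ticks;    /* price in exchange tick units             */
    uint32_t quantity;       /* number of shares or contracts            */
    uint8_t  side;           /* 0 = buy, 1 = sell                        */
    uint8_t  decision;       /* 0 = hold, 1 = submit, 2 = cancel         */
} trade_log_entry_t;
```

Writing one such record per decision, at rates set by the market feed rather than by the logging hardware, is what makes low, deterministic write latency so important.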

The other data set, the order book, is a database of all orders, with symbols and prices, that the trading system needs to maintain. This database is usually a small subset of all the instruments the exchange carries, limited to the stocks of interest to the firm's clients. The order book must be updated and accessed simultaneously based on information received from the exchange and from clients. The relevant data in the order book is compared with the data received from the exchange and, based on the trading algorithm, a decision to buy, sell, or hold the instrument is made.
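A minimal C sketch of this read-compare-update access pattern is shown below. The struct layout and the placeholder decision rule are purely illustrative assumptions; a real order book and trading algorithm are far more involved, and in an FPGA design the table would sit in external SRAM (typically indexed by a hashed symbol) rather than a C array.

```c
#include <stdint.h>

/* Hypothetical order-book entry for one instrument (illustrative only). */
typedef struct {
    uint32_t symbol_id;       /* instrument identifier                   */
    int32_t  best_bid_ticks;  /* highest outstanding buy price           */
    int32_t  best_ask_ticks;  /* lowest outstanding sell price           */
    uint32_t bid_quantity;
    uint32_t ask_quantity;
} book_entry_t;

enum decision { HOLD, BUY, SELL };

/* Simplified decision step: compare an incoming quote against the stored
 * book entry and apply a trivial placeholder rule. This only shows the
 * random read-compare-write pattern the memory must support. */
static enum decision on_market_update(book_entry_t *book, uint32_t symbol_id,
                                      int32_t bid_ticks, int32_t ask_ticks)
{
    book_entry_t *e = &book[symbol_id];      /* random read into the book  */
    enum decision d = HOLD;

    if (ask_ticks < e->best_bid_ticks)       /* priced below our best bid  */
        d = BUY;
    else if (bid_ticks > e->best_ask_ticks)  /* priced above our best ask  */
        d = SELL;

    e->best_bid_ticks = bid_ticks;           /* random write to update it  */
    e->best_ask_ticks = ask_ticks;
    return d;
}
```

Each incoming market-data message triggers one such read and one such write to an essentially unpredictable address, which is exactly the access pattern discussed next.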

Since the input data stream from stock exchanges does not arrive in a deterministic, sequential manner, the memory accesses needed to implement a trading strategy are also random: small bursts of data that must be written and retrieved with minimum latency. In memory parlance, this ability to perform random accesses is measured by a metric called the Random Transaction Rate (RTR). RTR is the number of random read or write accesses a memory can support in a given timeframe, expressed in transactions per second (for example, MT/s or GT/s). In most memories the random access time is defined by the cycle time (tRC), so the maximum RTR is approximately the inverse of tRC (1/tRC).
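To make the relationship concrete, the minimal sketch below computes the approximate maximum RTR from tRC, using the tRC figures quoted later in this article for SDRAM (~48 ns) and RLDRAM 3 (8 ns).

```c
#include <stdio.h>

/* Maximum random transaction rate is approximately 1 / tRC.
 * With tRC given in nanoseconds, 1000 / tRC yields MT/s. */
static double max_rtr_mts(double t_rc_ns)
{
    return 1000.0 / t_rc_ns;
}

int main(void)
{
    printf("SDRAM   (tRC = 48 ns): %.0f MT/s\n", max_rtr_mts(48.0)); /* ~21  */
    printf("RLDRAM3 (tRC =  8 ns): %.0f MT/s\n", max_rtr_mts(8.0));  /* 125  */
    return 0;
}
```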

The choice of cache memory can often limit the full capabilities of FPGA-based hardware. Most FPGA-based systems use traditional DRAM solely because of its cost advantage and higher density. However, these memories are comparatively slow and prone to soft errors. Given the volume of trades these systems undertake every second, speed and reliability cannot be compromised.

Consider the two most widely used DRAM options from a pure technology perspective: Synchronous DRAM (SDRAM) and Reduced Latency DRAM (RLDRAM). SDRAM tRC has not evolved substantially over the past 10 years (nor is it expected to going forward) and stands at ~48 ns, which corresponds to a 21 MT/s RTR. Other DRAM-based memory devices have been designed to improve tRC at the expense of density. For example, RLDRAM 3 has a tRC of 8 ns, which corresponds to a 125 MT/s RTR. Essentially, DRAMs are optimized for sequential accesses by deterministic computation algorithms, but high-frequency trading does not work that way.

A better alternative is synchronous static RAM (SRAM). Although DRAM-based memories offer higher capacity, they fail to meet the latency and random-access performance that trading platforms demand from cache memory. SRAMs have been the memory of choice for most high-performance applications for decades; in terms of random transaction rate, an SRAM-based solution can be faster than a typical DRAM-based solution by up to a factor of 24 (2,132 MT/s versus 88 MT/s in the table below).

Among SRAMs, the QDR family offers the highest random-access performance of any form of memory, and QDR SRAMs are built explicitly for bursts of random accesses. With separate dedicated read and write ports, QDR memories are ideal for balanced read/write workloads such as order book management. The latest QDR SRAMs, such as QDR-IV from Cypress, go one step further and offer two bidirectional ports. This makes QDR-IV highly efficient when the read/write mix is not balanced, as is the case for operations such as lookups for TCP/IP handling and feed handling.
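The benefit of two bidirectional ports can be illustrated with a simplified model, sketched below. The assumptions are mine, not a vendor specification: each port sustains one transaction per clock, and bus-turnaround and refresh overheads are ignored. With dedicated read and write ports, whichever port saturates first limits the total rate; two bidirectional ports can serve any mix.

```c
#include <stdio.h>

/* Back-of-the-envelope model of sustained transaction rate vs. read/write mix.
 * QDR-II+ style: one dedicated read port + one dedicated write port.
 * QDR-IV style:  two bidirectional ports, each able to read or write.      */
static double dedicated_ports_rate(double port_rate, double read_fraction)
{
    double write_fraction = 1.0 - read_fraction;
    double limit_by_reads  = port_rate / read_fraction;   /* read port cap  */
    double limit_by_writes = port_rate / write_fraction;  /* write port cap */
    return limit_by_reads < limit_by_writes ? limit_by_reads : limit_by_writes;
}

static double bidirectional_ports_rate(double port_rate, double read_fraction)
{
    (void)read_fraction;          /* mix doesn't matter: either port serves either */
    return 2.0 * port_rate;
}

int main(void)
{
    double port_rate = 1066.0;    /* MT/s per port, illustrative value */
    for (double rf = 0.5; rf <= 0.91; rf += 0.2) {
        printf("read fraction %.1f: dedicated %.0f MT/s, bidirectional %.0f MT/s\n",
               rf, dedicated_ports_rate(port_rate, rf),
               bidirectional_ports_rate(port_rate, rf));
    }
    return 0;
}
```

In this model, a 50/50 mix keeps both architectures at twice the per-port rate, but at a 90/10 read-heavy mix the dedicated-port architecture drops to roughly 1.1 times the per-port rate while the bidirectional one stays at 2 times.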

The table below provides a comparison of the various core memory technology solutions:

Feature                    QDR-IV       QDR-II+      RLDRAM3      DDR3 DRAM
Read Latency (ns)          7.5          4.5          8            47.6
Max. RTR (MT/s), banked    2,132        900          1,000        88
Max. Frequency (MHz)       1,066        550          1,066        1,066
Bandwidth (Gbps)           154          97           77           34
I/O Ports                  2 R/W        1 R + 1 W    1 R + 1 W    1 R + 1 W

QDR-IV memory delivers an RTR of 2,132 MT/s with a latency of 7.5 ns. Given how critical random access performance is for FPGA solutions, these memories help drastically lower overall tick-to-trade latency. The high operating frequency and dual-port operation of this SRAM enable ultra-low latency packet buffers built for demanding network environments. In addition, the unrivalled random transaction rate of QDR-IV facilitates custom applications where immediate lookups into large tables or other data structures are required. While DRAM is better suited to storing large amounts of information for data logging, a high-performance SRAM can work in conjunction with it, holding computational lookup or cache data on the latency-critical path.
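As a sanity check on the bandwidth row in the table above, the short calculation below reproduces the ~154 Gbps figure for QDR-IV. It assumes a x36 data interface and double-data-rate transfers on each of the two ports; the x36 width is my assumption, since the table does not state it.

```c
#include <stdio.h>

/* Rough check of the QDR-IV bandwidth figure in the comparison table.
 * Assumptions not stated in the table: x36 data interface, DDR transfers. */
int main(void)
{
    double clock_mhz  = 1066.0;  /* max frequency from the table        */
    int    ports      = 2;       /* two bidirectional ports             */
    int    width_bits = 36;      /* assumed x36 configuration           */
    int    ddr_factor = 2;       /* two transfers per clock cycle       */

    double gbps = clock_mhz * 1e6 * ports * width_bits * ddr_factor / 1e9;
    printf("Approximate QDR-IV peak bandwidth: %.0f Gbps\n", gbps); /* ~154 */
    return 0;
}
```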

The graphic below compares the RTR performance of different memory technologies:


Figure 3. RTR comparison of different memory technologies (Source: Cypress)

Apart from the RTR and latency advantages, many SRAMs are also incorporating a host of new features, such as Error Correction Code (ECC) for higher reliability, and On-Die Termination (ODT) and de-skew training for improved signal integrity.

Given the competitive advantage a few nanoseconds can make, the type of memory used is a critical aspect of building a custom FPGA-based solution. Because of the inherent advantages of QDR-based memories explained above, many FPGA vendors are adopting QDR memory support in their latest-generation high-performance FPGAs, giving traders who use them an early-mover advantage over those relying on legacy memory solutions. QDR memories are supported by leading FPGA vendors such as Altera and Xilinx: Altera has announced QDR-IV support for its Arria 10 FPGAs, and Xilinx is expected to announce similar support for UltraScale in the coming quarter.

For more details on QDR memories refer to the QDR consortium website and the Cypress QDR-IV page.
