Your cache configuration can have a bigger impact on overall system performance than you expect.
Many CPU manufacturers now deliver two cores per processor socket, with four cores on the way, and even more to follow. But harnessing multicore processor performance depends on the software developers' ability to make efficient use of these additional cores. Multicore processors offer developers more avenues to increase their systems' compute performance. With more cache memory available, it's possible to optimize the locality of data references, which in turn drives higher cache hit rates.
Most high-performance CPUs implement cache memory to improve performance. To lower data latency, the execution unit is surrounded by small pieces of high-speed SRAM, known as cache memory. The execution unit can access cache memory about 80 times faster than system memory.
A simplified block diagram of a CPU and its cache implementation is shown in Figure 1.
Data retrieved from system memory traverses the memory controller, the system bus, the Level 2 cache, and the Level 1 caches. The L1 data cache interfaces directly to the execution unit while the L1 instruction cache interfaces to the instruction fetch/decode stage. The close proximity of the L2 and L1 caches to the instruction execution pipeline significantly increases performance.
Performance-analysis tools are available to measure the cache hit rate, which indicates the percentage of time the execution unit is fed from the cache. Cache hit rate is a measure of cache efficiency, and it often correlates to the application program's reference locality, which refers to the degree to which a program's memory accesses are limited to a relatively small number of addresses. Conversely, a program that accesses a large amount of data from scattered addresses is less likely to use cache memory efficiently. For example, a packet-processing application that's performing TCP reassembly on a large number of TCP flows is likely to access a large amount of data over many memory locations. This results in a lower locality of reference than an application that's performing TCP reassembly on a smaller number of TCP flows.
The transition to multicore processors promises more than an increase in the number of execution cores and computation capability. It offers additional flexibility for development and optimization of higher performance applications. In particular, multicore architectures can significantly improve program flow so that cache memory associated with each individual execution core is used more effectively. With multiple caches available to software developers, it's possible to optimize data locality, driving higher cache hit rates and improving overall application performance.
To illustrate packet-processing performance improvements from pipelining and flow-pinning, Intel's Communications Technology Lab parallelized and tested the Snort application running on one and four execution cores. Snort is an open-source network intrusion-prevention and -detection system that uses a rule-driven language to combine the benefits of signature, protocol, and anomaly-based inspection methods. (See www.snort.org for details on this security application.)
The Snort dataflow is composed of five functional processes shown in Figure 2.
The PCAP (packet capture) library processes incoming network traffic by capturing packets in their raw form. PCAP is an open-source network-traffic capture library used by many BSD, Linux, and UNIX applications. Next, the packets are passed to the packet decoder, which identifies the protocol and breaks down the packet according to its particular data-link specification. In stage three, preprocessors prepare packets for stage four, the detection engine: they first reassemble IP fragments and TCP streams, then check packets for suspicious activities that fall outside the detection engine's scope, possibly modifying suspect packets so the detection engine can process them more effectively. The detection engine performs a series of tests to detect intrusions based on a set of user-defined intrusion signatures. Finally, reports of suspicious activities are generated by one or more output plug-ins.
Migrating to four cores
The Snort application is C code that runs on many Unix flavors, including Linux. To parallelize this application, we used the POSIX threading library (pthreads). This library lets programmers create or destroy threads and use various synchronization primitives. For example, pthreads includes function calls that can manipulate mutexes that protect shared data structures.
Our first attempt to multithread Snort was to run the application code symmetrically on all four cores. Each thread runs the same loop, reading packets from the device and processing these packets independently through the Snort dataflow. This approach is shown in Figure 3.
Implementing this approach required some effort to make the data structures in Snort thread-safe. For example, one of Snort's preprocessors maintains a large TCP reassembly table. This data structure holds TCP segments from various TCP connections as they're reassembled into larger buffers, which are passed through the detection engine to look for attacks. The table can't safely be read and written simultaneously by multiple threads, so we modified the code to acquire a mutex before accessing the data structure and to release that mutex once the access is complete.
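The locking pattern can be sketched as follows, using a much-simplified, hypothetical stand-in for the reassembly table rather than Snort's actual structures:

```c
#include <pthread.h>
#include <string.h>

#define MAX_SEGS 64

/* Hypothetical, greatly simplified stand-in for Snort's
 * TCP reassembly table. */
struct reassembly_table {
    pthread_mutex_t lock;       /* guards nsegs and segs */
    int nsegs;
    char segs[MAX_SEGS][64];
};

/* Thread-safe insert: acquire the mutex, mutate the shared
 * structure, release the mutex. Returns 1 on success. */
int table_insert(struct reassembly_table *t, const char *seg)
{
    int ok = 0;
    pthread_mutex_lock(&t->lock);
    if (t->nsegs < MAX_SEGS) {
        strncpy(t->segs[t->nsegs], seg, sizeof t->segs[0] - 1);
        t->segs[t->nsegs][sizeof t->segs[0] - 1] = '\0';
        t->nsegs++;
        ok = 1;
    }
    pthread_mutex_unlock(&t->lock);
    return ok;
}
```

Correct, but as the results below show, time spent blocked on this mutex is time not spent doing productive work.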
We compared the performance of this approach with the original single-threaded version shown in Figure 4.
The results, shown in Tables 1 and 2, indicate little performance change.
The processing of both packet traces performed slightly worse on four cores than on one. Based on the analysis we conducted using Intel's VTune Performance Analyzer, we concluded that two primary reasons explain this disappointing performance. First, the code spent a lot of time blocked on synchronization for the TCP reassembly data structure. Because this time isn't being spent doing productive work, performance suffers. Second, the TCP reassembly data structure is frequently written by all threads, which hurts cache efficiency. In many cache coherency designs, when a processor core writes to a cache line, that line must be invalidated in all other caches to avoid inconsistencies. Thus, subsequent reads or writes to this data by other cores result in cache misses. Because Snort performs frequent writes to its TCP reassembly data structure, sharing this data structure between cores greatly decreases cache efficiency.
To alleviate these performance issues, we functionally decomposed the application across four cores using pipelining and flow-pinning, as shown in Figure 5. Again using the pthreads library, we pipelined the first Snort stage, packet capture, and ran it on one execution core.
Functional pipelining is a technique that subdivides application software into multiple sequential stages and assigns these stages to dedicated execution units. The first execution unit runs the first application stage and then passes the intermediate results to a second execution unit that runs the next application stage, and so on. Pipelining can increase reference locality because each execution unit runs a subset of the entire application, potentially increasing the cache hit rate associated with executing instructions.
In this case, we passed packets between threads using a simple circular buffer in shared memory. Because threads share the same address space, we implemented the circular buffer as an array of packet pointers and didn't need to store or copy the packet data itself. And because each buffer has only one reading thread and one writing thread, the enqueue and dequeue operations can be performed without synchronization. For example, the simplified code snippet in Listing 1 shows the enqueue operation. This operation can be done while another thread is reading from the circular buffer but requires synchronization if another thread is writing to the buffer.
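A minimal sketch of such a single-reader/single-writer circular buffer follows; the names are hypothetical rather than Snort's actual code, and one slot is left empty to distinguish full from empty:

```c
#include <stddef.h>

#define RING_SIZE 256

/* Single-producer/single-consumer ring of packet pointers.
 * head is advanced only by the consumer and tail only by the
 * producer, so with exactly one reader and one writer no
 * mutex is required. */
struct pkt_ring {
    void *slots[RING_SIZE];
    volatile size_t head;   /* next slot to dequeue */
    volatile size_t tail;   /* next slot to enqueue */
};

/* Enqueue a packet pointer; returns 0 if the ring is full. */
int ring_enqueue(struct pkt_ring *r, void *pkt)
{
    size_t next = (r->tail + 1) % RING_SIZE;
    if (next == r->head)
        return 0;                   /* full */
    r->slots[r->tail] = pkt;
    r->tail = next;                 /* publish the slot */
    return 1;
}

/* Dequeue a packet pointer; returns NULL if the ring is empty. */
void *ring_dequeue(struct pkt_ring *r)
{
    if (r->head == r->tail)
        return NULL;                /* empty */
    void *pkt = r->slots[r->head];
    r->head = (r->head + 1) % RING_SIZE;
    return pkt;
}
```

Note that only pointers move through the ring; the packet data itself is never copied. On a weakly ordered machine, a production version would also need memory barriers to order the slot write before the tail update.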
The remaining Snort functional processes all run on cores 2, 3, and 4, with TCP flows assigned to these cores using flow-pinning. Flow-pinning is orchestrated by a hash function running on core 1, which directs network traffic to cores 2, 3, and 4. The hash function spreads TCP flows across the cores as evenly as possible while minimizing the load-share mechanism's computational complexity: it simply performs an exclusive OR (XOR) of the relevant TCP fields, as shown in Listing 2.
The XOR result is then reduced modulo a prime number, coded as hash % PRIME_NUMBER with 251 as the prime, producing a hash-table index between 0 and 250. This index selects an entry in a table, Table 3, which assigns each traffic flow to one of three execution cores. With packets from the same TCP flow processed by the same core, it's possible to improve the locality of reference and cache utilization.
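A hedged sketch of this scheme follows. The field names are illustrative, and since the Table 3 mapping isn't reproduced here, a simple modulo-3 assignment stands in for it:

```c
#include <stdint.h>

#define PRIME_NUMBER 251

/* XOR the fields identifying a TCP flow, then reduce modulo a
 * prime to get an index in the range 0..250. Because XOR is
 * symmetric, both directions of a connection (src/dst swapped)
 * hash to the same index. */
unsigned flow_hash(uint32_t src_ip, uint32_t dst_ip,
                   uint16_t src_port, uint16_t dst_port)
{
    uint32_t hash = src_ip ^ dst_ip ^ src_port ^ dst_port;
    return hash % PRIME_NUMBER;
}

/* Map the 0..250 index onto one of three execution cores
 * (cores 2, 3, and 4); a stand-in for the Table 3 lookup. */
unsigned flow_core(unsigned idx)
{
    return 2 + idx % 3;
}
```

The symmetry matters: packets flowing in either direction of the same connection land on the same core, so its reassembly state never needs to be shared.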
Distributing packets based on their TCP flow delivers some performance improvements in TCP reassembly. Instead of having one large TCP reassembly table, we split the table into three without affecting the accuracy of the application. This enabled us to remove the synchronization required in our first parallel version of Snort. It also made the lookup faster, as each lookup table has only a portion of the total set of TCP connections.
Having fewer connections in the table not only reduces the number of tree traversals required to find a matching element, but also reduces the size of the cache working set, increasing cache efficiency. In this way, flow-pinning is somewhat analogous to a technique used to register conference attendees in the real world. Typically, attendees are asked to form lines based on an alphabetical range of their last names. This simple initial “classification” reduces the work required by the people performing registration. They each have a shorter list of names in which to look up attendees, and they have no need to coordinate the registration activities with their peers.
The performance results of this approach were better than either the single-threaded approach or the approach that didn't use pipelining or flow-pinning, especially for packet traces with large numbers of concurrent TCP connections.
We tested each of the three configurations with two sets of packet traces. One trace, obtained from a database server within Intel, had relatively few concurrent TCP connections, about 175. The other, obtained from the National Laboratory for Applied Network Research (NLANR) on a link that services a wide-area network, had a relatively large number of TCP connections, about 25,000. Note, however, that some large enterprise or service provider security deployments may service traffic streams on the order of 200 Mbytes/s with TCP connection flow-counts in the high hundreds of thousands.
Intel's VTune Performance Analyzer measured L2 cache hit rates. The analyzer enables run-time sampling with minimal disruption to program execution. Testing indicated that average L2 cache hit rates and associated throughput varied significantly with the number of connections serviced. For the trace with a small number of connections, performance was similar for single and multiple core configurations. In addition, implementation of pipelining and flow-pinning techniques yielded only minimal incremental throughput as shown in Table 1.
This result may be explained by the ability of the single-core system to maintain a 99% L2 cache hit rate while servicing the low number of TCP connections. The multiple-core systems incurred the additional cost involved in queuing packets among the cores.
Results were significantly different for the trace with a larger number of TCP connections. In this scenario, as Table 4 shows, the L2 cache hit rate on the system with a single execution core decreased to 41% (from 99%) and the throughput decreased dramatically.
We observed a similar performance decrease when four execution cores ran the application in parallel without pipelining or flow-pinning. Analysis showed that the application was spending much of its time competing for access to the TCP flow reassembly table. However, when pipelining and flow-pinning techniques were applied using four execution cores, the resulting throughput was more than 6.2X greater than the single-core system. This configuration also had a substantially greater L2 cache hit rate than the single-core system (70% versus 41%) as shown in Table 2.
Results indicate that multicore processors can significantly increase performance for packet-processing applications, such as intrusion detection. But care must be taken to get maximum system performance. Reducing synchronization and paying close attention to the cache effects of sharing data structures can improve the performance of parallelized code. Pipelining and flow-pinning are techniques that worked well for Snort, and have the potential to apply to other flow-based networking applications. For Snort, running the application in parallel across four execution cores using pipelining and flow-pinning generated more than 6X the throughput of a single-core configuration. In a test environment approximating “real-world” network traffic (25,000 TCP connections), this hardware/software combination increased the L2 cache hit rate significantly. Software tuning to increase L2 cache hit rate beyond the 70% recorded here could potentially yield even greater performance improvement.
Edwin Verplanke is a platform solution architect in Intel's Infrastructure Processor Division. He holds a bachelor's degree in computer science and a master's in electrical engineering. For the past 12 years, Edwin has focused on communications board design, participated in standards development covering high-speed interconnects, and, more recently, researched multicore architectures for the embedded market. He can be reached at firstname.lastname@example.org