Analyzing multithreaded applications—Identifying performance bottlenecks on multicore systems

This Multicore Expo paper accompanies the class “ME865: Analyzing Multi-threaded Applications—Identifying Bottlenecks on Multicore Systems” to be held on May 3, 2011 in San Jose, CA.

Abstract
Several factors prevent applications from achieving the theoretical maximum utilization of multicore resources: the operating system (scheduling, synchronization, etc.), the application code (parallelization factor, data/function decomposition, etc.), and the scalability of the hardware architecture (cores, memory subsystem, interconnects, etc.).

We use multithreaded execution scenarios generated through EEMBC's MultiBench as stimulus and introduce a step-by-step methodology to analyze these scenarios and identify the bottlenecks. Techniques for kernel tracing, time/function profiling, etc., and the tools used to deploy the methodology are discussed next. The paper ends with a discussion of several case studies, each representing a different bottleneck.

Introduction
Traditionally, increasing operating frequencies and shrinking device geometries satisfied the ever-increasing hunger for processing power. However, this approach ran into the power wall and the memory wall [1], which have made it impractical to build economically viable products this way. The initial advantages gained by exploiting instruction-level parallelism (ILP) have also started to fade. These factors, along with the fact that most real-world applications are naturally parallel, have led to an irreversible shift from uniprocessor systems to multicore systems. Today, multicore architectures are used across application domains including high-performance computing, general-purpose computing, embedded, networking, and graphics.

One of the biggest challenges in moving from sequential to parallel computing is extracting the full performance gain offered by the multicore system. Many applications fall short of the theoretical maximum utilization of multicore resources. The root cause can lie at various levels of the system:

  • Hardware layer: performance of cores, scalability of interconnects and the memory subsystem.
  • OS layer: scheduling overheads, load imbalance, synchronization overheads, etc.
  • Application layer: parallelization factor of the code, data/function decomposition, etc.

A good methodology for analyzing multithreaded execution and identifying bottlenecks must encompass all of these layers. It not only helps application developers write better-performing applications; it also helps system architects define better products. We present a detailed step-by-step methodology to analyze multithreaded applications and identify performance bottlenecks.

Choice of benchmark
The preferred way to study the performance of a system is to observe how it behaves under the stimulus of a benchmark workload. To develop an effective performance analysis methodology, it is important to choose a representative multithreaded benchmark. The desired characteristics for the benchmark are [2]:

  1. Ability to generate wide-ranging execution scenarios: Since the performance bottleneck can lie at various levels, the benchmark must be able to generate a wide range of behaviors:
    • Concurrent execution, data decomposition, function decomposition.
    • Dependencies—data sharing and data exchange.
    • Synchronization—from coarse-grained to fine-grained granularity.
    • Stretching the memory subsystem—high volumes of data traffic, varied access patterns, and varied data set sizes.
  2. Portability: The benchmark should be easy to port and run on different platforms, including cycle-accurate simulators.
  3. Real-life applications: It is an advantage if the benchmark imitates real-life applications.

EEMBC MultiBench [3] satisfies the above requirements and is our benchmark of choice. MultiBench is a suite of benchmarks created by EEMBC to measure and analyze the performance of platforms capable of executing in multiple hardware contexts. MultiBench provides multiple workloads that can be used to expose the strengths and weaknesses of multicore architectures under various parallel and multithreaded execution scenarios. The next section provides a brief introduction to the EEMBC MultiBench terminology used throughout the rest of the paper.

A. EEMBC MultiBench terminology

  • Kernel—An embedded algorithm.

  • Dataset—Input to a kernel.

  • Work Item—A kernel working on a specific dataset.

  • Workload—A collection of work items.

  • Workers—Multiple software threads/processes working in parallel within one kernel, on a single dataset.

  • Contexts—Multiple copies of the same kernel working in parallel (on different datasets).

  • Scaling with Workers (Parallel Mark)—Only one copy of a Work Item, but the number of workers (threads) working on the Work Item's dataset in parallel is flexible. This primarily covers data decomposition and demonstrates a processor's support for fine-grained parallelism. A w=N execution scenario indicates N worker threads working in parallel.

  • Scaling with Contexts (Multi Mark)—Multiple copies of the same Work Item executed concurrently on different hardware resources (e.g., cores), each processing a separate dataset. This demonstrates how well a solution scales over scalable data inputs. A c=N execution scenario indicates N contexts working in parallel.

Figure 1. Typical MultiBench Benchmark Workload Format (this figure shows an example using pthreads; MultiBench also provides several other means, including OpenMP).
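A minimal sketch of the general shape of such a workload, assuming a pthread-based harness, is shown below; all names here are hypothetical illustrations, not MultiBench APIs. One work item (a kernel plus its dataset) is split across W worker threads, illustrating data decomposition:

#include <pthread.h>
#include <stddef.h>

#define W 4                          /* number of workers (w = 4) */
#define DATASET_SIZE 4096

static unsigned char dataset[DATASET_SIZE];

struct slice { size_t begin, end; }; /* a worker's share of the dataset */

static void *worker(void *arg)
{
    struct slice *s = arg;
    for (size_t i = s->begin; i < s->end; i++)
        dataset[i] ^= 0x5a;          /* stand-in for the kernel's real work */
    return NULL;
}

int main(void)
{
    pthread_t tid[W];
    struct slice part[W];
    size_t chunk = DATASET_SIZE / W;

    /* data decomposition: split one dataset across W worker threads */
    for (int i = 0; i < W; i++) {
        part[i].begin = (size_t)i * chunk;
        part[i].end   = (i == W - 1) ? DATASET_SIZE : (size_t)(i + 1) * chunk;
        pthread_create(&tid[i], NULL, worker, &part[i]);
    }
    /* join acts as the end-of-work-item barrier */
    for (int i = 0; i < W; i++)
        pthread_join(tid[i], NULL);
    return 0;
}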

Analysis steps and tools

  1. Measurement: If the performance of the benchmark workload in the w=1, c=1 scenario is P1 and its performance in the w=N or c=N scenario is PN, then scaling S = PN/P1. Both PN and S are important performance indicators. They are measured by running the benchmark workload on the given silicon or performance model.
  2. Set the expectation: This step involves finding the 'parallelization factor' of the workload of interest and calculating the expected performance in the w=N or c=N scenario. This is done through manual code inspection, function profiling, and time profiling with oprofile.
  3. Situate the bottleneck: This step involves finding the time distribution of the w=N or c=N execution and identifying the preliminary location of the bottleneck (kernel or user space). This is done through time profiling using vmstat and oprofile.
  4. Kernel tracing and OS overhead extraction: If the preliminary location of the bottleneck is in kernel space, this step is carried out. It involves collecting kernel traces and analyzing the sequence of events. Kernel traces are extracted with LTTng instrumentation provided through Wind River System Viewer. Analysis of these traces is carried out with self-developed, independent post-processing scripts.
  5. Hardware profiling: If the preliminary bottleneck lies in user space, this step is carried out. It consists of collecting performance monitor statistics for various parts of the hardware and analyzing them to pinpoint the hardware bottleneck. This is mainly done with proprietary tools.
  6. Validating the conclusion: If required, the conclusion is validated by changing relevant hardware settings or OS configurations and observing the effect on performance.

Setting the expectation
A. Finding the parallelization factor
Parallelization factor (p) is defined as the fraction of the total code execution time, as measured in the sequential execution scenario, that can be executed in parallel. The first step toward finding the parallelization factor is to prepare a function call graph of the workload of interest and identify the functions that form the serial portion and the parallel portion of the code. This can be done with a profiling tool that provides a static function call graph; in this case it is done through oprofile and some manual code inspection. The next step is to find the time spent in each of the functions, including library function calls in the case of dynamically linked libraries. This is also done through time profiling with oprofile.
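Concretely, if $T_{total}$ is the total execution time measured in the sequential (w=1, c=1) run and $T_{par}$ is the portion of that time spent in functions identified as parallelizable, then

$$ p = \frac{T_{par}}{T_{total}}, \qquad 0 \le p \le 1. $$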

Figure 2. Example of a static function call graph

B. Estimating the performance
The fact that a portion of the code cannot be executed in parallel means that perfect scaling is not achievable in every case. There are two established methods to predict performance in multithreaded execution scenarios: Amdahl's law and Gustafson's law [4, 5]. Amdahl's law makes the underlying assumption that the problem size remains the same in both the sequential and parallel execution scenarios. Gustafson's law assumes that the problem size increases in the parallel execution scenario. In MultiBench, generally speaking, the problem size with respect to the user work remains the same in the w=1, c=1 and w=N or c=N cases, so Amdahl's law is used to predict performance in the w=N or c=N cases.

Figure 3. Amdahl's Law
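In its standard form, with parallelization factor p and N parallel threads, Amdahl's law bounds the achievable scaling as

$$ S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}} $$

so even with unbounded N, scaling can never exceed 1/(1 - p).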


However, the scaling predicted by Amdahl's law is an optimistic upper limit on performance at w=N or c=N, because it does not account for the possible increase in kernel work to manage and schedule N threads, overheads due to dependencies between threads, increases in user time due to hardware bottlenecks, or load and work imbalance.


Extracting kernel overheads
If the preliminary profiling shows that the bottleneck is in kernel space, further investigation of kernel behavior is done with the help of LTTng traces collected during workload execution. These traces are generated through Wind River System Viewer logs. The log shows various interrupts, kernel events, system calls, traps, etc.; however, there is no straightforward way to extract from it the time spent on various tasks. Each log entry typically consists of an event name, the corresponding timestamp, the CPU ID on which the event executed, the context that called the event, and relevant parameters passed to/from the event. Independent post-processing scripts are run on these logs to identify predefined sequences and categorize them into various kernel overheads (a sketch of this categorization follows the list below). The following useful information is extracted through this analysis:

  • Number of threads spawned by the user process.
  • % time spent in thread creation.
  • % time spent in thread scheduling.
  • % time spent in various traps, syscalls and interrupts.    
  • % time spent by a thread waiting for an event/locks etc.
  • % time spent in useful execution by a thread.
  • % of total process time spent on each core to identify load imbalances.

A. Multithread execution overheads

Figure 4. Multithread execution overheads

For a MultiBench workload executed for M iterations with w=N workers or c=N contexts (N parallel threads) on a fully subscribed system, the following quantities are extracted:

  • Total time spent executing the serial part of the code.
  • Total idle induced by the serial part of the code.
  • Total idle because not all threads started execution at the same time.
  • Total idle due to barriers.
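In outline, these quantities take the following form. This is a sketch in our own notation rather than the original expressions: the serial portion of iteration $i$ takes time $t_{s,i}$, and thread $j$ of iteration $i$ starts at $t^{start}_{i,j}$ and reaches the end-of-iteration barrier at $t^{bar}_{i,j}$:

$$ T_{serial} = \sum_{i=1}^{M} t_{s,i}, \qquad T^{serial}_{idle} = (N-1) \sum_{i=1}^{M} t_{s,i} $$

$$ T^{start}_{idle} = \sum_{i=1}^{M} \sum_{j=1}^{N} \Big( t^{start}_{i,j} - \min_{k} t^{start}_{i,k} \Big), \qquad T^{barrier}_{idle} = \sum_{i=1}^{M} \sum_{j=1}^{N} \Big( \max_{k} t^{bar}_{i,k} - t^{bar}_{i,j} \Big) $$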


B. Synchronization overheads
User-level synchronization primitives like mutexes, spinlocks, barriers, etc., get translated to the futex system call [6].
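For example, with the NPTL implementation of pthreads on Linux, an uncontended pthread_mutex_lock is resolved entirely in user space with an atomic operation; the futex() system call is made only when the lock is contended, which is what makes these waits visible in the kernel traces. A minimal runnable sketch:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);   /* contended path enters futex(FUTEX_WAIT) */
        counter++;
        pthread_mutex_unlock(&lock); /* waiter present: futex(FUTEX_WAKE) */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* 2000000 */
    return 0;
}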

Figure 5. Synchronization overheads

The following quantities are then computed:

  • Total idle induced due to synchronization.
  • Total kernel time spent handling synchronization.
  • Total idle accounted for.
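In the same sketch notation as above (again ours, not the original expressions), with thread $j$ entering its $k$-th futex wait at $t^{in}_{j,k}$ and being woken at $t^{out}_{j,k}$, and $t^{futex}$ denoting time spent inside futex syscall handling:

$$ T^{sync}_{idle} = \sum_{j} \sum_{k} \big( t^{out}_{j,k} - t^{in}_{j,k} \big), \qquad T^{sync}_{kernel} = \sum t^{futex} $$

$$ T^{total}_{idle} = T^{serial}_{idle} + T^{start}_{idle} + T^{barrier}_{idle} + T^{sync}_{idle} $$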

C. Syscalls, traps, and interrupts overheads
The kernel also handles various file-handling system calls, memory-management system calls, traps for page faults, timer tasks, and various soft and hard IRQs. These overheads are calculated in the same way as in the equations above.

Analyzing for hardware bottlenecks
If the initial time profile shows an increase in user time, it is most likely due to a hardware bottleneck, because with MultiBench, moving from w=1, c=1 to w=N or c=N does not increase the user work. An increase in user time therefore typically indicates a decrease in IPC. There are various possible bottlenecks in hardware, and one needs to browse carefully through the system to identify them. The typical starting point is to find the 'hot spot' in the application code through time profiling. The next step is to determine the preliminary characteristics of the code through instruction mix, cache miss rates, etc. These two steps give a good idea of the further analysis direction. For example, for a memory-intensive workload, the next steps are to collect performance monitor statistics for higher-level caches, interconnect latencies, DDR utilization, and so forth.
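The reasoning rests on a simple identity: for a fixed dynamic instruction count $I$ executed at core frequency $f$,

$$ t_{user} = \frac{I}{\mathrm{IPC} \cdot f} $$

so if moving to w=N or c=N leaves $I$ and $f$ unchanged, any growth in user time maps directly to a drop in IPC, and the hardware performance counters can then localize its cause.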

Most of the hardware analysis is done through collecting statistics from performance monitor counters available at various places on the system. We use oprofile and proprietary tools for this purpose.

If required, the conclusion can be validated by changing the code, cache settings, DDR settings, or operating frequencies, and measuring the sensitivity of performance to these changes.

Case studies
The numbers and statistics shown in this section do not represent best-case performance of the device. They merely represent scenarios generated to create bottlenecks.

A. ipres w=8
Parallelization factor of the workload p = 0.71
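Applying Amdahl's law from the previous section, the best-case expected scaling at w=8 is:

$$ S(8) = \frac{1}{(1 - 0.71) + \dfrac{0.71}{8}} \approx 2.64 $$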


Time profiling shows that for w=1 the user time is 9.14 seconds, while for w=8 it is 10.63 seconds. At w=8, however, there is a huge kernel overhead, so the kernel overheads are extracted first.

Figure 6. ipres w=8 time distribution

Table 1. ipres w=8 kernel overheads

The ipres case study clearly shows the diminishing returns yielded by a poor parallelization factor.

B. ippktcheck w=8
Parallelization factor of the workload p = 0.87
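By Amdahl's law, the best-case expected scaling at w=8 is:

$$ S(8) = \frac{1}{(1 - 0.87) + \dfrac{0.87}{8}} \approx 4.19 $$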


Time profiling shows that for w=1 the user time is 5.07 seconds, while for w=8 it is 10.46 seconds, a 106.1% increase in user time. At w=8 there is also a huge kernel overhead, so as a first step the kernel overheads are extracted.

Figure 7. ippktcheck w=8 time distribution

In the ippktcheck case, not all iterations execute the same way. The following example iterations illustrate this in more detail:

  1. Example iteration 1: Only one thread gets scheduled in time and finishes checking all the headers in the rx_queue. The rest of the threads, when they get scheduled, see the queue as empty and exit.
    a) Time taken to complete iteration: 0.00318 seconds
    b) Synchronization overheads: 0%
  2. Example iteration 2: Three threads get scheduled in time and finish checking all the headers in the rx_queue. The rest of the threads, when they get scheduled, see the queue as empty and exit. The threads that do get scheduled waste a lot of time moving packets to/from queues due to synchronization.
    a) Time taken to complete iteration: 0.00418 seconds
    b) Synchronization overheads: 31% (for the threads that process headers)
  3. Example iteration 3: All threads get scheduled, but they waste a lot of time moving packets to/from queues due to synchronization.
    a) Time taken to complete iteration: 0.00431 seconds
    b) Synchronization overheads: 48%

The above examples show a clear trend: the higher the number of threads, the lower the performance, due to increasing synchronization overheads. They also show that the scheduler is not always able to schedule all eight threads in time.

Table 2. ippktcheck w=8 kernel overheads

Further time profiling indicates that the increase in user time is attributable to time spent handling synchronization function calls. The ippktcheck code has very fine-grained synchronization with a very low computation-to-synchronization ratio; hence synchronization becomes one of the major bottlenecks, increasing both kernel- and user-space overheads.

This workload serves as an extreme case that exposes limitations of Amdahl's law.

C. ipres c=8
ipres is a memory-intensive workload; the instruction mix shows that loads/stores make up 48% of executed instructions. Profiling shows that in moving from c=1 to c=8 there is a 22% increase in user time and a corresponding decrease in IPC. There is also a 20% increase in cache misses and in data traffic going to DDR, and at c=8 the rate of traffic going to the interconnect and DDR increases by 8x. This scenario therefore creates bottlenecks in the memory subsystem and increases memory access latencies. This has been validated through statistics derived from our proprietary tool as well as through indirect methods like checking the sensitivity of ipres to DDR frequency, various DDR performance-enhancement configurations, ipres data set size, etc.

Conclusion
In summary, we have demonstrated a comprehensive methodology to identify performance bottlenecks in multicore systems. We have also presented a way to extract the kernel overheads impacting multithreaded execution. We have put this methodology to work to accurately explain the performance of the system under various multithreaded execution scenarios. The case studies have shown the limitations of theoretical estimation methods and have demonstrated the need to analyze the entire system to identify the bottlenecks.

Nandan Tripathi is a senior design engineer at Freescale Semiconductor. He is currently part of the System Performance Analysis team, focusing on the analysis of multicore systems and multithreaded applications. In the past he worked on the system architecture of wireless systems at Freescale Semiconductor. He has authored/co-authored various patent applications. Nandan holds a master of science degree from the University of Southern California.

Amrit Singh is a design manager at Freescale Semiconductor. He has been a part of Motorola/Freescale for more than 10 years, working in areas such as IP micro-architecture/design/verification, wireless PHY algorithm design and, most recently, performance analysis of networking processors. Currently a manager in the System & Performance Analysis team in India, he leads the core benchmarking and analysis efforts. He holds several patents and publications in IP design and wireless communications. Amrit holds a bachelor's degree in electronics and communications engineering from Panjab University, Chandigarh.

References

  1. Krste Asanovic et al., "The Landscape of Parallel Computing Research: A View from Berkeley," Technical Report, EECS, University of California at Berkeley, December 2006.
  2. Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li, "The PARSEC Benchmark Suite: Characterization and Architectural Implications," Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, October 2008.
  3. Shay Gal-On, "Multicore Benchmarks Help Match Programming to Processor Architecture," Multicore Expo, April 2008.
  4. Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Spring Joint Computer Conference, 1967.
  5. John Gustafson, "Reevaluating Amdahl's Law," Communications of the ACM 31(5), 1988.
  6. Ulrich Drepper, "Futexes Are Tricky," August 2009.
