To provide some perspective on what we discussed in Part 1, Part 2, and Part 3, this last part of the series considers the important topic of characterizing applications and architectures.
To this end, trace-driven simulation, which is widely used to evaluate computer architectures, is also useful in MPSoC design. Because we know more about the application code to be executed on an application-specific MPSoC design, we can use execution traces to refine the design, starting with capturing fairly general characteristics of the application and moving toward a more detailed study of the application running on a refined architectural model.
Because video applications are computationally intensive, we expect that more than one platform SoC will be necessary to build video systems for quite some time. For some relatively simple applications, it is possible to build a single platform that can support a wide variety of software.
However, fitting a video application on a chip often requires specializing the architecture not just to generic video algorithms but more specifically to the types of operations required for that application.
As we design video SoCs, we need to evaluate both the application and the architecture. At the early stages of the design process, we want to evaluate the application in order to understand its characteristics.
As we add detail to the design, we want to evaluate the application's properties while running on the candidate architecture. And as we design SoCs for leading-edge applications, we will use all the available computing resources.
We can squeeze extra capability out of the system with some specializations, but we need to know where to specialize the architecture. Of course, we also want the architecture to be as flexible as possible since video algorithms are still evolving.
Overspecialization may prevent an SoC from being adapted to a new algorithm, limiting its lifetime. So we need to characterize the application carefully to understand the right points for architectural assists and where more general-purpose solutions can be used.
Studies of programmable media processors have used the Impact compiler and simulation system and the MediaBench benchmark set to study both the properties of multimedia applications and processor architectures for multimedia. A few results are particularly interesting for this discussion.
First, it has been found that nearly 20% of operations are branches and that the average size of a basic block was 5.5 operations. The average size of basic blocks varied widely across the MediaBench set.
The mean basic block size was consistent with general-purpose applications, but MediaBench showed a much wider variance. Static branch prediction was found to work well; when the pgpcode benchmark was removed, the average static hit rate was 88%.
A great deal of total execution time was spent on the two innermost loop levels; loops averaged about 10 iterations per loop. The path ratio, the average number of instructions executed per iteration divided by the total number of loop instructions, was found to be lower than expected. The average path ratio was 78%, indicating that multimedia applications have a significant amount of control in their loops.
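The path ratio defined above is simple to compute from a loop trace. The following is a minimal sketch, assuming we already have the number of instructions executed in each iteration and the static instruction count of the loop body; the function name and the sample numbers are hypothetical and are not taken from the MediaBench studies.

```python
def path_ratio(executed_per_iteration, loop_instruction_count):
    """Average instructions executed per iteration divided by the
    total (static) number of instructions in the loop body."""
    if loop_instruction_count <= 0:
        raise ValueError("loop must contain at least one instruction")
    if not executed_per_iteration:
        raise ValueError("need at least one iteration")
    average_executed = sum(executed_per_iteration) / len(executed_per_iteration)
    return average_executed / loop_instruction_count


# Hypothetical loop: 50 static instructions, four traced iterations.
# Iterations execute fewer than 50 instructions because branches inside
# the loop skip part of the body.
iterations = [42, 36, 39, 39]
ratio = path_ratio(iterations, 50)
print(f"path ratio = {ratio:.0%}")
```

A ratio well below 100% means that each iteration skips a sizable part of the loop body, which is the sense in which the loops contain a significant amount of control.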
Also studied were various instruction issue mechanisms, including VLIW, in-order superscalar, and out-of-order superscalar. These experiments showed that out-of-order processors performed significantly better than either in-order or VLIW processors. A study of instructions per cycle (IPC) as a function of issue width showed that IPC flattened out relatively quickly, at an issue width of around 5, and confirmed that out-of-order was significantly better than either in-order or VLIW.
As we refine the SoC architecture, we want to evaluate our target applications on the architecture under consideration. An MPSoC for an advanced video application such as video analysis will require multiple processors; these designs are natural candidates for networks-on-chips.
Network-on-chip design requires careful evaluation of the utilization of several candidate network architectures. Because activity in advanced applications is data-dependent, trace-based evaluation is an important methodology.
We can perform some evaluation early by simulating each stage relatively independently or by executing it on a general-purpose processor. Even if this evaluation is not done with detailed timing information, it is still important in two senses.
First, it gives communication traffic patterns, which are the basis for partitioning the whole system into a collection of subsystems. Second, if a netlist only barely meets its performance target when communication delays are ignored, it will not satisfy the performance requirements once those delays are included.
Let us first use traces to evaluate the performance of several motion estimation algorithms, which would constitute one subsystem in a video compression system. Figure 14-10 below compares the performance of seven popular motion estimation algorithms, in terms of clock cycles per block matching.
Figure 14.10. Performance of several motion estimation algorithms on a number of benchmark videos.
The seven motion estimation algorithms are modified log search (MLS), diamond search (DS), four-step search (FSS), three-step search (TSS), one-dimensional full search (ODF), subsampled motion field search with alternating pixel decimation (SAPDS), and full search (FS).
Ten standard test sequences are used in this step, and they are Coastguard, Foreman, Hall Monitor, Container, Football, Garden, Mobile, Tennis, Carphone, and Claire. Sim-outorder is used to collect the timing information. It is a simulation tool that belongs to SimpleScalar, a suite of public domain simulation tools, and can generate timing statistics for a detailed out-of-order issue processor core with a two-level cache memory hierarchy and main memory.
In Figure 14-10, all the data are normalized by the simulated results of an MLS algorithm. For example, the clock cycles per block matching for FS algorithm running the Coastguard sequence is normalized by that of the MLS algorithm running the same sequence. The figure for FS is truncated, and the value is 158.3.
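The normalization step can be sketched as follows. The cycle counts are hypothetical placeholders chosen only to show the arithmetic; they are not the measured Sim-outorder results, and the function name is an assumption made for this illustration.

```python
def normalize_to_baseline(cycles, baseline="MLS"):
    """Divide each algorithm's cycles per block matching by the baseline
    algorithm's value for the same test sequence.

    cycles: {algorithm: {sequence: cycles per block matching}}
    """
    base = cycles[baseline]
    return {
        algo: {seq: value / base[seq] for seq, value in per_seq.items()}
        for algo, per_seq in cycles.items()
    }


# Hypothetical cycle counts (NOT measured data).
cycles = {
    "MLS": {"Coastguard": 2000.0, "Foreman": 2500.0},
    "FS": {"Coastguard": 300000.0, "Foreman": 400000.0},
}

normalized = normalize_to_baseline(cycles)
for algo, per_seq in normalized.items():
    for seq, value in per_seq.items():
        print(f"{algo:>4} {seq:<12} {value:8.1f}")
```

By construction, every MLS entry becomes 1.0, so each remaining bar reads directly as a multiple of the MLS cost on the same sequence.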
The traditional way to compare search speed is simply to compare the number of search points per block matching for different algorithms. However, this method is not accurate. For example, when compared in terms of search points per block matching, the speed ratio between FS and MLS is 19:1. When compared in terms of clock cycles per block matching, which directly reflects running speed, the ratio is just 15:1.
Moreover, the SAPDS algorithm, which uses 4.8 times more search points than the MLS algorithm, is the fastest algorithm when executed in the SimpleScalar simulator, as shown in Figure 14-10. This is because SAPDS uses the pixel decimation pattern when doing the criterion calculation.
The above result strongly suggests that the number of search points, although commonly used in reported work, is not a good measure of speed. For the MLS, DS, FSS, TSS, and ODF algorithms, the central search point of a step is not known until the last search point of the previous step has been checked.
Consequently, the data for the next step cannot be preloaded into the cache before that step begins. What's more, there are many branches in these algorithms to determine which location around the central search point should be used in the search.
These two aspects waste many clock cycles when doing the block motion estimation for the MLS, DS, FSS, TSS, and ODF algorithms. In contrast, the FS and SAPDS algorithms avoid these problems and spend fewer clock cycles per search point. SAPDS uses pixel decimation and spends the fewest clock cycles to finish one block matching.
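To see why pixel decimation cuts the cost of each search point, consider the matching criterion itself. The sketch below assumes a sum-of-absolute-differences (SAD) criterion and a single fixed 4:1 subsampling pattern (every second pixel in each direction). This is a simplification: SAPDS alternates between decimation patterns, and the block contents here are synthetic.

```python
def sad(current, reference, step=1):
    """Sum of absolute differences between two equally sized blocks.

    step=1 visits every pixel; step=2 visits every second pixel in each
    direction (a fixed 4:1 decimation, used here as a simplification).
    Returns (sad_value, number_of_pixel_comparisons).
    """
    total = 0
    comparisons = 0
    for r in range(0, len(current), step):
        for c in range(0, len(current[r]), step):
            total += abs(current[r][c] - reference[r][c])
            comparisons += 1
    return total, comparisons


# Synthetic 16x16 blocks: the reference differs from the current block
# by a constant offset of 2 at every pixel.
current = [[r + c for c in range(16)] for r in range(16)]
reference = [[r + c + 2 for c in range(16)] for r in range(16)]

full_sad, full_ops = sad(current, reference)
dec_sad, dec_ops = sad(current, reference, step=2)
print(full_sad, full_ops)  # 512 256
print(dec_sad, dec_ops)    # 128 64
```

With this pattern each search point costs a quarter of the pixel comparisons, so an algorithm can visit more search points than MLS and still finish a block matching in fewer cycles.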
The Global Communications bottleneck
Because global communication is the performance bottleneck in SoCs, designers must pay attention to global communication traffic and localize communication within a small part of the network as much as possible. By grouping highly cooperative IP cores into subsystems, most communication traffic is localized.
Localization is based on the traffic information gathered in the last step. As the communications are localized, the whole system is also partitioned into a collection of subsystems. Each subsystem must be small enough that data can cross it in one clock cycle.
This requirement is compatible with the communication localization requirements, and more importantly it helps to avoid asynchronous designs, which are usually very hard, within subsystems. Asynchronous design is thus confined to the global network.
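One way to picture the grouping step is a greedy clustering over the measured traffic. The sketch below is only an illustration under stated assumptions: the core names and traffic volumes are hypothetical, and a cap on the number of cores per subsystem (`max_size`) stands in for the real constraint that data must cross a subsystem in one clock cycle. It is not the partitioning method used by the authors.

```python
def group_cores(traffic, max_size):
    """Greedily merge the clusters joined by the heaviest links.

    traffic: {(core_a, core_b): volume}, treated as undirected.
    max_size: assumed cap on cores per subsystem (a proxy for the
              single-cycle crossing constraint).
    """
    cores = sorted({core for pair in traffic for core in pair})
    cluster_of = {core: frozenset([core]) for core in cores}
    # Visit links from heaviest to lightest.
    links = sorted(traffic.items(), key=lambda item: item[1], reverse=True)
    for (a, b), _volume in links:
        ca, cb = cluster_of[a], cluster_of[b]
        if ca != cb and len(ca) + len(cb) <= max_size:
            merged = ca | cb
            for core in merged:
                cluster_of[core] = merged
    return sorted({tuple(sorted(cluster)) for cluster in cluster_of.values()})


def local_fraction(traffic, clusters):
    """Share of total traffic that stays inside a single subsystem."""
    home = {core: i for i, cluster in enumerate(clusters) for core in cluster}
    total = sum(traffic.values())
    local = sum(v for (a, b), v in traffic.items() if home[a] == home[b])
    return local / total


# Hypothetical traffic volumes between six IP cores.
traffic = {
    ("A", "B"): 90, ("B", "C"): 80, ("C", "D"): 5,
    ("D", "E"): 70, ("E", "F"): 60, ("A", "F"): 2,
}
clusters = group_cores(traffic, max_size=3)
print(clusters)
print(f"{local_fraction(traffic, clusters):.1%} of traffic is local")
```

In this example only the two light links cross subsystem boundaries, so the global network carries a small fraction of the total traffic.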
After the whole system is partitioned into subsystems, each subsystem can be separately designed, and a global network, which connects all the subsystems, is also designed. Each subsystem has its own clock and forms a local network.
Busses could be the main actors in local networks because they are synchronous and work efficiently in small areas. The communication between one subsystem and another is handled by communication interfaces. The interface is an agent that translates protocols and buffers data between a subsystem and the global network.
The global network is the network-on-chip we discussed before. It connects all the subsystems and transfers data between them quickly and efficiently. Its nodes are synchronous subsystems, but the network itself works asynchronously.
This is a globally asynchronous, locally synchronous (GALS) scheme. Further verification and performance estimation can test the results of the subsystem and global network design. A problem found in this step may mean that the design must be revised, and several iterations may be needed before all the problems are solved.
One could simulate a multiprocessor system by building a simulator that models the whole system. In practice, this is usually not an effective solution. Few multiprocessor systems can be modeled by existing simulators, so developers have to build their own simulator, which is costly.
In addition, a simulator that models the details of the whole system can be very slow to execute, which lengthens development time. In our design process, we take a different approach: the whole system is evaluated hierarchically.
An architectural design is segmented into several components. Each component is evaluated by a detailed simulator. At the system level, the evaluation results and communication data between different components feed into a high-level simulator, which gives an overall evaluation of the system.
We often start the design process with a reference implementation that we can execute on a uniprocessor. We can use this uniprocessor model to generate abstract traces of system behavior.
The abstract trace will not reflect detailed architectural behavior such as timing, but it does capture the inputs and outputs of the subsystems as they interact. We can add detailed information about timing, cache behavior, and so on, a subsystem at a time.
By running a subsystem's abstract trace through a more accurate architectural simulator, we can generate additional detail as we need it without running the entire trace through a relatively slow detailed simulator.
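The idea of refining an abstract trace one subsystem at a time can be sketched as follows. The record layout, the stage names, and the timing model are hypothetical; in a real flow the timing model would be replaced by the results of a detailed architectural simulation of that stage.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TraceEvent:
    frame: int
    stage: str
    bytes_out: int                # data the stage hands to the next stage
    cycles: Optional[int] = None  # unknown until a detailed run fills it in


def refine(trace, stage, timing_model):
    """Return a new trace with cycle counts filled in for one stage only.

    timing_model: callable mapping a TraceEvent to a cycle count; a
    stand-in for running that stage through a detailed simulator.
    """
    return [
        replace(event, cycles=timing_model(event)) if event.stage == stage else event
        for event in trace
    ]


# Abstract trace from a uniprocessor reference run (hypothetical values).
abstract_trace = [
    TraceEvent(frame=0, stage="segment", bytes_out=4096),
    TraceEvent(frame=0, stage="classify", bytes_out=64),
    TraceEvent(frame=1, stage="segment", bytes_out=8192),
    TraceEvent(frame=1, stage="classify", bytes_out=64),
]

# Hypothetical timing model for the "segment" stage only.
refined = refine(abstract_trace, "segment", lambda ev: 1000 + 2 * ev.bytes_out)
for event in refined:
    print(event)
```

Only the stage under study is passed through the slow, detailed model. The other events keep their abstract form until they are needed.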
Figure 14-11 below shows the traffic as a function of time in the network-on-chip for a multiprocessor designed to support smart camera algorithms such as the Princeton gesture recognition application.
Figure 14.11. Network throughput as a function of time for a smart camera MPSoC.
The trace used to create this plot was generated in two steps: we first collected frame-by-frame results of each stage of the smart camera application; we then used SimpleScalar to simulate each stage in order to provide detailed timing information. After this, the trace files and the collected performance data are used in the Opnet simulation environment to analyze the communication costs and overall system performance.
Looking at the network activity allows us to evaluate how utilization varies as a function of time; workloads vary as each stage finishes the work on a frame at different times, depending in part on the data in that frame. We are also interested in overall performance, which we can evaluate by aggregating the trace data. Figure 14-12 below shows overall performance for several candidate architectures for the smart camera MPSoC.
Figure 14.12. Measured performance for several candidate smart camera MPSoC architectures.
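Turning a communication trace into a throughput-versus-time curve amounts to binning transfers into fixed time windows. The sketch below assumes a trace of (time in cycles, bytes transferred) pairs with hypothetical values. It does not reproduce the Opnet model used for the figures.

```python
def throughput_per_window(transfers, window):
    """Sum the bytes transferred in each fixed-length time window.

    transfers: iterable of (time_in_cycles, bytes) pairs.
    window: window length in cycles.
    Returns a list whose i-th entry covers [i * window, (i + 1) * window).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    transfers = list(transfers)
    if not transfers:
        return []
    bins = [0] * (max(t for t, _ in transfers) // window + 1)
    for t, size in transfers:
        bins[t // window] += size
    return bins


# Hypothetical network trace: (cycle, bytes).
trace = [(0, 100), (40, 300), (120, 50), (250, 200), (260, 200)]
per_window = throughput_per_window(trace, window=100)
print(per_window)                         # [400, 50, 400]
print(max(per_window))                    # peak load per window
print(sum(per_window) / len(per_window))  # average load per window
```

The per-window series shows how utilization varies over time, while the peak and average are the kind of aggregate figures that can be compared across candidate architectures.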
Although imagination and innovation have constantly pushed the envelope, leading to revolutionary ideas, new standards, novel algorithms, fancy displays, personalized services, and cooperative systems with blurring product boundaries, function and design complexity have also grown exponentially, leading to increased development costs and time to market.
It is thus becoming increasingly clear that business success, in the new era of networked multimedia systems, will be determined by the ability to offer complete system solutions based on a combination of hardware and software, the agility to react quickly to changing market conditions, the capacity to serve more markets with fewer products, and the acuity to offer the right solution for the right market at the right price.
To meet these challenges, an emerging trend is to resort to a platform-based SoC design approach that achieves functionality through co-design, shortens development time through reuse, provides flexibility through software, maintains continuity through architecture, and amortizes cost through derivatives.
To read Part 1, go to Architectural approaches to video processing.
To read Part 2, go to Optimal CPU configurations and interconnections.
To read Part 3, go to Critical Communication Bus Structures.
This series of articles is based on copyrighted material submitted by Santanu Dutta, Jenns Rennert, Tiehan Lv and Guang Yang to “Multiprocessor Systems-On-Chips,” edited by Wayne Wolf and Ahmed Amine Jerraya. It is used with the permission of the publisher, Morgan Kaufmann, an imprint of Elsevier. The book can be purchased on-line.
Santanu Dutta is a design engineering manager and technical lead in the connected multimedia solutions group at Philips Semiconductor, now NXP Semiconductor. Jenns Rennert is senior systems engineer at Micronas GmbH. Tiehan Lv attended Princeton University where he received a PhD in electrical engineering. He also has B.S. and M.Eng. degrees from Peking University. Guang Yang is a research scientist at the Philips Research Laboratories.