Great hopes are being pinned on multiprocessor and parallel architectures as the answer for the continuing development of electronics and computing. Microprocessor designers have acknowledged that they can no longer rely on higher clock rates and increasing instruction-level parallelism (ILP) for performance increases.
Absolute performance, increasing power and rising cost mean that running faster passed the point of diminishing returns long ago. Most of the industry would agree that multicore is the way forward – that the principal challenges of multicore design have been successfully overcome, and the move to practical deployment can begin.
Main multicore drivers
There are two main drivers behind the move to multicore technologies. The first is that it has become clear that "the real world is parallel." When computer scientists try to identify primitive functions which they can use as universal building blocks for more complex programs, they invariably find that these building blocks are inherently parallel processes.
Moreover, the most rapidly expanding sectors of the electronics market – media processing and data compression – are exactly the areas where this parallelism is most pronounced.
Progress has been made in helping the designer to exploit this natural confluence of application requirements and parallel architectures. Mainstream processors from Intel Corp. and Advanced Micro Devices Inc. are moving to loosely coupled dual- and quad-core designs, allowing performance gains approaching 2X and 4X without dramatically changing the programming model.
But the need for greater performance calls for increasing the number of cores – and that impacts the programming model. Multicore and parallel processing systems have traditionally been perceived as extremely hard to program, requiring special-purpose tools (Figure 1, below) and expert knowledge, and this has been the clichéd reason why multicore processors have historically failed.
Figure 1. The ability of tools to automatically configure low-level details of the parallel elements, to automatically allocate tasks to cores, and to configure the interconnect is critical.
However, products such as multicore DSPs can be configured and programmed using standards-based tools that are intuitively understood by chip designers and programmers.
The second driver for the adoption of multicore techniques is the slowdown of advances in uniprocessor development, despite clock rates currently standing at 3GHz and transistor counts moving into the hundreds of millions.
From 1986 to 2002, microprocessor performance rose by 52 percent per year, effectively doubling every 18 months. By 2006, the rate of advancement had slowed to less than 20 percent per year, so that today, performance doubling may be taking as much as five years.
There are many reasons for this slowdown. For a start, system architects are no longer able to wring further gains from ILP techniques. At their most basic, these involved tricks such as simple instruction pre-fetches.
But they now encompass very complex techniques such as out-of-order execution and branch prediction. In many cases, the increased complexity outstrips the performance gained. Replacing ILP with task-level and word-level parallelism is the only way to reap further gains.
Power is another area of uniprocessor development where the equation has changed. In sub-90nm processes, active power density – already reaching the 100 W/cm² levels found in a nuclear reactor and soon to rise to the 1,000 W/cm² found in space rocket nozzles – is not the only limiting factor. Static power due to leakage current can now represent up to 40 percent of a chip's total power dissipation.
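The energy case for parallelism can be made quantitative with the classic CMOS dynamic-power relation P = αCV²f. A minimal sketch, using illustrative numbers that are assumptions rather than figures from the article: spreading the same aggregate clock rate over two slower cores permits a lower supply voltage, and the quadratic voltage term does the rest.

```python
# Illustrative sketch of the textbook CMOS dynamic-power model
# P = alpha * C * V^2 * f. All numbers below are assumed for
# illustration, not taken from the article.

def dynamic_power(alpha, capacitance_f, voltage_v, freq_hz):
    """Dynamic switching power: activity factor * capacitance * V^2 * f."""
    return alpha * capacitance_f * voltage_v ** 2 * freq_hz

# One core at 3 GHz needing, say, 1.2 V ...
single = dynamic_power(0.2, 1e-9, 1.2, 3e9)

# ... versus two cores at 1.5 GHz each; the lower clock permits a
# lower supply voltage (say 1.0 V) at the same aggregate cycles/s.
dual = 2 * dynamic_power(0.2, 1e-9, 1.0, 1.5e9)

print(dual < single)  # same total clock throughput, less power
```

The point of the sketch is only the shape of the trade-off: frequency enters linearly but voltage quadratically, so "more cores, each slower" wins on energy.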
Parallel architectures solve the power problem on more than one level. First, they have been shown to be inherently energy-efficient methods for performing a given function, especially if they are made up of diverse function blocks designed with particular applications in mind.
But further, a fine-grained multicore architecture lends itself naturally to modern power management techniques such as clock-gating and localized power-down: any element that is not actively processing can be temporarily shut down.
This makes it possible to deal more intelligently with both the active power consumption and the static leakage current problems that occur with modern manufacturing processes. Multicore devices also help solve another problem inherent in the use of advanced semiconductor processes: devices are becoming less, not more, reliable.
At 65 and 45 nanometers, in particular, "pass-fail" methods are giving way to statistical assessments of performance. In addition, devices made with such processes are more prone to hard and soft errors.
Multicore architectures lend themselves naturally to redundant design techniques – familiar for some time in memory production – which allow out-of-specification or faulty sections of the device to be shut down. One microprocessor maker already sells four-core, six-core and eight-core versions of one of its chips, all based on a single eight-processor design.
Power and statistically varying performance have also had an indirect impact on the recent progress of uniprocessor systems, by inhibiting chip-makers' ability to gain performance via increased clock speeds. The current maximum of 3GHz has proved to be a practical ceiling to the clock-speed ratchet that worked for processor manufacturers from 1979 onwards.
Parallelism, however, holds out the promise of restoring the benefits of successive process shrinks by enabling manufacturers to double the number of standard cores on each chip with every process generation.
Figure 2. Multicore and parallel processing systems have traditionally been perceived as extremely hard to program, requiring special-purpose tools and expert knowledge.
Making these extra cores do real work, of course, is a question of designing an appropriate architecture, and this depends at least as much on the inter-processor communications infrastructure as on the design of the computing elements themselves. It is equally dependent on the design flow and programming tools, which must support a range of array sizes within a single environment (Figure 2, above).
The ability of tools to automatically configure low-level details of the parallel elements, to automatically allocate tasks to cores, and to configure the interconnect is critical. A usable multicore design environment enables the programmer to focus attention on the design elements themselves, not on the precise details of exactly how they are implemented.
This is in contrast to the FPGA, where engineers must deal with timing closure and the details of behavioral synthesis.
These issues are bound up with another sea change that has taken place in computing over the last few years – the performance of many processes is now limited by their ability to move data, not just by computing horsepower.
A DRAM access may take 200 clock cycles, whereas a floating-point multiply can often be achieved in four cycles. The cost is not measured just in time: using local registers is an order of magnitude more energy-efficient than resorting to global memory accesses, which, in an energy-constrained environment, can be key.
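The two cycle counts quoted above already imply how compute-heavy a task must be before memory stops dominating. A quick back-of-envelope check (the 200- and 4-cycle figures come from the paragraph; the framing is mine):

```python
# Back-of-envelope using the cycle counts quoted in the text:
# a 200-cycle DRAM access versus a 4-cycle floating-point multiply.
# How many multiplies of useful work are needed to cover the latency
# of a single DRAM access?

DRAM_ACCESS_CYCLES = 200
FP_MULTIPLY_CYCLES = 4

multiplies_per_access = DRAM_ACCESS_CYCLES / FP_MULTIPLY_CYCLES
print(multiplies_per_access)  # 50.0
```

Unless each operand fetched from external memory feeds dozens of operations, the processor spends most of its cycles waiting – which is exactly why the balance of registers, local and global memory discussed next matters so much.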
The designer of a multicore system, therefore, can solve many problems by choosing the right communications architecture and implementing the correct balance of registers, local and global memory resources. An efficient communication fabric can even substitute for memory accesses by allowing core-to-core data transfer.
Individual tasks are assigned to processors on a one-to-one basis, and the processes within each processor are themselves programmed in standard C.
The interconnections are configured by the engineer, allowing optimization of communications according to the needs of the specific application. In essence, the programming model is that of a block diagram, where each block is self-contained and connected via explicitly defined signals.
The academic paradigm is that of "communicating sequential processes."
One of the key bottlenecks in some architectures has been the bandwidth of the interconnect or, more subtly, the allowed complexity of signal flows. One example is nearest-neighbor connectivity, which very quickly limits processor usage.
In the case of the picoArray architecture, each array includes a square mesh of 32-bit communications links that incorporate switch matrix elements at the junctions between the horizontal and vertical lines. Each execution element has multiport access to the mesh. By defining the state of the switch matrix at compile time, the mesh can be configured to allow any communication between elements, including multi-way structures such as fan-out and fan-in.
This approach provides dedicated, deterministic communication between elements, with each viewed as running separate processes. Since the elements behave like "producers and consumers," automatically waiting until a result is valid when they encounter dependencies, they can be treated from a programming viewpoint as asynchronous function calls.
There is no need for any form of bus arbitration, reducing communications overhead in terms of silicon area and speed of program execution. This combination of communication resources and design infrastructure means that tasks can be programmed, verified and debugged in a modular fashion, in the knowledge that – as the system is integrated – its parts will continue to operate exactly as they did when verified alone.
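The producer-consumer behaviour described above can be sketched in software terms: each element blocks until its input is valid, so no arbitration or explicit synchronization code is needed. A minimal illustration using Python threads and a bounded queue standing in for a dedicated point-to-point hardware link (names and sizes are illustrative assumptions, not picoChip's API):

```python
# Sketch of the producer/consumer discipline described in the text,
# with a one-deep bounded queue standing in for a dedicated hardware
# link. Illustrative only -- not picoChip's actual tools or API.
import queue
import threading

link = queue.Queue(maxsize=1)  # one-deep channel: put() blocks until taken
results = []

def producer():
    for sample in range(5):
        link.put(sample * sample)  # blocks until the consumer is ready

def consumer():
    for _ in range(5):
        results.append(link.get())  # blocks until a result is valid

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 1, 4, 9, 16]
```

Because each side simply waits on the channel, the two processes can be written, tested and composed independently – the software analogue of the modular verification property claimed for the hardware mesh.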
New model of abstraction
The final problem that multicore architectures have been called upon to solve is a human one. For many years, chips have been so large as to make it impossible to understand – and therefore design or use – them from the ground up.
Designers and programmers have relied on increasing levels of abstraction in their understanding. At 65nm and below, however, this has become all but impossible.
Signal integrity, clock jitter and many other small-scale constraints have come into play that make it impossible to found the design of a new, larger chip on the groundwork and abstractions built from a previous generation.
Multicore architectures provide a new model of abstraction that allows engineers to make use of the enormous numbers of transistors on offer in sub-90nm chips. Moreover, design, verification and validation can all become easier when dealing with smaller sub-units, as long as the design is then "scaled up" via a well-designed communications infrastructure.
Peter Claydon is Co-Founder and Chief Operating Officer, picoChip Designs Ltd.