Embedding HPC: A rocket in your pocket - Embedded.com

New embedded processors, single-board computers, and software development tools are enabling supercomputing-like applications on embedded systems. Here are a few recent advances in HPC for embedded systems.

High performance computing (HPC) refers to running large, compute-intensive applications that apply complex numerical algorithms, typically in areas such as image processing and simulation. Some HPC applications include computational fluid dynamics, weather modeling, and circuit simulation. Getting a computer to emulate something in the real world requires a tremendous amount of computational power. Historically, if you were to actually see a supercomputer, you would find it housed in a large facility with rows and racks of blinking lights, heavy air conditioning, and maybe even a water-based cooling system.

Today, with the introduction of more compact and more powerful embedded processors, embedded systems are becoming HPC capable. For many years, processor evolution followed Moore's Law. That is, gate counts, and for a long time clock speeds along with them, were doubling every 18 months or so. But lately, processor speeds haven't moved from two gigahertz to six, seven, or eight gigahertz. Instead, what we're seeing is a growing number of cores per processor. There's a shift from the evolution of the individual processor “cell” along the lines of Moore's Law to the evolution of a fabric or “organism” composed of multiple processing cells. As a result, the latest laptops have four- or even eight-core processors, and perhaps some type of graphics processor built in as well. Figure 1 depicts the shift in evolution from processor core technology to processor fabrics.

Traditionally, we associate embedded systems with microcontrollers. These systems conduct basic functions such as blinking lights, turning on relays, checking to make sure the thermostat is on, or turning on the air conditioner. But with the recent advances in computing platforms, it's now possible to run far more sophisticated software, enabling exciting new applications. For the U.S. military, HPC enables airborne vehicles to navigate without pilots. In the automotive sector, HPC serves as a key building block in the development of anti-collision systems. These systems can actually look out over the road, collect data on what's going on around the vehicle, disengage the cruise control, or sound an alert to the driver, depending on the severity of the event.

HPC is all about data processing using complex numerical algorithms to convert continuous real-time sensor data streams into images or actionable information. In home security, imagine a system that actually recognizes you, even if you forget to turn off the alarm, and says “I have a positive ID that it's you entering the house,” and therefore does not sound the alarm. In the healthcare industry, medical diagnostic imaging is a rapidly growing field. Large MRI and CT body imaging scanners take pictures and try to identify problem areas inside the body. Most of this is done by image processing: electronic signals collected by the scanners are processed and formulated into an image that can highlight a problem area. In the past, a patient would wait a week for results; they're now available the next day. The medical industry is interested in making HPC devices even more portable, and possibly using them in the operating room as real-time tools to help guide surgery.

Scientists on the University of Manchester's SpiNNaker (spiking neural network architecture) project (http://apt.cs.man.ac.uk/projects/SpiNNaker/) are modeling the human brain to study the complex interactions and fault tolerances of neurons. They're interconnecting 1,000 processor cores to model a million neurons, which is still only about one hundred-thousandth of the human brain, which has about 100 billion neurons. Work like this would not be possible if HPC hadn't evolved so quickly in the last couple of years.

Processor advances
All of this is possible due to advancements in processing hardware. What we're seeing now is what the military and aerospace community calls commercial off-the-shelf or COTS, which usually connotes commodity-type devices, in this case devices capable of high-performance computing. Companies like Intel, Freescale, NVIDIA, Xilinx, and TI are creating an explosion of new devices targeted at HPC applications. Intel recently introduced its new multicore Sandy Bridge class of devices (2nd Generation Intel Core processors) with vector math extensions called Advanced Vector Extensions (AVX). In the same timeframe, Intel has also introduced its new Many Integrated Core (MIC) processor architecture. Code named “Knights Corner”, this architecture supports the interconnection of 50 Larrabee class cores. Freescale recently introduced a new generation of high-end multicore PowerPC chips called the QorIQ AMP Series, with a re-introduction of an improved AltiVec vector processing accelerator. The new QorIQ architecture can support up to 24 virtual cores per chip.

In addition to these multicore plus vector accelerated CPU style processors, we're seeing several new and interesting devices. NVIDIA, the folks who make blindingly fast dedicated graphics processors that draw high-resolution pictures on your laptop, has decided to convert some of its graphics chips into general purpose processors called general purpose graphics processing units (GP-GPUs). NVIDIA has recently released its latest set of GP-GPU devices that potentially have compute capability in the teraflop (trillion floating-point operations per second) range.

Even FPGA companies like Xilinx and Altera are getting into the HPC act by introducing new devices that combine FPGA fabrics and multicore CPUs. Devices such as the new Zynq-7000 from Xilinx offer a dual-core ARM Cortex-A9 processor with the NEON vector accelerator, plus an extensive FPGA fabric that gives developers the ability to roll their own custom hardware accelerator for specialized HPC applications. Traditional HPC engines, like DSP processors, are not going away. A quick look at TI's latest offerings reveals the Integra line, which features a C6x DSP core and an ARM Cortex-A8 CPU on the same chip.

Clearly the number and diversity of new products entering the market today are making a rich array of devices available to the embedded developer. There is a tremendous amount of embedded computing horsepower out there that dwarfs what was once available in those oversized supercomputing rooms just a few years ago.

This sounds like a wonderful new world: a plethora of powerful computing devices you can just hook together to meet your HPC needs. The problem is that programming some of these specialized devices, like DSPs or GP-GPUs, for maximum performance is not at all trivial. Things are further complicated by the fact that modern HPC platforms are likely to contain a heterogeneous mixture of processing cores, such as an FPGA, a GP-GPU, or a DSP combined with a single or multicore CPU. Programming such a heterogeneous compute platform is even more difficult. As a result of this complexity, embedded software development now typically accounts for well over 50% of the entire system cost.

In addition to achieving maximum compute performance, the other key problem that developers face is portability. If project leaders are going to have their software teams develop applications that are capable of image recognition, they don't want to reinvent that work every time a new platform is announced. Yet traditionally, to get high-performance code to run very fast, especially on small embedded systems, the order of the day was hand coding in assembly language, working laboriously down at the bit level to make sure the code ran as fast as possible. That kind of labor, especially on more sophisticated applications, is prohibitive: it is very expensive and not at all portable. Anyone who invests in this space by building new applications will want those applications to be portable and to map easily to any new hardware. This is especially true nowadays, when it seems that a new type of device is announced every six months. So portability means, at a minimum, future-proofing software development against the roadmap of the hardware vendor. Because software costs are rising, anything that helps software developers be more productive and gets them out of the business of hand tuning low-level routines will reduce development costs and allow more time to concentrate on high-level features.

Two key components are required to make this happen. One is a good interface between the hardware and software development; this often takes the form of hardware-specific, low-level libraries. These low-level libraries have to be tuned specifically for maximum performance on a given processor. They're basically the functions that enable a particular piece of silicon to execute a simple instruction like “add these two numbers” and do it as quickly as possible, leveraging powerful accelerators that might be available for that given data type.

But that's still not enough. At some point, embedded systems developers have to compose an image recognition system out of a series of add, multiply, subtract, and divide operations while simultaneously handling all the data details such as various data types, storage location, storage arrangement in memory (called stride), and memory copy operations to ensure the right data is available where and when it's needed. Ideally, you would have a high-level portable library that provides high-level function abstraction and data encapsulation that sits atop those low-level hardware-specific libraries. Such an approach basically allows you to write algorithms in software very much as you would in a specification document or on a chalkboard.
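To make the idea of stride concrete, here is a minimal sketch in Python (standard library only). It is purely illustrative: the `StridedView` class and the `row`, `column`, and `dot` helpers are hypothetical names invented for this example, not part of any library mentioned in this article. The point is that a row and a column of the same image are just two different (offset, stride) walks over one flat buffer, so no data needs to be copied to hand either one to a numerical routine.

```python
class StridedView:
    """Read-only view of `length` elements of a flat buffer, `stride` apart."""

    def __init__(self, buffer, offset, stride, length):
        self.buffer = buffer
        self.offset = offset
        self.stride = stride
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Element i lives `stride` slots after element i - 1.
        return self.buffer[self.offset + i * self.stride]

    def __iter__(self):
        return (self[i] for i in range(self.length))


# A 3x4 image stored row-major in one flat buffer (values 0..11).
ROWS, COLS = 3, 4
pixels = list(range(ROWS * COLS))


def row(r):
    # Consecutive elements of a row are adjacent in memory: stride 1.
    return StridedView(pixels, r * COLS, 1, COLS)


def column(c):
    # Consecutive elements of a column are one full row apart: stride COLS.
    return StridedView(pixels, c, COLS, ROWS)


def dot(a, b):
    # Works on any two views of equal length, whatever their strides.
    return sum(x * y for x, y in zip(a, b))
```

With this layout, `list(column(1))` walks elements 1, 5, and 9 of the buffer, while `list(row(2))` walks elements 8 through 11. A high-level library hides exactly this kind of bookkeeping behind its data types.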

For example, a convolution algorithm, often used in image processing, requires taking the Fast Fourier Transform (FFT) of the input data, multiplying that result by a weighting vector, and taking the inverse FFT to produce the end result. Ideally, this would be expressed in only a few lines of code that works for different data types. If you try to do that in assembly language and deal with all the computation and data manipulation that has to happen, you might end up with several hundred or even a thousand lines of code.
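As a rough illustration of how compact that high-level form can be, the sketch below expresses the three steps (forward FFT, multiply by a weighting vector, inverse FFT) in Python using only the standard library. It is not VSIPL++ code and makes no claim about that library's API; the function names are assumptions made for this example, and the small recursive radix-2 FFT stands in for the tuned FFT a real low-level library would supply. It assumes the input length is a power of two.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled). len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT: unscaled inverse transform divided by the length."""
    n = len(spectrum)
    return [v / n for v in fft(spectrum, inverse=True)]


def fft_filter(data, weights):
    """The three steps from the text: FFT, weight, inverse FFT."""
    if len(data) != len(weights):
        raise ValueError("data and weights must have the same length")
    spectrum = fft(data)
    weighted = [s * w for s, w in zip(spectrum, weights)]
    return ifft(weighted)


def circular_convolve(data, kernel):
    """Circular convolution: the weighting vector is the kernel's FFT."""
    return [v.real for v in fft_filter(data, fft(kernel))]


# Each output sample is the current input plus the previous one (wrapping).
result = circular_convolve([1, 2, 3, 4], [1, 1, 0, 0])
# result is approximately [5.0, 3.0, 5.0, 7.0]
```

The application-level logic is the three lines inside `fft_filter`; everything else is the kind of supporting machinery that a library would normally provide and tune per platform.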

When you get right down to it, it's all about having an optimum software stack riding on top of the hardware, as shown in Figure 2. If you view the processor as the lot, then the low-level, hardware-specific library is the foundation: the bricks that are going to hold up the house. To carry the analogy further, we then have to build the house itself, and while we could write a lot of our own software, we'll be more efficient if we use a high-level library such as Sourcery VSIPL++ from Mentor Graphics, which moves the level of abstraction from that low-level library much closer to the realm of our application. This helps embedded systems developers get the extra performance, portability, and productivity they need. The two key features that enable a high-level library such as VSIPL++ to provide portability and performance are a standard function API and a built-in optimized dispatch capability that maps high-level calls to the best-performing low-level resource. Figure 3 shows the architecture of a full-featured, high-level HPC library.
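The dispatch idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how VSIPL++ or any other library implements dispatch: the registry, the `register`, `select`, and `dispatch` functions, and both "backends" are hypothetical. The blocked routine merely stands in for a hardware-specific kernel that only applies when the data meets its constraints.

```python
_BACKENDS = {}  # operation name -> list of (priority, accepts, implementation)


def register(op, priority, accepts):
    """Register an implementation of `op`, valid when accepts(*args) is true."""
    def decorator(fn):
        entries = _BACKENDS.setdefault(op, [])
        entries.append((priority, accepts, fn))
        entries.sort(key=lambda entry: -entry[0])  # highest priority first
        return fn
    return decorator


def select(op, *args):
    """Return the highest-priority implementation that accepts the arguments."""
    for _priority, accepts, fn in _BACKENDS.get(op, []):
        if accepts(*args):
            return fn
    raise NotImplementedError(f"no backend for {op!r} accepts these arguments")


def dispatch(op, *args):
    return select(op, *args)(*args)


@register("vadd", priority=0, accepts=lambda a, b: len(a) == len(b))
def vadd_generic(a, b):
    # Portable fallback: works for any pair of equal-length vectors.
    return [x + y for x, y in zip(a, b)]


@register("vadd", priority=10,
          accepts=lambda a, b: len(a) == len(b) and len(a) % 4 == 0)
def vadd_blocks_of_four(a, b):
    # Stand-in for a vector-unit kernel that needs lengths divisible by four.
    out = []
    for i in range(0, len(a), 4):
        out.extend((a[i] + b[i], a[i + 1] + b[i + 1],
                    a[i + 2] + b[i + 2], a[i + 3] + b[i + 3]))
    return out


def vadd(a, b):
    """The standard, portable entry point that application code calls."""
    return dispatch("vadd", a, b)
```

Application code only ever calls `vadd`. A four-element input is routed to the blocked routine and a three-element input falls back to the generic one, yet the caller sees the same result either way. Porting to new hardware then means registering another backend rather than rewriting the application.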

When looking at HPC, it's both the hardware and the software that enable smarter, more powerful devices at lower costs. It wasn't too long ago that the black plastic phone was the standard in telecommunications. We now have amazing devices such as the Android phone and Apple iPhone. If we extrapolate that evolution over the next dozen years or so, we will likely see an explosion of compute-intensive applications on a variety of low-cost platforms that will blow our minds with what they're capable of doing. Even better, these devices will be available for all to enjoy.

Pete Decher is the business development manager for High Performance Computing Solutions at Mentor's Embedded Software Division. Pete has over 30 years of experience in software development, electronic system design and test. 

This article provided courtesy of Embedded.com and Embedded Systems Design Magazine. Copyright © 2011 UBM. All rights reserved.
