
Can CUDA language open up parallel processing?

LONDON — More universities should provide courses on programming for massively parallel computing, and more graphics processing unit (GPU) vendors should look at enabling the use of the Compute Unified Device Architecture (CUDA) programming language on their devices, according to David Kirk, chief scientist at Nvidia Corp. (Santa Clara, Calif.).

Nvidia developed CUDA to run on its own GPUs and has been finding remarkable traction for the language in applying those GPUs to finance and other multivariate analyses. Indeed, some observers have argued that, at the very least, CUDA is starting to influence academic debates over parallel processing languages and hardware architectures.

“Massively parallel computing is an enormous change and it will create drastic reductions in time-to-discovery in science because computational experimentation is a third paradigm of research – alongside theory and traditional experimentation. If you can make one of these paradigms much faster it will have a tremendous impact,” said Kirk, speaking in a lecture in front of an invited audience at Imperial College, London.

And Kirk believes that massively parallel computing has another impact in that it can democratize supercomputing. “For $2000 a teraflop anyone can have a single-precision supercomputer in their PC. Very soon we will have double-precision floating point designs, and I predict that within less than two years you will see petaflop clusters of double-precision floating point designs for under $5 million, and this is really mass-market supercomputing. It takes cost, the main barrier, out of computing for science. This is a once-in-a-career opportunity for people in the computing industry.”

Kirk concedes there is a lot more to be done. “We need much more research in parallel computing models. CUDA is great but it is not the final answer. We need more research in parallel architecture and hardware now that we realize that everything has to be parallel.”

Kirk has been teaching a class at the University of Illinois for a couple of semesters aimed not specifically at computer scientists but at scientists and engineers who need to use computers and can benefit from an experimental course in programming massively parallel processors for general computation.

At present around 100 universities worldwide provide courses based on the material used at Illinois (see here for more info), but, said Kirk, “We need to teach massively parallel computing to everybody in science, not just computer scientists and electrical engineers.”

Nvidia released a CUDA software development kit in 2007 to enable the development of algorithms in C that run both GPU and CPU functions on its processors. This enabled non-graphics parallel computation, as well as traditional graphics-rendering computations, to run on these processors.
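As a rough illustration of the programming model that kit introduced, the sketch below is a minimal CUDA C program with a GPU function (the __global__ kernel) and ordinary CPU code in a single source file. The vector-add problem, names, and sizes are illustrative assumptions, not details from the article.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// GPU function ("kernel"): each of n threads adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) buffers.
    float *a = (float *)malloc(bytes);
    float *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Device (GPU) buffers.
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    // Launch n threads in blocks of 256; the hardware schedules the blocks.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);  // expect 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(a); free(b); free(c);
    return 0;
}
```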

All Nvidia GPUs introduced since the fall of 2006 have supported CUDA, with 70 million devices shipped so far, and by the end of this year around 200 million CUDA-enabled GPUs will be in consumer and professional computers. Nvidia has even released products built mainly for computation rather than graphics, with only a PCI-Express interface, such as the Tesla line, alongside the Quadro family of cards and workstations for computer-aided design, digital content creation, and visualization applications.

CUDA has already been used in commercial applications related to oil and gas exploration, computational finance and other computational modeling projects, as well as for faster, more powerful hybrid rendering within graphics.

Evolved Machines is a Stanford University startup that uses neural network simulation to model electrochemical nerve cells for a number of applications, including olfactory sensory processing and visual object recognition. It is using the CUDA language running on Nvidia GPUs to achieve a more than 100x speedup in computation.

Hanweck Associates (New York, NY), a consulting firm specializing in investment and risk management for financial institutions, has used Nvidia Tesla products and says some of its clients are achieving a 100x speedup of Monte Carlo simulations compared with single-core processing, which helps traders see what is happening in the markets.

“In this area time really is money,” said Kirk. “Designs move from research to commercial implementation very quickly. Hanweck has built an engine that does real-time calculations of the volatility of the entire U.S. equity options market and can provide an update in under a second.”

Kirk explained that in astrophysics research much of what can be done is possible only through computational experiments. General-purpose CPUs, even with gigahertz clock frequencies, have trouble providing gigaflop performance, whereas an Nvidia GeForce 8800 can provide more than 300 Gflops.

“This is faster than the GRAPE-6AF custom supercomputer that was built in Japan to handle this type of application. In this field if you can calculate things 300 times faster you are going to do all the ‘big science,’ get in all the publications and learn everything first. Within six months of the first paper being published based on using a GPU there was an international conference so that everyone could learn how to use the technology, and now it has pretty well taken over this area,” said Kirk.

Nvidia is doing its best to make CUDA more open by providing code downloads and by producing a tool set built on Open64, the open research compiler, and the company is in the process of making much of the CUDA language itself open source, said Kirk.

“We are doing virtually everything we can to make it a standard.” At present, GPUs from rival company ATI cannot run C and hence cannot run CUDA, said Kirk. “Their hardware cannot share data between different threads and I would encourage them to implement hardware that can do that. If they did we could port CUDA to run on their processors.”

In addition, the definition of the CUDA language does not preclude it from running on more general-purpose multicore CPUs, said Kirk. Several universities have run projects in which they took CUDA source code and compiled it to run on multicore CPUs. At present most multicore CPUs run such code much more slowly than GPUs do, because they do not have enough arithmetic logic units (ALUs). But the interesting thing, explained Kirk, and one he was not expecting, is that in almost all cases the CUDA version of the code running on the CPU scales better across multiple processors than a hand-coded multicore version of the same program.
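As a sketch of how such a translation can work, the hypothetical function below recasts the vecAdd kernel from the earlier example for a multicore CPU: each CUDA thread block becomes one iteration of an outer loop that an OpenMP runtime can spread across cores. The explicit block structure is one plausible reason the translated code scales so predictably; this is an assumed illustration of the general idea, not code from the university projects Kirk mentioned.

```c
// Hypothetical CPU translation of the vecAdd kernel shown earlier.
// Each CUDA block becomes an outer-loop iteration that one CPU core
// can run; the threads within a block become an inner serial loop.
void vecAdd_cpu(const float *a, const float *b, float *c, int n,
                int gridDim_x, int blockDim_x)
{
    #pragma omp parallel for            /* one block per CPU worker */
    for (int blockIdx_x = 0; blockIdx_x < gridDim_x; ++blockIdx_x) {
        for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
            int i = blockIdx_x * blockDim_x + threadIdx_x;
            if (i < n)
                c[i] = a[i] + b[i];
        }
    }
}
```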

“CUDA parallelization does a better job at linear scaling than people hand-coding for multicore,” said Kirk. “I expect that you will see products that will allow CUDA to run on multicores as well as enabling load balancing across the two types of processor because what you really want to do is use all the processors in your system.”

So why wouldn’t you merge the GPU and CPU into one piece of silicon?

“A hybrid device would do both kinds of task poorly. A device with a mix of GPU and CPU cores would be a great product for the low end, where you do not care about performance, but most of our customers want more computing power, so I wouldn’t want to take any of my silicon and use it for a CPU.”

GPU computing with CUDA is an approach in which hundreds of on-chip processor cores simultaneously communicate and cooperate to solve complex computing problems, up to 100 times faster than traditional approaches.

CUDA avoids the limitations of traditional GPU stream computing by enabling GPU processor cores to communicate, synchronize, and share data.

CUDA-enabled GPUs offer dedicated features for computing, including the parallel data cache, which allows the 128 processor cores in the latest Nvidia GPUs, running at 1.35 GHz, to cooperate with each other while performing intricate computations. Developers access these features through a separate computing driver that communicates with DirectX and OpenGL, and through the Nvidia C compiler for the GPU, which removes the need for streaming languages in GPU computing.
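A minimal sketch of what that cooperation looks like in CUDA C: the kernel below uses the on-chip shared memory (the parallel data cache) and __syncthreads() barriers so the threads of a block can pool partial sums, something a pure streaming model does not allow. The reduction shown is a textbook pattern, assumed for illustration rather than taken from the article.

```cuda
// Each block sums 256 input elements through the on-chip shared memory
// ("parallel data cache"), so threads can communicate and synchronize.
// Launched as blockSum<<<blocks, 256>>>(d_in, d_out, n), following the
// host-side pattern shown in the earlier example.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float cache[256];          // per-block on-chip storage
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    cache[tid] = (i < n) ? in[i] : 0.0f;  // each thread loads one element
    __syncthreads();                      // wait until all loads finish

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                  // data is shared, so re-sync
    }

    if (tid == 0)                         // thread 0 writes the block total
        out[blockIdx.x] = cache[0];
}
```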

A CUDA-enabled GPU operates either as a flexible thread processor, in which thousands of computing programs called threads work together to solve complex problems, or as a streaming processor in specific applications, such as imaging, where threads do not communicate. CUDA-enabled applications use the GPU for fine-grained, data-intensive processing and the multicore CPU for coarse-grained tasks such as control and data management.
