CMP EMBEDDED.COM


MIPS uses "Virtual CPU" design to blunt rush to multicores



Embedded.com
Mountain View, Ca. - MIPS Technologies introduced this week its next generation MIPS34K core with a new "virtual CPU" architecture that it believes can reduce or delay the need to move to more complex multicore designs in a variety of multimedia-based consumer devices, home entertainment systems and network applications.

Essentially a superset of the previous MIPS24KE with DSP extensions, the 90-nanometer 500 MHz 32-bit MIPS34K core is an innovative approach that incorporates several hardware ‘virtual processing elements’ and an optional quality-of-service (QoS) logic block for real-time deterministic operation.

According to Vivek Sardana, MIPS34K product marketing manager, the combination should make possible as much as a two-fold performance improvement over the current MIPS24KE core in embedded consumer applications requiring a mix of DSP and RISC operations and the use of more than one operating system.

Darren Jones, MIPS34K engineering director, said that in company tests using several EEMBC benchmarks, the 500 MHz virtual CPU-based MIPS34K, running the benchmarks in parallel, ran 60 percent faster than an earlier-generation 625 MHz MIPS24KE running them sequentially. That speedup was achieved with just two threads and with little effect on the caches.
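Taken at face value, those figures imply a larger gain per clock cycle than 60 percent, because the faster core was also running at the lower frequency. The following is a minimal sketch of that arithmetic, assuming "60 percent faster" means 1.6 times the benchmark throughput (the article does not say how the figure was measured). The function name is illustrative only.

```python
def per_clock_speedup(throughput_ratio: float, new_mhz: float, old_mhz: float) -> float:
    """Return the throughput gain per clock cycle.

    throughput_ratio is new throughput / old throughput; dividing out the
    clock ratio (new_mhz / old_mhz) leaves the work done per cycle.
    """
    if new_mhz <= 0 or old_mhz <= 0:
        raise ValueError("clock rates must be positive")
    return throughput_ratio * old_mhz / new_mhz


# Figures quoted in the article: 1.6x overall, 500 MHz 34K vs. 625 MHz 24KE.
# The 1.6x reading of "60 percent faster" is an assumption.
print(per_clock_speedup(1.6, new_mhz=500.0, old_mhz=625.0))
```

Under that assumption the 34K does roughly twice the work per clock cycle, which is in line with the "as much as two-fold" improvement Sardana cites.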

To achieve this, the company has taken an approach radically different from those of other processor core makers: Atmel, which optimizes the instruction set and pipeline for multimedia applications; Analog Devices, with its hybrid DSP/RISC design; and ARM, with the multicore approach taken in its new MPCore synthesizable multiprocessor.

Kenton Williston, DSP analyst at Berkeley Design Technology, Inc., said that MIPS has come up with an approach that minimizes the effect of the one major bottleneck plaguing almost every microprocessor designed: the inherent inefficiency of the pipeline with regard to thread misses caused by memory latencies, pipeline stalls and other factors.

“A fact of life for most microprocessor architects is that instructions are not normally issued for each and every cycle,” said John Carbone, vice president of marketing at RTOS vendor Express Logic. “In the real world, a lot of time is wasted on cycles executing with no data available because a cache line is loading or the CPU is fixing a cache miss.”

Processor designers have always been aware that they were not using the pipeline efficiently and have constantly worked on ways to keep it full so that there are no delays. But in many mobile and embedded consumer applications, the usual remedies are too costly in terms of silicon: multiple large multilevel caches that store data temporarily for delivery to the pipeline, or two-, three-, or four-way set-associative cache designs. The first requires costly silicon for additional memory elements, while the second requires additional logic for the flexible cache structures.

To recapture those missed cycles with a minimum of extra silicon, the MIPS multithreading architecture maintains multiple contexts in hardware, so that when a cycle would otherwise be missed, the processor can switch to another context and make use of the empty slot in the processor pipeline.
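The effect is easiest to see in a toy model. The sketch below is not a model of the 34K pipeline. It assumes a single-issue pipeline, a fixed miss penalty, and a miss after every Nth instruction, and every name and number in it is hypothetical. It only illustrates how a second hardware context can fill slots that would otherwise sit empty.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """One hardware thread context in the toy model (hypothetical fields)."""
    remaining: int     # instructions still to issue
    miss_every: int    # a miss occurs after every Nth issued instruction
    miss_penalty: int  # cycles the context is blocked after a miss
    issued: int = 0
    ready_at: int = 0  # first cycle at which this context may issue again


def run(contexts: list[Context]) -> tuple[int, int]:
    """Issue at most one instruction per cycle from the first ready context.

    Returns (total_cycles, idle_cycles).
    """
    cycle = idle = 0
    while any(c.remaining for c in contexts):
        ready = [c for c in contexts if c.remaining and c.ready_at <= cycle]
        if ready:
            c = ready[0]
            c.remaining -= 1
            c.issued += 1
            if c.issued % c.miss_every == 0:
                # Block this context; another one may use the following slots.
                c.ready_at = cycle + 1 + c.miss_penalty
        else:
            idle += 1  # no context is ready, so the issue slot is wasted
        cycle += 1
    return cycle, idle


# Two identical workloads, run one after the other on a single context...
sequential = sum(run([Context(100, 4, 4)])[0] for _ in range(2))
# ...versus run together on two hardware contexts.
parallel, idle = run([Context(100, 4, 4), Context(100, 4, 4)])
print(sequential, parallel, idle)
```

The workloads here are chosen so that each context's stalls line up exactly with the other's work, which is the best case. With longer penalties or fewer independent threads, some slots stay empty.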

In the MIPS34K core, there are two virtual processing elements (VPE0 and VPE1), containing a total of five Thread Context (TC) blocks. Jones said the best way to describe a VPE is as an instantiation of the OS-visible state of the MIPS32 architecture, while each TC is a replication in hardware of MIPS32’s user-state application programming model. What they share are all the other elements that characterize a full-featured processor: the fetch and decode logic, the pipelines and the caches.

But to the application and the OSes, each VPE and/or TC looks like a fully featured CPU, which allows the core to handle multiple OSes, processes and threads concurrently. And since the VPEs share the same cache, the MIPS multithreading VPE design is inherently cache-coherent, something multicore designs have trouble maintaining. The memory subsystem is also improved: where a 24KE core can deal with no more than four outstanding load misses, the 34K can handle eight.
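One way to picture that division of state is as a small data structure. The sketch below is illustrative only: the class and field names are invented, the split of two TCs on one VPE and three on the other is an assumption (the article gives only the totals), and the OS names are borrowed from the set-top box example MIPS describes.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    """User-level state replicated per TC: program counter and registers."""
    tc_id: int
    pc: int = 0
    gprs: list[int] = field(default_factory=lambda: [0] * 32)  # MIPS32 has 32 GPRs


@dataclass
class VPE:
    """OS-visible state, modeled here only as a name, an OS and its TCs."""
    name: str
    os_name: str
    tcs: list[ThreadContext] = field(default_factory=list)


@dataclass
class Core:
    """A core whose VPEs and TCs all share one set of execution resources."""
    vpes: list[VPE]
    shared: tuple[str, ...] = ("fetch/decode logic", "pipelines", "caches")

    def logical_cpus(self) -> list[str]:
        """Each TC appears to software as a CPU under its VPE's OS."""
        return [f"{v.name}/TC{t.tc_id} ({v.os_name})"
                for v in self.vpes for t in v.tcs]


# Hypothetical 2 + 3 assignment of the five TCs to the two VPEs.
core = Core(vpes=[
    VPE("VPE0", "VxWorks", [ThreadContext(0), ThreadContext(1)]),
    VPE("VPE1", "Nucleus", [ThreadContext(2), ThreadContext(3), ThreadContext(4)]),
])
for cpu in core.logical_cpus():
    print(cpu)
```

Because `shared` belongs to the core rather than to any VPE, there is only one cache to keep consistent, which is the point Jones makes about coherency.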

“It’s common practice to have lots of tasks running on a single core and have it switch between them,” said Williston of BDTI. “The key advantage to the 34K is that it provides hardware to reduce the cost of switching between tasks to essentially zero, not counting the costs of any cache misses that might occur.”

The VPE/TC structure also seems able to reduce cache misses to a minimum and to give the programmer, and the software, the means to fill an open slot in the pipeline by switching to another context and inserting another task with no delay.

“Generally, a good half the cycles in almost any microprocessor design are lost to inefficiencies in the pipeline or due to memory access latencies,” said Carbone. “If the MIPS VPE/TC structures can capture a good portion of those wasted cycles you have literally doubled the performance of the processor with no additional cores, pipelines or higher clock rates, and at considerably lower power consumption than other approaches.”
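Carbone's arithmetic can be written out as a short formula. The sketch below assumes a fixed clock rate and that every recovered cycle does useful work; the function and parameter names are illustrative.

```python
def speedup(busy_fraction: float, recovered_fraction: float) -> float:
    """Throughput gain from reclaiming part of the wasted cycles.

    busy_fraction: share of cycles doing useful work today (0 < b <= 1).
    recovered_fraction: share of the wasted cycles put back to work (0..1).
    """
    if not 0.0 < busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be in (0, 1]")
    if not 0.0 <= recovered_fraction <= 1.0:
        raise ValueError("recovered_fraction must be in [0, 1]")
    wasted = 1.0 - busy_fraction
    return (busy_fraction + recovered_fraction * wasted) / busy_fraction


# Half the cycles wasted, as in Carbone's estimate:
print(speedup(0.5, 1.0))  # every wasted cycle recovered
print(speedup(0.5, 0.5))  # only half of them recovered
```

Doubling requires recovering all of the lost half. Recovering "a good portion" gives proportionally less, so the two-fold figure is best read as an upper bound under this model.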

Another compelling aspect of the MIPS approach, said Williston, is that the combination of the VPE, QoS and the DSP extensions will allow many designs which currently require separate RISC/DSP processors and separate OSes to run on the same core. “In the past general purpose tasks such as running Linux, a GUI, or a network stack, for example, usually ran on separate processors. Now it is more common to run both types of tasks on the same core. However, this can lead to some very complicated programming models.”

From this point of view, the 34K is very appealing because it lets the developer run both tasks on the same core, but allows the developer to run separate OS environments for general purpose and DSP tasks, one running on each VPE.

MIPS engineers have tested or simulated the operation of the core in a variety of OS environments and in both symmetric (SMP) and asymmetric (AMP) multiprocessing configurations. Examples include a multifunction printer running Linux, using one TC and one VPE for print operations and another set for scan operations, and a set-top box with VxWorks running the user interface in the control plane on one TC and one VPE, while Nucleus RTOS runs in the data plane on another VPE/TC set executing audio codec operations.

A particular bonus for embedded developers dealing with deterministic real-time applications is the optional QoS module that can be incorporated into the 34K core. “That allows a developer to ensure that real time tasks get enough processing time,” said Williston of BDTI. “This is an important feature because it makes it much easier to put DSP tasks, which are usually real time, on the same core as general purpose tasks.”
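The article does not describe how the QoS block or its policy manager decide which thread gets the pipeline, so the following should not be read as MIPS's mechanism. It is a generic weighted round-robin, shown only to illustrate what guaranteeing a share of issue slots to a real-time task can look like. The thread names and weights are hypothetical.

```python
def allocate_slots(weights: dict[str, int], slots: int) -> list[str]:
    """Hand out issue slots in proportion to each thread's weight.

    Uses smooth weighted round-robin: every slot, each thread's credit
    grows by its weight, the thread with the most credit is chosen, and
    the chosen thread pays back the sum of all weights.
    """
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be a non-empty map of positive ints")
    credit = {name: 0 for name in weights}
    total = sum(weights.values())
    schedule = []
    for _ in range(slots):
        for name, weight in weights.items():
            credit[name] += weight
        chosen = max(credit, key=credit.get)
        credit[chosen] -= total
        schedule.append(chosen)
    return schedule


# Hypothetical split: the DSP thread is guaranteed three slots in every four.
schedule = allocate_slots({"dsp": 3, "linux": 1}, slots=8)
print(schedule.count("dsp"), schedule.count("linux"))
```

The property that matters for real-time work is that the share is bounded over a short window, not just on average, so the DSP thread never waits more than one slot here regardless of what the general-purpose thread is doing.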

That said, Williston does not believe that the MIPS34K will be the right solution for all multimedia and network intensive applications. “In some cases the extra complexity won’t be worth the performance gain; in others, you won’t get much of a performance gain. A key unanswered question is: how easy will it be to use MIPS multithreading virtual processing element architecture? It is likely to be harder than MIPS would like us to believe.”

However, Express Logic’s Carbone has few such doubts, based on the experience that his company has had with adapting its ThreadX RTOS to the MIPS34K core as well as to the ARM MPCore and to Analog’s converged DSP/RISC design. “A developer building an application on a properly configured RTOS and thread aware tools should be able to port exactly the same code from the previous generation 24K MIPS core and run it with no problems and get some significant increases in performance.”

Carbone added that there are bigger questions relating to how extensible the new approach is: Can the number of TCs and VPEs be increased without substantially increasing the die area of the core? And can the approach be of benefit in designs which incorporate multiple VPE/TC-enabled 34K cores?

“Think about it: two or three MIPS34Ks on a single chip with two VPEs and five TCs per core would certainly have enough performance headroom to reduce the pressure to go to, say, five or 10 traditional cores,” he said. “Whether that is a viable option is a question that the developer community will probably have to answer.”

While MIPS plans to look at such alternatives in the future, for the near term it is keeping its focus on providing developers with an alternative to multiple cores to achieve the performance increases they need in current applications.

Other new features of the MIPS34K core include a nine-stage single issue pipeline; an optional policy manager for the QoS block; an optional inter-thread communication unit; an optional floating point unit; and several new instruction set extensions to allow OSes and programmers to take more advantage of the virtual processing element capabilities.
