Making life easier for multicore SoC software developers


Putting multiple processors on a single chip or on a single board has enabled embedded system hardware designers to provide more features and higher processing speeds using less power, thus solving many design problems. But for software developers – and vendors – this trend presents a daunting set of challenges.

In the embedded environment, the developer is no longer dealing with the familiar balanced, homogeneous and symmetric multiprocessing (SMP) model of servers and large computer systems. Rather, the designer of an embedded or mobile device now may have two, three or four processors to program and debug, a heterogeneous and unbalanced mix of RISC, DSP and network architectures, operating asymmetrically.

Hardware developers at recent conferences have bemoaned the lack of tools and building blocks that are up to the challenge of doing software development in multicore environments. However, many embedded software vendors believe they have solutions to most of the immediate problems. The big question is where to go next as the number of multicores and the heterogeneity of such designs increases.

ARM’s MPCore is a synthesizable multiprocessor based on the ARMv6 architecture that can be configured to contain between one and four processors, delivering up to 2600 Dhrystone MIPS of performance. (Source: Express Logic)

“As more processing elements are embedded into the silicon, the level of software complexity is outstripping the capability of traditional embedded software tools to efficiently develop application software code and to manage the system,” said Sven Brehmer, president and CEO at Polycore Software.

However, according to Robert O’Shanna, software engineering manager at Texas Instruments’ DSP Group, the problem may not be a lack of tools and building blocks but rather too many of them. “With diverse solutions, no commonalities and no industry standards on methods and procedures,” he said, “vendors are beginning to appear with a wide range of OS- and hardware-specific solutions. And those that offer some degree of platform independence require the use of frameworks and methodologies that are unfamiliar to many developers.”

Answering the questions about tools
Tool vendors are responding to a number of questions that developers have been asking. At the application level in a multiple-CPU environment, how do you efficiently write code for multiple targets? At the OS level, what is the best mechanism for managing multiple OSes on multiple CPUs? At the support level, how do you debug in this environment?

For the first two questions, the answer is multiple-choice: the use of message-passing mechanisms to hide the multiple CPUs and the multiple partitions that must be managed; the use of a common API, which the programmer uses as the target; development or adaptation of system-level modeling tools to sort out the software implementation complexities; and a shift from traditional and familiar sequential procedural programming languages to functional programming and a move to higher levels of abstraction.
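To make the first of those answers concrete, the following is a minimal sketch, in plain C, of what such a message-passing layer's common API might look like. The function names (`msg_send`, `msg_recv`) and the in-memory FIFO are purely illustrative, not any vendor's interface: the point is that the application addresses logical endpoints rather than physical cores, and the framework is free to route each message over whatever core-to-core transport exists underneath.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical common messaging API: the application names a logical
 * endpoint ("dsp0", "ctrl"), never a physical core.  This sketch just
 * queues messages in a single in-memory FIFO; a real framework would
 * route them across an inter-core transport. */

#define MAX_MSGS    16
#define MAX_PAYLOAD 32

typedef struct {
    char   dest[8];               /* logical endpoint name */
    char   payload[MAX_PAYLOAD];
    size_t len;
} msg_t;

static msg_t queue[MAX_MSGS];
static int head, tail;

/* Enqueue a message for a logical endpoint; returns 0 on success. */
int msg_send(const char *dest, const void *data, size_t len)
{
    int next = (tail + 1) % MAX_MSGS;
    if (len > MAX_PAYLOAD || next == head)
        return -1;                /* payload too large or queue full */
    strncpy(queue[tail].dest, dest, sizeof queue[tail].dest - 1);
    queue[tail].dest[sizeof queue[tail].dest - 1] = '\0';
    memcpy(queue[tail].payload, data, len);
    queue[tail].len = len;
    tail = next;
    return 0;
}

/* Dequeue the oldest message; returns its length, or -1 if empty. */
int msg_recv(char *dest_out, void *data, size_t maxlen)
{
    if (head == tail)
        return -1;
    size_t n = queue[head].len < maxlen ? queue[head].len : maxlen;
    strcpy(dest_out, queue[head].dest);
    memcpy(data, queue[head].payload, n);
    head = (head + 1) % MAX_MSGS;
    return (int)n;
}
```

Because the application only ever calls `msg_send` and `msg_recv`, the same source can be recompiled for a different partitioning without touching application logic, which is exactly the portability the middleware vendors quoted below are after.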

According to Joseph Dubin, product manager at the Freescale Semiconductor Developer Technology Organization, the issues, while complex, are not unfamiliar and have been faced before. “Multiple CPU environments are common at the board and chassis level and numerous techniques for managing software in that environment have emerged,” he said. “What is different now is that some applications such as handheld and some embedded consumer applications are turning to the use of multiple CPUs on the same chip. While it will require some modifications of these techniques, there is a clear path forward with few unknown problems that cannot be addressed.”

Brehmer, at Polycore, on the other hand, believes that small-footprint embedded consumer devices shifting to multiple, heterogeneous processors face problems that differ not only in degree of difficulty but in kind.

“There are a number of applications in networking and blade environments where multiprocessing is possible with existing tools,” he said. “But because of power requirements, many embedded designs in consumer and mobile are constrained to operate using an asymmetric model to get the most efficient use of the silicon. Additionally, they by necessity require different types of processors, with multiple instantiations of RISC and DSP, which are difficult to operate in SMP mode.”

In most multiprocessor SoC designs, the cores have separate level 1 caches, but share a level 2 cache, memory subsystem, interrupt subsystem, and peripherals, requiring that the system designer give each core exclusive access to certain resources and ensure that apps running on one core don’t access resources dedicated to the other. (Source: QNX)

There is also the problem of software partitioning, crucial in multiprocessor designs. “Too much communication between multiple CPUs negates the advantages of multiprocessing,” said Brehmer. “Currently, most software partitioning in mainstream applications assumes a subprocessor configuration in which the application space is divided into two main parts—the control and user interface, and non-stop signal processing.”

Such models are built around shared memory architectures in which partitioning is done at an early stage. In many consumer designs, partitioning assumes the RISC engine is the main processor with the DSP configured as a peripheral. “The disadvantage is that although the subprocessor (the DSP) has a substantial portion of the computing tasks, it is blocked more often, waiting for commands from the main processor, negating much of the advantage of parallel processing offered by the use of multiple processors.”
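The blocking the quote describes can be seen in a toy version of the subprocessor model. The sketch below is purely illustrative (the mailbox, command codes and function names are invented, not any vendor's API): the RISC host posts a command word to shared memory, and the DSP side busy-waits until one arrives. On real hardware that spin loop is exactly where the subprocessor burns cycles waiting on the master.

```c
/* Illustrative sketch of the RISC-master / DSP-peripheral command model:
 * the host writes a command to a shared mailbox word, and the DSP blocks
 * until one appears.  All names here are hypothetical. */

enum { CMD_NONE = 0, CMD_FILTER, CMD_FFT, CMD_STOP };

static volatile int mailbox = CMD_NONE;   /* shared memory word */

/* Host (RISC) side: post a command for the subprocessor. */
void host_issue(int cmd)
{
    mailbox = cmd;
}

/* DSP side: spin until the host posts a command, then consume it.
 * In this single-threaded sketch the command is always already posted
 * before we wait; on real hardware the DSP idles here, wasting the
 * parallelism the second core was supposed to provide. */
int dsp_wait_for_command(void)
{
    while (mailbox == CMD_NONE)
        ;                                 /* blocked, doing no useful work */
    int cmd = mailbox;
    mailbox = CMD_NONE;
    return cmd;
}
```

A peer-to-peer framework of the kind Brehmer advocates later in the article avoids this structure by letting each core pull work independently rather than stalling on a single master.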

John Carbone, vice president of marketing at Express Logic, acknowledges that there are some real issues facing developers long term that the embedded software industry needs to address, as application needs and limits on fabrication technology force the use of more than two or three CPUs on a chip.

“But near term, there is much we can do with what we know now,” he said, “by selecting carefully the hardware platforms and the programming models we use.”

For example, while a wide range of applications will require new asymmetric multiprocessing (AMP) programming models, in which heterogeneous CPUs are tasked to specific kinds of operations, there are many situations in embedded mobile applications in which the better understood symmetric multiprocessing model used in servers and large computing systems can be employed.

A problem in the heterogeneous environment with a mix of DSP and RISC engines is that it is difficult to employ SMP to balance loads, sharing application processing over multiple CPUs. At the hardware level, Carbone said, this problem is ameliorated by the emergence of two trends: (1) the creation by companies such as Analog Devices, Atmel, Freescale and Texas Instruments of so-called converged or hybrid architectures that combine DSP and RISC operations into a single architecture; and (2) the use of brute force RISC architectures with some additional DSP instructions and pipelining in which MIPS are thrown at the application.

SMP, AMP, or some combination?
In both cases, an SMP environment can be created in which the programmer can assign tasks without being concerned whether they need to be targeted at RISC cores or DSP cores. In the second case, by throwing three or four CPUs at a problem, many DSP problems can be addressed, albeit less efficiently: in a three-core design, for example, the performance improvement will be on the order of 1.5x to 2x rather than 3x.
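That 1.5x-to-2x figure for three cores is consistent with Amdahl's law, which bounds speedup by the fraction of the workload that can actually run in parallel. The helper below (a worked illustration, not from the article) shows that a three-core speedup of exactly 2x corresponds to a workload that is 75 percent parallelizable, and a half-parallel workload manages only 1.5x.

```c
/* Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the
 * parallelizable fraction of the work and n is the number of cores. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}
```

The serial remainder, not the core count, quickly becomes the limit: even a fully 75-percent-parallel workload can never exceed 4x no matter how many cores are added.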

In such quasi-SMP environments, existing OSes and RTOSes can be retained, with some modifications. Within Express Logic, Carbone said, enhancements have been incorporated into its ThreadX RTOS that allow it to operate in two-, three- and four-CPU SMP environments in which one CPU is assigned a “first among equals” status as the gateway through which all operations are conducted, while the other CPUs take on tasks on an as-needed basis. If the first CPU is busy, it hands off tasks to any CPU that is available.

While it does not yield the performance and power consumption improvements that a fully tooled AMP environment could theoretically yield, he said, the modified SMP configuration that ThreadX allows does seem to yield improvements on the order of 60 to 85 percent, without requiring that the programmer abandon the familiar single CPU, single programming environment.
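The dispatch policy Carbone describes reduces to a simple rule, sketched below. To be clear, this is a generic illustration of the "first among equals" idea, not the ThreadX implementation or API: the gateway core takes the task if it is free, and otherwise hands it to the first idle core.

```c
#define NCORES 4

/* Pick a core for the next ready task under a "first among equals"
 * policy: core 0 is the gateway and gets the task if idle; otherwise
 * the task is handed off to any other idle core.
 * busy[i] != 0 means core i is currently running a task.
 * Returns the chosen core index, or -1 if all cores are busy. */
int pick_core(const int busy[NCORES])
{
    if (!busy[0])
        return 0;                 /* gateway core is free: it runs the task */
    for (int i = 1; i < NCORES; i++)
        if (!busy[i])
            return i;             /* hand off to the first idle core */
    return -1;                    /* all cores busy: task must queue */
}
```

The appeal of the scheme is that application code sees a single-CPU programming model; which core a task lands on is decided entirely inside this dispatch step.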

“Not all RTOSes are amenable to this kind of modification,” said Dave Kleidermacher, vice president of engineering, Green Hills Software. “If they are top heavy with services and functions and already pushing hard to maintain deterministic operation they will have no headroom to handle the modifications that must be made. What you need are spare, lean RTOSes with extremely small kernels and with thread and interrupt structures well adapted to operating in complex and deterministic hard real time environments.”

Eventually, said Express Logic’s Carbone, the embedded software industry will have to address the hard issues and develop tools appropriate to writing code that operate efficiently in heterogeneous, asymmetric computing environments. “We have no choice about it. Where the builders of the CPU hardware go, we must follow, with the tools and frameworks that allow developers the simplicity of a single CPU program model but allow development of code that efficiently operates on any CPU in a multiprocessing environment.”

Message-passing frameworks emerge
However, until something better comes along, the trend is toward the use of a message-passing middleware framework that manages transactions amongst multiple CPUs and multiple RTOSes, or multiple instantiations or images of a single RTOS, while presenting the developer a single common API to which to program.

QNX Software Systems offers its messaging-based Balanced Multiprocessing (BMP) Environment; Enea Embedded Technology has an upcoming version of its AMP-optimized RTOS that uses its Element message-passing framework; and PolyCore has its RTOS-independent PolyMessenger framework. Other companies such as Wind River and a number of companies in the Linux community are opting for a message-passing middleware standard called the Transparent Inter-Process Communication (TIPC), a TCP/IP-derived protocol originally designed to aid communications between clusters of SMP-based servers.

According to Tom Barrett, president of Quadros Systems, Inc., as multicore and multiprocessor-based SoCs become more common, with multiple tasks per core, message-based communications systems help to manage the complexity. “This approach also frees the application developer from working on the hardware, allowing him or her to focus on the actual application code. This is because an IPC framework using messages shields the application from the hardware and at the same time bridges different processor types running different OSes.”

The model PolyCore favors, said Brehmer, is a peer-to-peer framework in which cores are independent of each other and work in parallel. This results in higher performance as each of the cores can continue to work even if another one is waiting for data. Also, by recompiling a task onto a different core, developers can test different partitioning to determine the best configuration.

Working to stay ahead of its competition in this area, the company is now working on a next-generation peer-to-peer mechanism that allows a developer to write code in the normal sequential manner and automatically split it by specifying split points in a separate configuration file. This would allow parts of the code to run in parallel, with communications handled automatically by the system at run time. “This would allow the developer to continue using standard C, while the configuration, separate from the program, allows for different ways to partition the code,” said Brehmer.
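No syntax for such a configuration file has been published; purely as a hypothetical sketch of the idea Brehmer describes, a split-point file might map functions in ordinary sequential C code to cores, leaving the source itself untouched:

```
# Hypothetical partitioning file (illustrative only, not PolyCore syntax).
# The C source stays standard and sequential; only this file changes
# when the developer wants to try a different partitioning.

split decode_frame   -> core dsp0    # signal-processing hot loop
split render_ui      -> core risc0   # control and user interface
split net_rx_handler -> core risc1   # packet handling, runs in parallel
```

Because repartitioning means editing this file and recompiling, rather than restructuring the C code, the developer can experiment with placements until profiling shows the best balance.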

Finding the right debug tools for multicores
A question that is the subject of much debate is whether or not the current generation of software tools and building blocks are appropriate to the new application software code needs of multicore SoCs and multiprocessor-based designs. Of particular concern to Carbone at Express Logic are debug tools. “As complex as issues of code development and system partitioning are,” he said, “they are nothing like the can of worms opened up at the debug level. Multiple CPUs mean multiple OSes, or multiple instantiations of the same OS, plus debug issues on each target and in the middleware used to manage and coordinate operations amongst the CPUs.”

Dubin at Freescale, on the other hand, believes that many of the code development and debug issues facing developers in the current generation of multicore chip designs are solvable, if not with current generations of tools, at least with straight-line extrapolations of existing techniques.

TI’s O’Shanna walks the line between these points of view. While current debug methodologies are adequate for the short term and can be extended to address many of the problems that will arise, he said, they will be inadequate as the number of processors increases. “We are already seeing designs with two and three processors on a chip, and a number of CPU architecture licensees are already talking about SoC designs with six and seven cores on a chip.”

But O’Shanna believes it is not the software vendors who will need to solve this problem, but the hardware suppliers. Indeed, the entire industry will have to come up with extrapolations of JTAG and the NEXUS debug interface specs that are now the standards. “We just do not have sufficient visibility inside the chip to allow us to collect the information we need to debug in such a complex environment,” he said. “At TI we are actively pursuing a number of alternatives, but it is going to take industry wide effort to solve this vexing problem.”

Efforts seem to be moving in the direction of dealing with some of O’Shanna’s concerns. Under way in the Eclipse community, said Maarten Koning, principal technologist, office of the CTO at Wind River Systems, Inc., is a subproject, under a Device Software Development Platform (DSDP) effort proposed by the company, devoted to the creation of a common debug model. It is aimed at coming up with interfaces and views that will work with the many different debug engines that support conventional RISC, digital signal (DSP) and network processors. Brehmer at PolyCore said that a still-forming effort called the Multi Processing 'Aficionados' is in its beginning stages; the group is currently debating the viability of some common industry-wide standards and APIs for multiprocessing.

They are also planning to hold a three-day conference to educate the engineering community and engineering executives about multicore and multiprocessor designs – how to implement them, how to use the available software tools and operating systems to produce the application, and how to interpret and prepare for the upcoming trends.

Platform-specific or platform-independent tools?
All this effort may come to naught, believes Koursh Amiri, director of product development at multicore DSP vendor Cradle Technologies, Inc., who has come to the conclusion that platform-independent tools and methodologies are useful only up to a point; in the next generation, it may be necessary to more tightly link them to the specific hardware, even to specific applications.

“While it will be possible to use familiar languages, methodologies and development tools, they will require multicore hooks specific to the underlying architecture and to the application,” he said. “In debugging, for example, the typical approach—one processor, one OS, one tool—is impractical.”

In the multicore environment, said Amiri, it is necessary to place multiple breakpoints and have a specific processor, or all of the processors, respond to a given breakpoint.

“In some cases the developer will need to know which processor or processors have hit a breakpoint,” he said. “And after hitting a breakpoint, developers will sometimes want one core to single-step, or all the cores to do so in unison, or each in turn. During the process of performance analysis, in addition to the traditional metrics (such as the number of cycles spent on a procedure and the number of cycles each line of code took), multicore developers will also need to understand processor-to-processor interactions, and take into account the time spent waiting and memory utilization.”
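The extra metrics Amiri lists amount to per-core accounting that tracks waiting as a first-class cost. As a hypothetical sketch (the struct and function names are invented for illustration), a profiler targeting a multicore part might keep, per core, both busy cycles and cycles spent stalled on another core, and report the wait ratio:

```c
/* Illustrative per-core profiling record: cycles doing useful work
 * versus cycles stalled waiting on another core (e.g. on a mailbox
 * or shared resource).  Names are hypothetical, not a real tool's API. */
typedef struct {
    unsigned long busy;      /* cycles executing the application */
    unsigned long waiting;   /* cycles blocked on another core   */
} core_stats_t;

/* Fraction of this core's time lost to inter-core waiting. */
double wait_fraction(const core_stats_t *s)
{
    unsigned long total = s->busy + s->waiting;
    return total ? (double)s->waiting / (double)total : 0.0;
}
```

A single-core profiler has no reason to record the `waiting` column at all, which is why, as Amiri argues, traditional tools fall short once processor-to-processor interactions dominate performance.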

All of this will require a much tighter linkage between specific tools and specific hardware architectures. Given the necessity for such tight linkages, Amiri is not surprised that companies such as TI and Freescale have put a lot of effort into developing tools specific to their architectures and into acquiring software companies (such as Freescale’s acquisition of Metrowerks) to give them that capability as they proceed to second-generation multicore SoC designs.
