Virtual multi-cores simplify real-time system design

It is difficult to implement a combination of multiple real-time tasks on a traditional processor. For this reason, an FPGA and hardware design techniques have typically been used instead. At the same time, multi-core design methods have become familiar to many engineers through the use of multiple microcontrollers and processors to construct real-time systems.

In this article we discuss the key properties of emerging multi-core systems that make them an appealing and elegant platform for complex real-time tasks.

Real-time hardware
From the design perspective, the biggest single advantage of using a hardware implementation is that hardware offers composability. In short, composability means that when there are two functional blocks, F and G, that perform their tasks in times Tf and Tg, there are simple mathematical rules that allow us to predict the time and resources required to perform both F and G, either in series or in parallel.

As a first approximation, when F and G are performed in parallel on hardware, this will take time max(Tf,Tg). If F and G are performed in sequence in a hardware implementation, the time will be Tf+Tg. Similar simple rules can be devised for the resources required: if the two functional blocks are implemented using resources Rf and Rg, then the parallel composition will require resources Rf+Rg, and the sequential composition will require at most Rf+Rg resources.
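To illustrate with made-up numbers: suppose F completes in Tf = 3 µs using Rf = 2,000 logic cells, and G completes in Tg = 5 µs using Rg = 3,000 cells. The parallel composition then takes max(3,5) = 5 µs and uses 2,000 + 3,000 = 5,000 cells; the sequential composition takes 3 + 5 = 8 µs and uses at most 5,000 cells (fewer if logic can be shared between the two blocks).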

It is these simple models that allow us to apply sound reasoning to hardware designs, and compose designs out of basic blocks. Real-time behaviour can to a large extent be predicted in advance, provided that the resources are available. Hardware can be scaled up, for example by using a larger FPGA; we size the hardware to suit the task at hand. A consequence of this is that FPGA manufacturers have to sell devices containing different numbers of cells, so that embedded systems designers can use the smallest FPGA that suits their needs, thereby minimising costs.

Physical constraints and layout tools mean that composability has its limits, though. When laying out two parallel tasks, routing and placement may cause either F or G to slow down, or to use more resources than expected.

Real-time software
A single real-time task can easily be written on a software-based system that uses a general-purpose processor. Even without any dedicated hardware support, data can be read or produced in real time by calibrating the software speed against the required hardware speed. As an example, one of the first cheap home computers, the Sinclair ZX80, used this principle to generate its TV output using just a Z80 microprocessor and no dedicated video hardware.

However, there is not always a simple model for composing two or more real-time software components. Running two real-time tasks in a software-only environment is comparatively complex: it usually requires some form of RTOS and requires the tasks to cooperate with each other. In other words, task F may have to be redefined in the light of the requirements of task G, which means that composing two tasks on a single processor is not as easy as it is on an FPGA.

The most difficult aspect is trying to predict the performance of the combined task. If processors have components that work on statistical principles, for example a cache that usually contains the required data, then performance prediction is virtually impossible without special partitioning hardware.

Indeed, if two jobs with large memory footprints are executed on a dual-core processor, both tasks can run significantly slower, simply because they are thrashing each other's cache footprint.

Alternatively, it may turn out that both run as if they have the cache to themselves, because neither of them needs the cache. Similarly, the time taken by scheduling algorithms is not completely predictable, although the effect of this is usually not visible unless the program requires lots of small tasks to be completed. It is the lack of predictability that causes the system not to be composable. Together these factors affect the programmability of the system.

A particular problem when running multiple software real-time tasks is timing I/O signals precisely. Two I/O devices may demand the processor's attention at (almost) the same time, and even though the real-time task will know exactly when the first input signal arrived, it will not be able to timestamp both signals without additional hardware in the I/O devices.

Compared with the FPGA solution, software solutions scale in a different way. The amount of memory can usually be adjusted to the needs of the task, but the amount of processing power is usually preset: one buys a 2.4 GHz Pentium or a 20 MHz PIC.

The trial-and-error nature of composition means that the designer has to iterate through the design cycle many times. The biggest advantage of software solutions is that the design cycle is very short, but this advantage can easily be negated by the need to iterate many times.

Multi-core
The composability properties of software can be improved dramatically by viewing a processor as a much smaller and cheaper entity that can, like hardware, be scaled up. In this model each real-time task is still designed in software, but instead of those tasks sharing a processor under some form of RTOS, each task is allocated its own dedicated hardware, either in the form of a physical core or a real-time thread.

Figure 1. A single XCore processor tile.

If each core has guaranteed real-time properties, then performance prediction becomes trivial: each additional task simply gets an additional core. This philosophy is close to the way that microcontrollers are used – very often a board contains several microcontrollers, each of which takes care of a specific real-time task defined in software.

However, it is usually not cost-effective to allocate a single core per real-time task, and hence we can share a physical core between a number of threads, each of which is a virtual core with a guaranteed slice of the performance of the physical core. (We will use the term core throughout to refer to either physical or virtual cores.)

An example of a core that can run multiple real-time threads is the XCore processor, shown in Figure 1 above, which can run up to eight real-time tasks. As shown in Figure 2 below, multiple tiles are connected by means of a switch, and multiple switches can be connected together to form a network of cores; N tiles can run 8N real-time tasks.

As in a hardware design-flow, the designer can size a multi-core solution by acquiring extra cores in order to match the required processing resources. This combines the short development cycle that software offers with the simple performance model and sizing properties that hardware offers. The cores are combined in a manner that suits the problem being solved. The network is constructed so that its structure closely matches the data-flow between the real-time tasks.

Figure 2. A network of XCore tiles.

In this model, tasks are allocated to the cores by the designer, which removes a large part of the unpredictability of placement at the expense of putting a small burden on the designer. The rationale is that the designer knows how the tasks interact, and hence how they ought to be placed. This is relatively little work, since there are typically only a dozen or so tightly coupled tasks.
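As a minimal sketch of what this explicit allocation can look like in XC (the XMOS C derivative used for the example later in this article), the top level of a program might pin each task to a core as follows. The task names are invented, and the 'on stdcore[n]:' placement notation is an assumption that depends on the toolchain version:

    // Task functions, assumed to be defined elsewhere in the program.
    void readSensor(chanend toFilter);
    void filterSamples(chanend fromSensor, chanend toOutput);
    void driveOutput(chanend fromFilter);

    int main(void) {
      chan sensorToFilter, filterToOutput;             // channels mirror the data-flow
      par {
        on stdcore[0] : readSensor(sensorToFilter);    // hard real-time input task
        on stdcore[1] : filterSamples(sensorToFilter, filterToOutput);
        on stdcore[1] : driveOutput(filterToOutput);   // shares a physical core with the filter
      }
      return 0;
    }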

A multi-core solution is then very appealing: if the system designer can assume that cores are expendable (in that cores are cheap, low power, independent, and easily integrated), then designers can express a real-time system as a collection of software real-time tasks.

Each core must guarantee that it can run at least one task with hard real-time properties. It is the combination of a multitude of low-power cores that implements the complete system.

Similar to buying an FPGA of the right size, this method of design scales by acquiring a system with the right number of (virtual) cores. For example, an MP3 player may require five cores: one running the flash-memory interface, one running the decoder, two running equalisers for the left and right channels, and one running the user interface.

Each core can be fairly low performance; for example, the two equaliser tasks may require 50 MIPS each. If the physical hardware offers four virtual cores per physical core, then two physical cores are required.
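The sizing arithmetic is straightforward: five tasks at four virtual cores per physical core means ceil(5/4) = 2 physical cores. If, for illustration, a physical core delivers 400 MIPS shared equally across its four virtual cores, each virtual core is guaranteed at least 100 MIPS, which comfortably covers the 50 MIPS that each equaliser needs (the 400 MIPS figure is an assumption used only to show the arithmetic, not a property of any particular device).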

The multi-core solution offers a series of extra advantages: errors can be contained, designs can be split and made future-proof, and timing of I/O becomes easier.

Containment of errors
An important benefit of a multi-core solution is that it provides a natural boundary for containing hardware and software errors. On machines with a traditional operating system (such as UNIX), the operating system ensures that a process that fails will not affect other processes. For example, a program that tries to write to a memory address which is not part of its address space will be sent a signal and terminated.

In systems that do not provide memory protection, any misbehaving task can, either accidentally or on purpose, destroy other tasks. For example, a bit error in a received memory address may make a task write into the stack of another process, propagating the error through the system and causing non-deterministic behaviour in a growing number of threads.

This is unacceptable in safety critical systems, but also highly undesirable in many systems that are not safety critical. Taking the example of the MP3 player shown earlier: it would be undesirable if an error in any process could cause the flash-memory process to reformat the flash memory device.

Even though the designer cannot make it impossible for an error to cause the flash-memory to be formatted, they can make it less likely by containing errors inside each task. If one of the equalisers fails, one of the channels will stop playing, and action can be taken to recover from that error, for example by restarting the whole system, or maybe by restarting the failed task from a checkpoint.

Like a multi-core solution, hardware systems can be designed so that errors are contained. This is achieved by explicitly designing the interface between the hardware components to stop errors from propagating through the system.

Future-proof designs
Defining a complex real-time task as a collection of small real-time tasks has an additional benefit in that each of the subtasks is specified as an individual timed process. The task has a specification that defines its functionality, communication pattern, and timing behaviour. The software implements the functionality and communication, and is designed to run fast enough to adhere to the timing requirements.

Because of the strict separation of tasks, any task can at a later stage be replaced with a different implementation that follows the same communication pattern and fulfils the same timing constraints. This new implementation may, for example, have improved functionality or use less power. The replacement can be slotted into the existing system without affecting any of the other tasks.
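A minimal sketch in XC of what such a replaceable task might look like (the function name, channel layout, and fixed-gain body are all invented for illustration): any later implementation with the same signature, channel protocol, and timing budget can take its place.

    // Hypothetical equaliser task: the interface is the channel protocol plus the
    // timing budget; any implementation honouring both can replace this one later.
    void equaliser(chanend samplesIn, chanend samplesOut) {
      while (1) {
        int sample;
        samplesIn :> sample;          // one sample in from the decoder
        sample = (sample * 3) >> 2;   // placeholder processing: a fixed gain of 0.75
        samplesOut <: sample;         // one sample out to the audio-output task
      }
    }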

Figure 3. Design example: a complex MP3 player can be split into nine tasks.

As an example, we study the design of a slightly more complicated MP3 player, intended for in-car use, that adapts its equalisation and amplitude based on ambient noise levels. The application logically splits into nine tasks (as shown in Figure 3 above):

* Flash interface: reads files from flash memory over an SPI interface, and writes data to flash memory when tracks are uploaded.
* USB interface: reads commands from the USB bus and transfers data to and from the flash interface.
* MP3 decoder: gets blocks from the flash interface and decodes them into two streams of samples (left and right channels).
* Ambient sound sampler: samples a microphone at regular intervals.
* Discrete Fourier Transform: computes the energy levels in each frequency band.
* Left and right equalisers: selectively amplify frequency bands in the left and right channels, respectively, based on the noise levels.
* Left and right audio-output: each outputs sound on a 1-bit DAC.

Five of these tasks have hard real-time I/O constraints: the flash and USB interfaces, the sampler, and the two 1-bit DACs. The other components have softer constraints, in that they merely have to keep up with demand. Each of the tasks can execute on its own core, or its own real-time thread. Some of the tasks require more processing power than others, and the threads need to be sized accordingly.

As an example, we show part of the implementation in Figure 4 below. This implementation is in the XC language, a derivative of C developed by XMOS, which allows the system designer to manage time, I/O, communication, and parallelism within one simple C-like framework.

Figure 4. Sample code fragment for the MP3 application implemented in XC language.

The ways in which the different software tasks interact with each other and with the I/O pins are highlighted with markers A-E:

A) declares that a physical input port (an 8-pin port) will be used in the program, under the name input, to read data from.
B) reads the current time and stores the value in a variable.
C) shows how data can be sampled, and how we can wait for a specific time. The specific time is dictated by the variable now; the sampled data is stored in sample.
D) communicates the sampled data to the next process.
E) shows a statement that executes four parallel activities, interconnected by means of three communication channels. These four activities are represented by the four red boxes shown in Figure 3.
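Since Figure 4 is reproduced only as an image, the following is a minimal sketch of what a fragment with the structure described by markers A-E might look like in XC. The port identifier, the sample period, and the dft and equaliseAndOut signatures are assumptions, not the actual listing:

    #include <xs1.h>

    #define SAMPLE_PERIOD 12500                  // assumed: 125 us at the 100 MHz reference clock

    in port inputPort = XS1_PORT_8A;             // A: the 8-pin physical input port ('input' in the text)

    // The DFT and combined equalise/output tasks, assumed to be defined elsewhere.
    void dft(chanend fromSampler, chanend toLeft, chanend toRight);
    void equaliseAndOut(chanend fromDft);

    void soundSampler(chanend toDft) {
      timer tmr;
      unsigned now;
      int sample;
      tmr :> now;                                // B: read the current time into 'now'
      while (1) {
        now += SAMPLE_PERIOD;
        tmr when timerafter(now) :> void;        // C: wait until the time held in 'now'...
        inputPort :> sample;                     //    ...then sample the port into 'sample'
        toDft <: sample;                         // D: communicate the sample to the next process
      }
    }

    int main(void) {
      chan a, left, right;                       // E: three communication channels...
      par {                                      //    ...connecting four parallel activities
        soundSampler(a);
        dft(a, left, right);
        equaliseAndOut(left);                    // equalise and real-time output, left channel
        equaliseAndOut(right);                   // equalise and real-time output, right channel
      }
      return 0;
    }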

Other blocks follow a similar structure. The equaliseAndOut function creates two processes: one to equalise and one to perform real-time output. What remains is a matter of distributing the program over the cores and making sure that enough resources are available.

Conclusions
Systems comprising many small cores are well suited to handling complex real-time tasks. They offer a software development cycle in which real-time tasks are allocated to (virtual) cores, where each core is guaranteed to adhere to the real-time constraints. Tasks that are not real-time critical can be executed using a more conventional scheduler.

This approach is scalable in that more cores can be added in order to increase computational power. The scalability is limited only by the capabilities of the network, and a network topology should be employed that has the desired bandwidth and latency. Since the performance and execution of all tasks are independent, this programming model allows proper modularisation and offers containment of errors.

Compared with a traditional software design flow, a multi-core system offers fine-grained scalability (using many small cores) and predictability of performance. Compared with a hardware design flow, a multi-core system enables the designer to predict system performance at design time, rather than only after place-and-route.

Henk Muller has a Doctorate in Computer Science from the University of Amsterdam in the Netherlands and was a Reader in Computer Science at the University of Bristol where he headed the Mobile and Wearable Computing group. He is now the Principal Technologist at XMOS Limited. Contact him at henkm@xmos.com .
