The authors discuss the pros and cons of the OMAP’s Programmable Real-Time Unit in the second of a three-part series on porting the Windows CE 6.0 R2 embedded operating system to the Texas Instruments ARM-based OMAP-L138 family of processors.
Part 2: The Pros and Cons of OMAP's Programmable Real-Time Unit
In addition to the capabilities discussed in Part 1 of this series, an important feature of the OMAP-L138 SoC family, and one of enormous benefit to a developer, is a separate subsystem, the Programmable Real-Time Unit (PRU). The PRU is based on two 32-bit cores, each with its own memory for instructions and data. Applications of this subsystem (PRUSS) are diverse, such as implementing additional interfaces or servicing existing interfaces to handle specific protocols, acting as an auxiliary to the DSP or ARM core.
Figure 6 shows the general structure of the PRU subsystem (PRUSS). This subsystem contains two independent 32-bit cores with their own instruction set, independent of either the DSP or ARM cores.
These cores have a simplified RISC architecture that supports 40 instructions with a deterministic execution time (one cycle each) and offers enhanced capabilities for manipulating bits in registers. The cores have no instruction pipeline, no interrupt vectors, and no instructions for hardware multiplication and division; all interrupts are handled by polling a flag in one of the registers.
The PRUSS also has its own interrupt controller that aggregates events from the peripherals and from the ARM, DSP, and PRU cores. This controller can handle 32 events in both directions: from the PRUSS to the ARM and DSP cores, and from the ARM and DSP cores to the PRUSS. The interrupt controller can thus forward any event to the corresponding interrupt controller of the ARM or DSP core, which in turn invokes that core's interrupt handler if the interrupt is enabled in the respective registers. Using the PRU interrupt controller, it is possible to implement simple interaction not only between the PRU cores, but also between the ARM and DSP cores.
Figure 7 shows the structure of the PRU subsystem. Within it, each PRU core contains 32 registers, an execution unit, a table of 29 constants, and 4 KB of instruction RAM. Independent fast input/output ports (GPIOs) associated with each core are connected directly to two registers, allowing the developer to use either the core's own communications interfaces or the GPIOs to implement standard interfaces such as UART, CAN, or ProfiBus.
Instruction RAM is located within the core itself, which allows every instruction to execute in a single cycle. All four SoC cores (ARM, DSP, PRU0, and PRU1) have access to the data RAM, but each PRU core can execute code only from its own instruction RAM, even though both PRU cores have access to all peripherals via the central bus.
The availability of such embedded instruction and data RAM allows the developer to offload the SoC central bus and implement interaction with the peripherals, mDDR/DDR2 memory, and ARM/DSP cores with minimal load.
Optionally, the system can manage the power and clocking of the PRU subsystem. The subsystem is clocked at half the ARM core frequency. This means that when the ARM core is operating at 450 MHz, the PRU cores run at 225 MHz (about 4.4 ns per instruction). The power manager allows the PRU subsystem to be stopped or disabled when it is not needed, thus reducing the SoC's overall power consumption.
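The timing arithmetic above is easy to reproduce when budgeting PRU cycles for a protocol. The following is a minimal sketch in C, assuming only the relationship stated in the text (PRU clock equals half the ARM clock, one instruction per cycle); the function names are hypothetical and are not part of any TI library.

```c
#include <stdint.h>

/* The PRU subsystem is clocked at half the ARM core frequency (per the text). */
static uint32_t pru_clock_hz(uint32_t arm_clock_hz)
{
    return arm_clock_hz / 2u;
}

/* Duration of one PRU cycle in picoseconds, rounded to the nearest value.
 * Returns 0 if the resulting PRU clock would be 0 Hz. */
static uint32_t pru_cycle_ps(uint32_t arm_clock_hz)
{
    uint32_t pru_hz = pru_clock_hz(arm_clock_hz);
    if (pru_hz == 0u)
        return 0u;
    return (uint32_t)((1000000000000ULL + pru_hz / 2u) / pru_hz);
}

/* Number of whole PRU cycles (single-cycle instructions) that fit into a
 * time budget given in nanoseconds. */
static uint64_t pru_cycles_in_ns(uint32_t arm_clock_hz, uint64_t budget_ns)
{
    return (uint64_t)pru_clock_hz(arm_clock_hz) * budget_ns / 1000000000ULL;
}
```

For a 450 MHz ARM clock, pru_cycle_ps gives 4444 ps per instruction, and pru_cycles_in_ns shows that a 1 µs time slot leaves room for 225 PRU instructions.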
There is no official C compiler for the PRUSS, nor, as far as we were able to determine, any official support in TI's Code Composer Studio. Despite that, it is possible to configure the Code Composer Studio environment to build the PRU code automatically, which makes program development more convenient and keeps everything in one project.
To build the executable code, a specialized version of the open source PASM assembler is used; PRU programs are written in assembly language. An example of the code for the PRU0 core is shown below:
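// Clear the constant table programmable pointer registers
MOV r0, 0x00000000
MOV r1, CTPPR_1
ST32 r0, r1
MOV r0, 0x00000000
MOV r1, CTPPR_0
ST32 r0, r1
// Keep the addresses of the two EDMA ICR registers in named registers
MOV32 regEDMA_2_ICR, 0x01C02470
MOV32 regEDMA_3_ICR, 0x01C02670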
// Initialize pointer to INTC registers
MOV32 regOffset, 0x00000000
// Clear SYS_EVT
MOV32 r31, 0x00000000
// Global enable of all host interrupts
LDI regVal.w0, 0x0001
SBCO regVal, CONST_PRUSSINTC, GER_OFFSET, 2
The PASM assembler supports several types of output files: binary, C array, HEX file, and others (including an annotated listing). An example of an output file in the form of a C array is shown below:
const unsigned int PRU0_Code =
The assembler places the code starting at address zero of the instruction RAM. This allows the C file to be included in the main program, such as one running on the ARM core, which can then copy the data from the array directly to the instruction RAM of the appropriate PRU core.
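The copy step described above can be sketched as follows. This is an illustrative C routine, not the authors' loader: the function name pru_load_code is hypothetical, the pointer iram is assumed to be an already mapped virtual address of the core's instruction RAM, and the caller is assumed to have halted the core first. The only figures taken from the text are the 4 KB instruction RAM size and the fact that the image starts at address zero.

```c
#include <stddef.h>
#include <stdint.h>

#define PRU_IRAM_SIZE_BYTES 4096u  /* 4 KB of instruction RAM per core (per the text) */

/* Copy a code image (for example, the C array emitted by PASM) to the start
 * of a PRU core's instruction RAM.
 * Assumptions: `iram` is the mapped instruction RAM of a halted core, and the
 * image consists of 32-bit words.
 * Returns 0 on success, -1 on bad arguments or if the image does not fit. */
static int pru_load_code(volatile uint32_t *iram,
                         const uint32_t *code, size_t code_words)
{
    size_t i;

    if (iram == NULL || code == NULL || code_words == 0)
        return -1;
    if (code_words > PRU_IRAM_SIZE_BYTES / sizeof(uint32_t))
        return -1;                 /* image larger than instruction RAM */

    for (i = 0; i < code_words; ++i)
        iram[i] = code[i];         /* word-wise copy, starting at address zero */

    return 0;
}
```

The size check matters because PASM does not know how much instruction RAM the target has; an oversized image would otherwise overwrite whatever follows the instruction RAM in the address map.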
For environments other than Code Composer Studio, TI suggests Notepad++ or TextPad for convenient code development with syntax highlighting. Ready-made setup files with syntax definitions for PRU code are provided.
The BSP for Windows CE 6.0 for the OMAP-L138 has no support for the PRU subsystem. Officially, a code loader driver exists only for Linux, and only with a specialized patch. That is why, for our projects, we developed a monolithic PRU driver with added support for hardware interrupts from the PRU subsystem. This driver is configured to deliver a continuous stream of data during a specific interval of time between interrupts.
Figure 8 shows the driver subroutines needed for interaction with the OS and user applications. The PRU_Init subroutine performs the primary initialization of the driver and translates the physical addresses of the memory allocated to the PRU subsystem into virtual ones for further use.
The PRU_Deinit subroutine releases resources when the driver is unloaded. The PRU_PreDeinit and PRU_PreClose subroutines are used as stubs. The rest of the subroutines serve the software/hardware interface. The PRU_Open subroutine returns the device handle that is subsequently passed to the DeviceIoControl function. PRU_Close, in turn, cleans up the context and is executed when CloseHandle is called on the device handle.
The PRU_PowerUp and PRU_PowerDown subroutines notify the PRU subsystem when the system leaves and enters the Suspend state, respectively. The PRU_IOControl subroutine contains the driver's entire functional implementation. It handles the following control codes:
IOCTL_PRU_REQ_INT returns the system interrupt number that belongs to a specific event number (3…10) of the ARM-core interrupt controller;
IOCTL_PRU_RELESE_INT releases the system interrupts allocated using IOCTL_PRU_REQ_INT;
IOCTL_PRU_INT_INIT links a system interrupt to an event handle obtained from the CreateEvent API function, so that the user application can then wait for the interrupt with WaitForSingleObject;
IOCTL_PRU_INT_DONE signals the core that the user application has processed the interrupt from PRU-core (InterruptDone analogue);
IOCTL_PRU_LOAD_CODE loads code into the instruction RAM of the PRU core (with a mandatory halt of the core). This operation also takes care of powering up the PRU subsystem through the PSC (Power and Sleep Controller);
IOCTL_PRU_MAKE_SINGLESTEP executes the program one instruction at a time (for debugging);
IOCTL_PRU_RUN starts the PRU core so that it freely executes the program in instruction RAM;
IOCTL_PRU_STOP stops PRU core;
IOCTL_PRU_WAIT_FOR_HALT waits for HALT command execution by PRU core;
IOCTL_PRU_SET_PC_STARTUP_POINT sets the program startup point;
IOCTL_PRU_SLEEP switches PRU core into the sleeping mode with the option for it to return to the normal mode on various events;
IOCTL_PRU_ENABLE_COUNTER switches the PRU core cycle counter on or off;
IOCTL_PRU_GET_PC_COUNTER returns the current address of the command under execution;
IOCTL_PRU_GET_CYCLE_COUNT returns the cycle counter value;
IOCTL_PRU_SET_CYCLE_COUNT sets a new value of the cycle counter;
IOCTL_PRU_GET_STALL_COUNT returns the number of cycles during which the core was stalled because no instruction was available;
IOCTL_PRU_WRITE_GP writes to general-purpose registers (for debugging);
IOCTL_PRU_READ_GP reads from general-purpose registers (for debugging);
IOCTL_PRU_GET_DRAM_PTR returns a pointer to the data RAM area of the PRU core, translated into the address space of the user application.
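How a few of these control codes fit together can be sketched with a small portable model. This is not the authors' driver and does not use the Windows CE headers: the numeric IOCTL values, the pru_ctx structure, and the pru_io_control function are all hypothetical stand-ins (a real driver would define the codes with CTL_CODE and operate on mapped PRU registers rather than a structure in memory). The sketch only illustrates the dispatch pattern and the "halt before load" rule mentioned in the list.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical numeric values; a real driver would build them with CTL_CODE. */
enum pru_ioctl {
    IOCTL_PRU_LOAD_CODE = 1,
    IOCTL_PRU_SET_PC_STARTUP_POINT,
    IOCTL_PRU_RUN,
    IOCTL_PRU_STOP,
    IOCTL_PRU_GET_PC_COUNTER
};

#define PRU_IRAM_WORDS 1024u       /* 4 KB of instruction RAM */

/* Hypothetical device context standing in for the mapped PRU hardware. */
struct pru_ctx {
    uint32_t iram[PRU_IRAM_WORDS]; /* model of the instruction RAM */
    uint32_t start_pc;             /* program startup point (word address) */
    uint32_t pc;                   /* model of the program counter */
    int      running;              /* nonzero while the core is running */
};

/* Dispatch one control code. Returns 1 on success and 0 on failure,
 * mirroring the BOOL convention of a stream driver's IOControl entry. */
static int pru_io_control(struct pru_ctx *ctx, uint32_t code,
                          const void *in, size_t in_len,
                          void *out, size_t out_len)
{
    uint32_t value;

    if (ctx == NULL)
        return 0;

    switch (code) {
    case IOCTL_PRU_LOAD_CODE:
        if (in == NULL || in_len == 0 || in_len % 4u != 0 ||
            in_len > sizeof(ctx->iram))
            return 0;
        ctx->running = 0;                    /* mandatory halt before loading */
        memcpy(ctx->iram, in, in_len);
        return 1;
    case IOCTL_PRU_SET_PC_STARTUP_POINT:
        if (in == NULL || in_len != sizeof(value))
            return 0;
        memcpy(&value, in, sizeof(value));
        if (value >= PRU_IRAM_WORDS)
            return 0;                        /* outside instruction RAM */
        ctx->start_pc = value;
        return 1;
    case IOCTL_PRU_RUN:
        ctx->pc = ctx->start_pc;
        ctx->running = 1;
        return 1;
    case IOCTL_PRU_STOP:
        ctx->running = 0;
        return 1;
    case IOCTL_PRU_GET_PC_COUNTER:
        if (out == NULL || out_len < sizeof(ctx->pc))
            return 0;
        memcpy(out, &ctx->pc, sizeof(ctx->pc));
        return 1;
    default:
        return 0;                            /* unknown control code */
    }
}
```

A typical call sequence from an application would be load, set the startup point, run, and later stop; validating every buffer length in the dispatcher keeps a faulty application from corrupting the instruction RAM.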
The driver API detailed above is the link between the device and the user application; it simplifies development and shortens the time needed to bring the final product to manufacturing. Several methods are used for debugging applications developed for the PRU subsystem. The one we prefer is reporting checkpoints through data RAM and general-purpose registers with the help of the following mechanisms:
- interrupts to the ARM/DSP cores;
- use of the fast input/output port (R30 register); and
- use of infinite loops and storage in a register to indicate the current address of the executed instruction.
However, in general the choice of the debugging method depends upon the application type and convenience of the method (several methods may be combined).
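The checkpoint approach can be illustrated from the ARM side with a short C sketch. The layout is an assumption made purely for illustration: the PRU program is assumed to write a checkpoint identifier into the first word of its data RAM, and dram is assumed to be the pointer returned by IOCTL_PRU_GET_DRAM_PTR. The function name and the word offset are hypothetical.

```c
#include <stdint.h>

/* Hypothetical convention: the PRU program stores its latest checkpoint
 * identifier in this word of its data RAM. */
#define PRU_DBG_CHECKPOINT_WORD 0u

/* Poll the checkpoint word until the expected identifier appears or the poll
 * budget is exhausted. `dram` is assumed to point at the PRU data RAM as
 * mapped into the application. Returns 1 if the checkpoint was reached,
 * 0 on timeout. */
static int pru_wait_checkpoint(const volatile uint32_t *dram,
                               uint32_t expected_id, uint32_t max_polls)
{
    uint32_t i;

    for (i = 0; i < max_polls; ++i) {
        if (dram[PRU_DBG_CHECKPOINT_WORD] == expected_id)
            return 1;
    }
    return 0;
}
```

The volatile qualifier is what makes this work: it forces a fresh read of the data RAM on every iteration, since the value is changed by the PRU core and not by the ARM program. In a real application the busy loop would normally be replaced by a sleep between polls or by a PRU-to-ARM interrupt.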
DSPLink communications/loading options
When developing electronic devices, there are a number of ways to lower the cost and simplify the whole system. But in the majority of cases, the use of multicore devices offers many advantages, especially for media-intensive embedded designs.
In the design we were considering, a multicore device gave us two or more independent cores that shared a common multilayer bus, multiple common DMA channels, and a common set of peripherals, along with a communications channel between the cores.
The DSP subsystem in the OMAP-L138 SoC includes a TMS320C674x processor core that can operate at 450 MHz, cache memory (instruction and data), L2 RAM, and integrated debugging tools (Advanced Event Triggering, AET). This computational capability allows the implementation of image-processing algorithms.
From the point of view of the Windows CE 6.0 OS, the DSP core itself is of no interest, since Windows CE 6.0 code cannot run on the DSP core. Instead, it is more beneficial to use the DSP core for separate work, such as complex tasks involving video decoding and encoding.
The DSP core can perform floating-point mathematical operations, which spares the developer much of the work of porting DSP algorithms generated by Matlab. Otherwise, these operations would have to be executed in software by the main ARM processor.
For example, on a core limited to fixed-point arithmetic, it would be necessary to first translate the floating-point algorithm into fixed-point code, debug it, and then, observing numerous restrictions, port it to the DSP core.
Beyond floating-point operations, applications such as video decoding and encoding give rise to two different tasks. The first is to encode and decode video and audio data streams. The second is to implement an independent algorithm on the DSP core. The latter task can itself be approached in two ways: using the DSP/BIOS OS from the SoC manufacturer (or some other OS), or developing a program that does not depend on an OS (bare-metal code).
Windows CE 6.0 allows implementation of either or both variants, especially since the TI OMAP provides a ready communications channel between the cores in the form of DSPLink, a library for interprocessor (ARM<->DSP) interaction with an existing API. However, DSPLink presupposes the use of TI's DSP/BIOS OS on the DSP side.
Figure 9 shows the structure of the interaction between the ARM and DSP subsystems using the DSPLink library. The OMAP-L138 SoC has no Inter-Processor Communication (IPC) module. Instead, the device's L2 DSP RAM, shared RAM, or mDDR/DDR2 RAM is used for information exchange between the cores.
Figure 9: Interaction of ARM and DSP subsystems with the use of DSPLink library
With the DSPLink library it is possible to load code into the DSP core and execute it, and, of course, to set up a channel for data exchange with the ARM core using the ready-made API. This API not only controls the execution state of the DSP core, but is also used to exchange messages between the cores (MSGQ), exchange streaming data (CHNL), and create circular buffers (RingIO).
The basic purpose of these mechanisms is to create an ecosystem for encoding and decoding audio and video data. TI provides codecs (in the form of binary libraries) for decoding and encoding audio data (AAC; MP3, decoding only; WMA), voice data (G.711, G.722, G.726), video data (H.264; MPEG2, decoding only; MPEG4), and images (JPEG).
To read Part 1, go to “The basics of the two platforms”
To read Part 3, go to “Using the Windows 6.0 Board Support Package.”
Artsiom Staliarou and Denis Mihaevich are founders of the AXONIM Devices Company, a Microsoft Embedded Partner and independent embedded electronics system design center and system integrator with 25 engineers based in Minsk, Belarus. Skype: axonim.by.
Artsiom has a degree in radiophysics and more than 10 years of experience in embedded system design based on ARM, Blackfin, TI DSP (C2x/C5x/C6x), and x86 devices, using Embedded Linux and Windows Embedded OSes.
Denis also has a degree in radiophysics and more than 12 years of experience in embedded system design and video analysis algorithm development, and has a certificate in optoelectronics.