Designing a low-cost, low-power multicore ARM-based AV player

Low power system design has become a mandatory requirement not only for hand-held mobile devices but also for automotive infotainment systems. Furthermore, automotive systems need to be able to carry the user experience in the home and office over to the car. This increasing demand for high computation, high quality and low power solutions has forced embedded computing to turn to multicore systems. Consequently, embedded solutions based on multicore platforms have become common for many applications such as gaming, video, and image-processing in areas such as mobile, automotive, medical and industrial applications.

The challenge is to use a multicore-based system together with available media frameworks, which offer scalability, open source access and portability, to meet the requirements of low-power, high-performance embedded applications. With the advent of powerful software and hardware programmable system-on-chip (SoC) FPGA devices, embedded system designers are able to design solutions to an exact form, function and performance fit for the requirements of the customer. These solutions are optimal, efficient and cost-effective in addressing end customer requirements. The Xilinx Zynq SoC family of FPGAs is one such device family.

This article describes the use of a low-cost, low-density Zynq FPGA in creating a computational platform for implementing infotainment systems for passenger vehicles, such as cars, buses, trains, airplanes and ships. Other applications for this kind of platform include digital signage and information displays in private and public venues such as hotels, hospitals, gas pumps, or kiosks as well as digital picture frames for consumer markets.

Overview
In-vehicle infotainment faces a dual requirement: matching the home or office user experience while meeting the automotive industry’s energy efficiency requirements. In our case, the specific requirement was to build a very low cost, low power, 720p 30 fps AV player with audio-video sync that could interface with the customer’s hardware block implemented in programmable fabric. This objective was to be achieved by developing tightly coupled multicore software with real-time hardware acceleration, with an eye on a vastly lower BOM, lower NRE costs, lower design risk and, most important, faster time-to-market.

We broke down the requirements for the Atria Logic AL-AVPLR-IPC AV player into a file reader, a de-multiplexer, an H.264 Baseline Profile HD decoder with color space converter, and an AAC-LC stereo decoder. Also included is an AV player application with a built-in GUI for basic player operations such as Play, Pause, Stop and the Fast-Forward trick mode. OS support is via Ubuntu LTS, as it provides multicore support with efficient core utilization. The AV player application is fully Linux GTK based, while the decoder itself is fully implemented in the Linux GStreamer open source multimedia framework. A block diagram of the player is shown below in Figure 1.

Figure 1: Atria Logic AV Player block diagram (Source: Atria Logic Inc)  

We zeroed in on the Xilinx Zynq-7000 device family’s Z-7010 as the most suitable option for this implementation, meeting all the requirements. This ARM powered programmable SoC provides maximum CPU performance with the best thermal performance. At the same time, the device provides enough programmable fabric for requirements that need accelerator engines, ensuring sufficient performance even on this low power platform. One more reason for choosing this device was the readily available, small form factor Zybo development board, which expedited our development efforts and enabled a quick turnaround.

In addition to programmable logic for custom implementation of glue logic, RAM and DSP functions, the Zynq architecture includes a dual core ARM Cortex-A9 CPU with Neon DSP engines, a complete array of serial I/O, USB and PCIe interfaces, encryption/decryption engine, GbE and memory controllers. Low power, integrated CAN bus interfaces and extended temperature variants make the Zynq family of fully HW and SW programmable SoCs an excellent fit for low power, low cost automotive infotainment applications.

Video decoding is handled up to HD resolutions of 1280×720 at 30 frames/sec, and stereo audio decoding at 48 kHz. Audio and video are kept in sync so that lip sync issues are avoided.

The challenge was to leave as much programmable logic as possible available for implementation of other functionality, while keeping power dissipation low. This meant that the implementation needed to take full advantage of the available ARM cores and Neon DSPs, while optimizing the firmware to run as efficiently as possible. This was achieved by taking full advantage of the multi-threading and symmetric multi-processing (SMP) capabilities in Ubuntu.

Multi-Threaded Dual Core Player
The player specifies audio and video as parts of pipelines running on separate threads for parallel execution on the two ARM Cortex-A9 cores. The AAC-LC audio decoder load is much smaller than the H.264 video decoder load. Therefore, the video decoder is divided into two threads to take full advantage of both cores, which run at 667MHz each. In Figure 2, Video Thread2 spawns a new thread, Thread4, and these two video threads are executed on two different cores.

Figure 2: Multi-threaded video and audio decode execution (Source: Atria Logic Inc)  
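As a rough illustration of the split shown in Figure 2, the following C sketch divides per-frame work into two bands and runs one band on a newly spawned POSIX thread while the spawning thread handles the other. The article does not say how the decoder work is actually partitioned, so the row split, the band_job structure and the dummy pixel inversion are assumptions for illustration only, not the player's code.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical work item: one band of rows in a frame buffer. */
typedef struct {
    uint8_t *pixels;     /* start of the frame buffer */
    size_t   stride;     /* bytes per row */
    size_t   row_begin;  /* first row of the band */
    size_t   row_end;    /* one past the last row of the band */
} band_job;

/* Placeholder per-row work standing in for a real decode stage. */
static void *process_band(void *arg)
{
    band_job *job = arg;
    for (size_t r = job->row_begin; r < job->row_end; ++r) {
        uint8_t *row = job->pixels + r * job->stride;
        for (size_t x = 0; x < job->stride; ++x)
            row[x] = (uint8_t)(255 - row[x]);   /* dummy work: invert */
    }
    return NULL;
}

/* Process the bottom band on a spawned helper thread and the top band
 * on the calling thread, then wait for the helper.
 * Returns 0 on success or a pthread error code. */
int process_frame_two_threads(uint8_t *pixels, size_t stride, size_t rows)
{
    band_job top    = { pixels, stride, 0,        rows / 2 };
    band_job bottom = { pixels, stride, rows / 2, rows     };
    pthread_t helper;

    int rc = pthread_create(&helper, NULL, process_band, &bottom);
    if (rc != 0)
        return rc;

    process_band(&top);                 /* runs concurrently with helper */
    return pthread_join(helper, NULL);  /* wait until both bands are done */
}
```

The sketch sets no CPU affinity. Under an SMP kernel the scheduler is free to place the two threads on different cores, which is the behavior the player relies on.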

The AV player GUI application program starts on a single thread, Thread1. Compressed audio data and video data are queued separately in Thread2 and Thread3, while staying within each thread’s stack limits. De-multiplexing the audio and the video data in these two threads decouples the processing of the sink data into separate audio and video processing threads.

The Ubuntu OS needs to estimate the maximum stack space that will be required for each thread. Multi-queue elements, or separate queue elements for each thread or branch, are used so that tasks execute in parallel or in a pipeline. All necessary OS mutex locks are initialized. Spawning a new video thread, or briefly holding the audio thread on a particular core, is done with the correct locking and unlocking of the memory sub-system stack space. On return of a thread, the process gets destroyed. However, bare-bones Linux threads cannot automatically de-allocate their local stack memory. Thus, as part of thread clean-up, the parent process reclaims the memory by adding the stack space to the malloc() free list.

The code memory used for the AV player is 16kB, while the data memory is limited to 4kB. Overall system memory used for the complete framework is limited to 512MB. This can be optimized to a great extent by keeping only the GStreamer framework features that are used and removing the unused ones.
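The thread setup described above (an explicit stack budget, mutex-protected queues between pipeline stages, and clean-up when a thread returns) can be sketched with POSIX threads. This is a minimal sketch, not the GStreamer or player implementation: the names packet_queue and run_pipeline and the values QUEUE_CAPACITY and WORKER_STACK_BYTES are hypothetical, and plain integers stand in for compressed packets.

```c
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAPACITY     8              /* hypothetical queue depth */
#define WORKER_STACK_BYTES (256 * 1024)   /* hypothetical stack budget */

typedef struct {
    int items[QUEUE_CAPACITY];
    size_t head, count;
    int closed;                           /* set once the producer is done */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} packet_queue;

static void queue_init(packet_queue *q)
{
    q->head = q->count = 0;
    q->closed = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Block while the queue is full, then append one item. */
static void queue_push(packet_queue *q, int item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAPACITY)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = item;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Mark the end of the stream and wake any waiting consumer. */
static void queue_close(packet_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Returns 1 with an item, or 0 once the queue is closed and drained. */
static int queue_pop(packet_queue *q, int *item)
{
    int got = 0;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count > 0) {
        *item = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAPACITY;
        q->count--;
        got = 1;
        pthread_cond_signal(&q->not_full);
    }
    pthread_mutex_unlock(&q->lock);
    return got;
}

typedef struct { packet_queue *q; long sum; } consumer_ctx;

/* Stand-in for a decode thread: drains the queue and sums the items. */
static void *consumer_main(void *arg)
{
    consumer_ctx *ctx = arg;
    int item;
    while (queue_pop(ctx->q, &item))
        ctx->sum += item;
    return NULL;
}

/* Push 1..n through the queue to a consumer thread created with an
 * explicit stack size. Returns the consumer's sum, or -1 on error. */
long run_pipeline(int n)
{
    packet_queue q;
    consumer_ctx ctx = { &q, 0 };
    pthread_attr_t attr;
    pthread_t tid;

    queue_init(&q);
    if (pthread_attr_init(&attr) != 0)
        return -1;
    if (pthread_attr_setstacksize(&attr, WORKER_STACK_BYTES) != 0 ||
        pthread_create(&tid, &attr, consumer_main, &ctx) != 0) {
        pthread_attr_destroy(&attr);
        return -1;
    }
    pthread_attr_destroy(&attr);

    for (int i = 1; i <= n; ++i)
        queue_push(&q, i);
    queue_close(&q);

    /* Joining lets the thread library release the worker's resources. */
    if (pthread_join(tid, NULL) != 0)
        return -1;

    pthread_cond_destroy(&q.not_full);
    pthread_cond_destroy(&q.not_empty);
    pthread_mutex_destroy(&q.lock);
    return ctx.sum;
}
```

Calling run_pipeline(100) pushes the integers 1 to 100 through the bounded queue and returns their sum as computed by the consumer thread. The bounded capacity makes the producer wait when the consumer falls behind, which is the back-pressure a queue between a de-multiplexer and a decoder would need.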

The multi-threading approach accomplishes the audio and video decoding while utilizing both cores optimally. GStreamer was initially running the player inefficiently, with CPU1 utilization of only 66% and CPU2 utilization of only 81%. Thread pipelining achieves nearly 100% core utilization, an improvement of 29-34%.

Efficiency and Performance with Neon DSP
Nearly 100% core utilization, however, was not sufficient to meet the performance requirements of the player. One option was to implement color space conversion in the programmable logic portion of the SoC. However, that would reduce the amount of programmable logic available for other system logic that might be needed. Furthermore, it would not yield the proper quality using fixed point logic.

Therefore, we opted to implement the color space conversion module on one of the Neon DSP engines. The Cortex-A9 Neon MPE extends the CPU core functionality with the ARMv7 advanced SIMD and vector floating point instruction sets. Taking full advantage of the Neon DSP’s ability to perform floating point computations with 32-bit, 64-bit and 128-bit quad registers significantly accelerated the YUV420 to RGB444 color space conversions. By utilizing parallel vector data arrangements for the SIMD operations, we managed to limit utilization to only one of the two Neon DSP engines.
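For reference, the per-pixel arithmetic that such a routine vectorizes can be written as a portable scalar C sketch. It is not the Neon routine from the article: it assumes planar YUV 4:2:0 input (I420 layout), packed 8-bit RGB output and full-range BT.601 coefficients, none of which the article specifies, and the function name i420_to_rgb24 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Round to nearest and clamp a float to the 0..255 range. */
static uint8_t clamp_u8(float v)
{
    if (v < 0.0f)   return 0;
    if (v > 255.0f) return 255;
    return (uint8_t)(v + 0.5f);
}

/* Convert one planar YUV 4:2:0 frame to packed RGB (3 bytes per pixel).
 * Assumed layout: a full-resolution Y plane, followed by U and V planes
 * subsampled by two in both directions. Coefficients are full-range
 * BT.601, chosen for illustration only. */
void i420_to_rgb24(const uint8_t *y_plane, const uint8_t *u_plane,
                   const uint8_t *v_plane, size_t width, size_t height,
                   uint8_t *rgb)
{
    size_t chroma_stride = (width + 1) / 2;

    for (size_t row = 0; row < height; ++row) {
        for (size_t col = 0; col < width; ++col) {
            /* Each chroma sample is shared by a 2x2 block of luma. */
            size_t c = (row / 2) * chroma_stride + col / 2;
            float y = (float)y_plane[row * width + col];
            float u = (float)u_plane[c] - 128.0f;
            float v = (float)v_plane[c] - 128.0f;

            uint8_t *out = rgb + 3 * (row * width + col);
            out[0] = clamp_u8(y + 1.402f * v);                  /* R */
            out[1] = clamp_u8(y - 0.344136f * u - 0.714136f * v); /* G */
            out[2] = clamp_u8(y + 1.772f * u);                  /* B */
        }
    }
}
```

A SIMD version would apply the same multiply-accumulate steps to several pixels per instruction, loading the Y, U and V samples into vector registers instead of processing one pixel at a time.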

The listing below shows the color space conversion routine running on a Neon DSP engine. The Neon DSP also enabled parallel loads and stores of multiple registers, with on-the-fly data de-interleaving.

Listing: YUV to RGB color space conversion routine on a Neon DSP engine (Source: Atria Logic Inc)

Performance Results
The Neon DSP code optimization of the color space conversion matrix routines yielded a speedup of 1.4x to 1.64x. Table 1 lists the detailed results for streams with different bitrates, sizes and complexities.

Video stream | Bit rate | Resolution | Decode rate, dual core at 650MHz each (fps) | Frame rate, 2x A9 + Neon (fps) | Neon optimization factor
H264_720p_BP_1mbps.264 | 1 Mbps | 1280×720 | 28 | 42 | 1.5
H264_720p_BP_4mbps.264 | 4 Mbps | 1280×720 | 22 | 32 | 1.4
H264_720p_BP_6mbps.264 | 6 Mbps | 1280×720 | 21 | 33 | 1.57
H264_1080p_BP_2mbps.264 | 2 Mbps | 1280×720 | 14 | 20 | 1.64

Table 1: Performance results for H.264+AAC Player with Neon DSP color space conversion code (Source: Atria Logic Inc)

Additional opportunities for load reduction include shifting the de-blocking filter execution to one of the Neon DSP engines.

On a Xilinx Z-7010 SoC FPGA with the ARM Cortex-A9 cores running at 667MHz, the implementation achieves decoding of 1280×720 video at 20 frames/sec and AAC-LC stereo audio at nearly 100% utilization. In Figure 3, the CPU1 core shows 99.3% utilization, while CPU2 shows 100% utilization.

Figure 3: AV Player multicore loading (Source: Atria Logic Inc)  

Automotive Functional Safety & Quality
The AV Player implementation on the Zynq Z-7010 fully conforms to the MISRA 2000 C coding guidelines and fully passes the QAC static analysis check. The implementation was also done in compliance with the ISO 26262 ASIL B safety guidelines.

Conclusions
With the ever increasing need for low power, low cost solutions across a wide spectrum of applications, programmable SoC computational platforms offer a unique value proposition. Full value comes from making the fullest use of the available CPU cores and DSP engines wherever needed (for example, for image color space and dynamic range quality, audio post-processing enhancement, or vision computing quality and accuracy), and of the programmable fabric for hardware acceleration or performance improvement (for video, vision, or both). This, augmented with the latest multicore processing techniques, can provide the foundation for designs targeting such applications.

One such solution for automotive infotainment, the Atria Logic AL-AVPLR-IPC AV player, achieves an optimal balance of performance and density on the Xilinx Zynq Z-7010, with all of the device’s programmable logic remaining available for integration of other board logic, minimizing BOM cost and increasing system reliability. This approach offers an effective solution for implementing AV playback for a host of applications, such as infotainment systems for passenger vehicles, digital signage and information displays, and digital picture frames that include video capabilities.
