This “Product How-To” article focuses on how to use a certain product in an embedded system and is written by a company representative.
The ARM Cortex-A8 is ARM's most advanced high-performance, low-power processor. Based on the ARMv7 architecture, it suits a variety of mobile and consumer applications, including mobile phones, set-top boxes (STBs), game consoles and car navigation. NEON technology, a core technology of the Cortex-A8 processor, has the flexibility to implement multiple combinations of video encode/decode, 3D graphics, speech processing, audio decoding, image processing and baseband processing.
NEON technology is a 64/128-bit single instruction, multiple data (SIMD) instruction set. It supports 8-, 16-, 32- and 64-bit integer and single-precision floating-point SIMD operations for audio, video, image and other data processing. NEON has its own registers and pipeline, which are independent of the ARM integer pipeline.
Using NEON's multimedia features, the Cortex-A8 processor can decode MPEG-4 VGA video (including the de-blocking filter, YUV-to-RGB conversion and other operations) at 30fps when clocked at 275MHz. It can also run an MP3 decoder at a processor frequency below 10MHz.
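To make the idea of lanes concrete, the following C sketch models a 64-bit NEON D register as four signed 16-bit lanes and adds two such registers lane by lane, the way a single SIMD instruction would. The type `d_reg_s16` and the functions `lane_count` and `lane_add_s16` are illustrative names, not part of any ARM API, and the loop only models the behavior; it is not NEON code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit D register viewed as four signed 16-bit lanes. */
typedef struct { int16_t lane[4]; } d_reg_s16;

/* Number of elements of a given width that fit in a register of a given width. */
static int lane_count(int register_bits, int element_bits)
{
    assert(element_bits > 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}

/* Lane-wise add: each lane is independent, so no carry crosses a lane boundary. */
static d_reg_s16 lane_add_s16(d_reg_s16 a, d_reg_s16 b)
{
    d_reg_s16 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = (int16_t)((uint16_t)a.lane[i] + (uint16_t)b.lane[i]);
    return r;
}
```

A 128-bit Q register viewed as 8-bit elements would hold `lane_count(128, 8)` lanes, and one instruction would update all of them.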
Cortex-A8 NEON basics
The Cortex-A8 processor's NEON media processing engine pipeline starts at the end of the main integer pipeline. As a result, all exceptions and branch mispredictions are resolved before instructions reach it. More importantly, there is a zero load-use penalty for data in the Level-1 cache.
The ARM integer unit generates the addresses for NEON loads and stores as they pass through the pipeline, thus allowing data to be fetched from the Level-1 cache before it is required by a NEON data processing operation.
Deep instruction and load-data buffering between the NEON engine, the ARM integer unit and the memory system allows the latency of Level-2 accesses to be hidden for streamed data. A store buffer prevents NEON stores from blocking the pipeline and detects address collisions with ARM integer unit accesses and NEON loads.
The NEON unit is decoupled from the main ARM integer pipeline by the NEON instruction queue (NIQ). The ARM Instruction Execute Unit can issue up to two valid instructions to the NEON unit each clock cycle. NEON has 128bit wide load and store paths to the Level-1 and Level-2 cache, and supports streaming from both.
The NEON engine has its own 10-stage pipeline that begins at the end of the ARM integer pipeline. Since all mispredicts and exceptions have been resolved in the ARM integer unit, an instruction that has been issued to the NEON engine cannot generate exceptions and must run to completion. NEON instructions are issued and retired in order. A data-processing instruction is either a NEON integer instruction or a NEON floating-point instruction.
The Cortex-A8 NEON unit does not issue two data-processing instructions in parallel. This avoids the area overhead of duplicating the data-processing functional blocks, as well as the timing-critical paths and complexity associated with muxing the read and write register ports.
The NEON integer data path consists of three pipelines: an integer multiply/accumulate pipeline (MAC), an integer Shift pipeline and an integer ALU pipeline. A load-store/permute pipeline is responsible for all NEON load/stores, data transfers to/from the integer unit, and data permute operations such as interleave and de-interleave. The NEON floating-point (NFP) data path has two main pipelines: a multiply pipeline and an add pipeline.
Compressing audio with NEON
Today, WMA, MP3 and AAC are the mainstream audio compression algorithms (Figure 1, below). Applications and experiments in audio decoding and playback show that these algorithms are computationally complex and consume many clock cycles.
|Figure 1: Shown is the flow diagram of an MP3 decoder.|
This matters especially in audio/video decoding, where the video decoder takes up most of the processor's resources and leaves little for audio decoding. It is therefore essential to improve the efficiency of audio decoding in such applications.
MP3 is one of the most common audio compression algorithms and is used in audio files and compressed audio/video streams, so MP3 decoding is taken as the example here to describe how NEON technology applies to audio processing. The complexity of the MP3 decoder modules is listed in Table 1 below.
|Table 1: A list of MP3 decoder modules.|
The Huffman decode, IMDCT and sub-band synthesis filter modules take up most of the computing time, about 90 percent of the total. Hence, if the computing time of these three parts is reduced, the efficiency of the whole MP3 decoder improves significantly.
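As a rough guide to how much the whole decoder can gain, Amdahl's law relates the overall speedup S to the fraction f of run time being optimized and the local speedup s achieved on that fraction. The value f = 0.9 comes from the figure above; the local speedup s = 2 is a hypothetical number chosen for illustration, not a measurement from this article.

```latex
S = \frac{1}{(1 - f) + \dfrac{f}{s}}, \qquad
f = 0.9,\ s = 2 \;\Rightarrow\; S = \frac{1}{0.1 + 0.45} \approx 1.82
```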
The sub-band synthesis filter accounts for about 50 percent of the computation in the MP3 decoder algorithm, so it is analyzed first. The filter consists of a matrix operation followed by a window filter that produces the PCM output. The formula of the matrix operation is:
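For reference, the MPEG-1 audio standard (ISO/IEC 11172-3) commonly writes the matrixing step of the synthesis filterbank as shown here, where S_k are the 32 sub-band samples and V_i are the 64 intermediate values. The notation follows the standard and may differ from the article's original figure.

```latex
V_i = \sum_{k=0}^{31} \cos\!\left[\frac{(16 + i)(2k + 1)\pi}{64}\right] S_k,
\qquad i = 0, 1, \ldots, 63
```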
The algorithm consists mainly of multiply-add operations. The ARM assembly code can be summarized as:
Since the ARM multiply instruction (MUL) has to use pipeline 0, statements (1) and (2) cannot be issued in parallel. The inputs of statement (3) are the outputs of statements (1) and (2).
So the three statements must execute one after another. Furthermore, each MUL instruction occupies two cycles, so two multiplies and one add need five cycles (2 + 2 + 1) when running on the ARM integer unit.
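The three statements and their cost can be sketched in C as follows. The ARM mnemonics in the comments are an assumed reconstruction based on the register names the text uses later (r2, r3, r5, r6); the destination registers and the names `mul_add_scalar` and `scalar_cycles` are illustrative. The cycle constants restate the text: two cycles per MUL and, by implication, one for the ADD.

```c
#include <assert.h>
#include <stdint.h>

enum { MUL_CYCLES = 2, ADD_CYCLES = 1 };  /* per-instruction cost taken from the text */

/* Cycles for a strictly serialized sequence of n_mul multiplies and n_add adds. */
static int scalar_cycles(int n_mul, int n_add)
{
    assert(n_mul >= 0 && n_add >= 0);
    return n_mul * MUL_CYCLES + n_add * ADD_CYCLES;
}

/* One multiply-add term pair, computed the way the scalar ARM code would. */
static int64_t mul_add_scalar(int32_t r2, int32_t r3, int32_t r5, int32_t r6)
{
    int64_t r1 = (int64_t)r2 * r3;   /* (1) MUL r1, r2, r3   (assumed form) */
    int64_t r4 = (int64_t)r5 * r6;   /* (2) MUL r4, r5, r6   (assumed form) */
    return r1 + r4;                  /* (3) ADD r0, r1, r4   (assumed form) */
}
```

Here `scalar_cycles(2, 1)` gives the five cycles quoted for two multiplies and one add.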
In the sub-band synthesis filter, multiply-add is the main operation, and each one consumes many cycles. NEON can help here. The NEON VMUL instruction completes a vector multiplication, equivalent to two multiply operations, in one cycle. The multiplications above are converted into NEON code:
VMUL D1, D2, D3
D1 to D3 are independent NEON vector registers. D2 holds the values of r2 and r5, D3 holds the values of r3 and r6, and the result is stored in D1.
This single NEON instruction performs two multiplications. Likewise, the NEON VMLA instruction is equivalent to two multiply-add operations. NEON optimization therefore reduces the time spent on multiply-add operations and the computing time of the module.
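In C, the same two-lane multiply and multiply-accumulate can be expressed with the intrinsics from `arm_neon.h`. The sketch below assumes 32-bit integer lanes, since the article does not state the element type. It uses `vmul_s32` and `vmla_s32` when the compiler targets NEON and falls back to a plain loop that models the same lane-wise behavior elsewhere. The function names `mul2` and `mla2` are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* out[i] = a[i] * b[i] for two 32-bit lanes (one VMUL.I32 Dd, Dn, Dm). */
static void mul2(int32_t out[2], const int32_t a[2], const int32_t b[2])
{
    assert(out && a && b);
#if HAVE_NEON
    vst1_s32(out, vmul_s32(vld1_s32(a), vld1_s32(b)));
#else
    for (int i = 0; i < 2; ++i)   /* portable model of the two lanes */
        out[i] = a[i] * b[i];
#endif
}

/* acc[i] += a[i] * b[i] for two 32-bit lanes (one VMLA.I32 Dd, Dn, Dm). */
static void mla2(int32_t acc[2], const int32_t a[2], const int32_t b[2])
{
    assert(acc && a && b);
#if HAVE_NEON
    vst1_s32(acc, vmla_s32(vld1_s32(acc), vld1_s32(a), vld1_s32(b)));
#else
    for (int i = 0; i < 2; ++i)   /* portable model of the two lanes */
        acc[i] += a[i] * b[i];
#endif
}
```

With the first array holding {r2, r5} and the second {r3, r6}, `mul2` yields both products at once; adding the two result lanes then gives the same sum as the scalar sequence.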
IMDCT is the second most time-consuming module in the MP3 decoder, at about 25 percent of the total. The IMDCT operates on 32 frequency sub-bands. Each sub-band contains one long window or three sequential short windows. A long window consists of 18 frequency lines and a short window consists of six. The formula of the IMDCT is:
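For reference, the MPEG-1 audio standard (ISO/IEC 11172-3) commonly writes the IMDCT as shown here, where X_k are the n/2 frequency lines of a window and x_i are the n output samples. The notation follows the standard and may differ from the article's original figure.

```latex
x_i = \sum_{k=0}^{\frac{n}{2} - 1} X_k
      \cos\!\left[\frac{\pi}{2n}\left(2i + 1 + \frac{n}{2}\right)(2k + 1)\right],
\qquad i = 0, 1, \ldots, n - 1
```

Here n = 36 for a long window (18 frequency lines) and n = 12 for a short window (six frequency lines).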
After algorithm-level optimization, the IMDCT is reduced to a form that consists mainly of multiply-add operations. As with the sub-band synthesis filter, the NEON VMUL and VMLA instructions can efficiently replace the multiply-add instructions of the ARM code, which reduces the computing time of the IMDCT module by a large margin.
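The multiply-add core that remains after such an optimization is essentially a dot product of precomputed coefficients and input values. The C sketch below shows how that loop can be arranged into two independent accumulator lanes, which is the shape a two-lane multiply-accumulate needs. It is a portable model, not the article's code: the name `dot_two_lanes` is illustrative, and the 64-bit accumulators are an assumption made to keep the sketch free of overflow (real NEON code would pick VMLA or the widening VMLAL to suit its fixed-point format).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of n coefficients and samples, accumulated in two independent
   lanes the way a two-lane multiply-accumulate would. n must be even. */
static int64_t dot_two_lanes(const int32_t *coef, const int32_t *x, size_t n)
{
    assert(coef && x && n % 2 == 0);
    int64_t acc[2] = {0, 0};
    for (size_t k = 0; k < n; k += 2) {
        acc[0] += (int64_t)coef[k]     * x[k];       /* lane 0: even indices */
        acc[1] += (int64_t)coef[k + 1] * x[k + 1];   /* lane 1: odd indices  */
    }
    return acc[0] + acc[1];   /* final horizontal add of the two lanes */
}
```

Each loop iteration corresponds to one vector multiply-accumulate, so the loop runs n/2 times instead of n.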
Other common audio decoders, such as WMA, AAC and OGG, contain a large number of discrete cosine transforms, so the same NEON instruction optimization can be applied to them.
Furthermore, the NEON instruction set provides a range of instructions optimized for media processing, such as saturating vector operations and vector load/store. Used properly, they yield a very significant optimization effect.
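Saturating arithmetic matters in audio because a sum that exceeds the sample range should clip rather than wrap around to the opposite sign. The C sketch below models a saturating add on signed 16-bit lanes, the behavior NEON provides in VQADD.S16. The function names are illustrative, and the loop is a portable model rather than NEON code.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating add of one signed 16-bit lane: clamps to the range instead of wrapping. */
static int16_t sat_add_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Applies the saturating add to n lanes, modelling a vector saturating add. */
static void sat_add_vec_s16(int16_t *out, const int16_t *a, const int16_t *b, int n)
{
    assert(out && a && b && n >= 0);
    for (int i = 0; i < n; ++i)
        out[i] = sat_add_s16(a[i], b[i]);
}
```

Without saturation, mixing two loud samples such as 30000 and 10000 would wrap to a negative value and produce an audible click; with it, the result clips at 32767.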
Implementing NEON on the i.MX51
The i.MX51 multimedia application processor is Freescale's high-performance, low-power processor. It is based on the ARM Cortex-A8 architecture and can run at up to 1GHz, which allows it to be used in a wide variety of applications such as PMPs, PNDs and PDAs.
Since the i.MX51 is designed for multimedia applications, audio processing is one of its essential applications. Optimizing audio processing on the i.MX51 has these advantages:
1) The load on the processor is reduced, leaving more processing capacity available; and,
2) In audio playback mode, the ARM core can remain dormant longer, for low-power playback.
Meeting these demands on the i.MX51 is feasible for two reasons. First, the i.MX51 processor is based on the ARM Cortex-A8 architecture and supports NEON technology.
Therefore, the NEON instruction-level optimization described above can be applied to the audio decoder to reduce computing time. Second, although the NEON processing engine raises the chip's power consumption, the i.MX51 processor addresses this issue with a dedicated NEON control module.
The module works as follows: when NEON has not been used for n cycles (n is configurable), an interrupt request is issued. The software can then switch the NEON processing engine on or off according to how NEON is being used.
The optimization of the audio decoder is not repeated here. The flow diagram in Figure 2, below, focuses on the use of the NEON processing engine.
|Figure 2: Shown is a flow diagram of NEON usage on i.MX51.|
When the system executes a NEON instruction while the NEON processing engine is disabled, the instruction raises an undefined instruction exception. The exception handler first determines whether the NEON processing engine has been enabled.
If NEON is not enabled, the software turns on NEON power and then enables NEON. The ARM core then executes the NEON instruction. When the NEON instruction completes, the NEON processing engine goes into the IDLE state, and the ARM core continues executing the ARM code that follows.
At the same time, the NEON Monitor on the i.MX51 processor constantly monitors the working state of the NEON processing engine.
Software developers can write the idle-waiting time n to the NEON Monitor registers. When NEON has not been used for n cycles, the NEON Monitor issues an IRQ. The software interrupt handler then disables NEON and turns off NEON power to save the power consumed by the NEON processing engine.
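The control flow described above can be sketched as a small software model. Everything in it is hypothetical: the structure `neon_ctrl`, the function names, and the way power and enable state are tracked stand in for i.MX51 registers and handlers that the article does not name. The sketch models the sequence of events only and touches no hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the NEON power-control state. */
typedef struct {
    bool     powered;      /* NEON power domain is on                */
    bool     enabled;      /* NEON instructions may execute          */
    uint32_t idle_cycles;  /* cycles since the last NEON instruction */
    uint32_t idle_limit;   /* the configurable idle-waiting time n   */
} neon_ctrl;

static void neon_ctrl_init(neon_ctrl *c, uint32_t idle_limit)
{
    assert(c && idle_limit > 0);
    c->powered = false;
    c->enabled = false;
    c->idle_cycles = 0;
    c->idle_limit = idle_limit;
}

/* Undefined-instruction handler path: power up first, then enable. */
static void on_undefined_neon_instruction(neon_ctrl *c)
{
    if (!c->enabled) {
        c->powered = true;
        c->enabled = true;
    }
}

/* Models one NEON instruction. Returns true if the trap was taken first. */
static bool execute_neon_instruction(neon_ctrl *c)
{
    bool trapped = !c->enabled;
    if (trapped)
        on_undefined_neon_instruction(c);
    c->idle_cycles = 0;   /* NEON was just used, so the idle count restarts */
    return trapped;
}

/* Models the monitor: after idle_limit idle cycles the IRQ fires and the
   handler disables NEON and removes power. Returns true if the IRQ fired. */
static bool neon_monitor_tick(neon_ctrl *c, uint32_t cycles)
{
    if (!c->enabled)
        return false;
    c->idle_cycles += cycles;
    if (c->idle_cycles < c->idle_limit)
        return false;
    c->enabled = false;
    c->powered = false;
    c->idle_cycles = 0;
    return true;
}
```

In this model only the first NEON instruction after an idle period pays the cost of the trap; later instructions run directly until the monitor times out again. A larger n means fewer traps but more time spent with NEON powered while idle.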
After optimization with NEON instructions, the computing time of each decoder declined significantly. The data is shown in Table 2 below.
|Table 2: The NEON technology can improve algorithm efficiency, reduce computing time and realize high performance, low power consumption system in audio application.|
Audio decoding computing time decreases significantly. The biggest benefit to the system is that, in audio playback applications, the ARM core can spend more time in a dormant state, which lowers system power consumption. The power consumption of audio decoding was evaluated on the i.MX51 development board running the eCos OS, with the ARM core running at 200MHz and a core voltage of 0.7V.
The last column of Table 2 lists the ARM core's power. With NEON technology, the power consumption of the MP3 and WMA decoders declined. In the WMA decoder in particular, the ARM core's power consumption dropped from 4.38mW to 3.47mW, a reduction of about 21 percent.
Yu Xu is an Embedded Software Engineer at Freescale Semiconductor Inc.