Oz Levia of Improv Systems describes the design and use of the VLIW Jazz configurable processor.
Consumer and communication products typically support applications
that require extensive and intensive computation and transformation
of data as an integral part of the product. Examples of such
applications include video decoding and encoding, voice compression
and decompression, image processing and compression, audio playback,
and communication channel decoding and encoding.
Each such product demands a solution that is efficient in area and
power and is optimized for application specific requirements. The
specific requirement is a function of the application and the product
it is destined for. For example, a video product for handheld
devices would require low power and low cost with lower image
quality. A video product for broadcast markets would require very
high performance and very high image quality.
In addition, many products today require support for multiple
applications or multiple formats of the same application. Video
products that today support MPEG2 would need to add support for MPEG4
and H.26L in addition to MSWM and other video standards. Communication
devices for WLAN will need to support different versions of the
802.11 standard and possibly other wireless standard formats.
At the core of such products is a digital signal processing (DSP)
unit that is the workhorse of compute-intensive applications. To be
effective for such demanding products and applications, a DSP should
be programmable, configurable, and scalable. The configurable Jazz
DSP is a programmable core that was designed by Improv. It is used in
consumer and communication products that require high performance,
low power, and a flexible architecture that can be adapted to the
specific needs of an application.
Before we describe the architecture and specific features of the
Jazz DSP, we would like to motivate the need for a programmable,
scalable, and configurable DSP. The overall goal is simple: to
provide in a single architecture framework support for many
applications with different requirements while affording the use of
a high-level language.
Useful definitions
A few definitions are useful: a programmable DSP is a DSP that can
process instructions and does not execute a fixed function.
Instructions for a DSP can be produced by a compiler from high level
language (C, Java, C++) or can be written by hand (assembly code). A
configurable DSP is a DSP that can be modified in one or more ways
to fit the needs of a product or an application. Configurable
DSPs are modified before they are implemented in silicon. Finally, a
scalable DSP is a DSP that can 'grow' or 'shrink' to support
different requirements in the products. In some cases, that
capability may be extended to support multiple DSPs.
Fig 1: Multiple Jazz DSP processors in a platform
The most obvious and immediate application of a configurable
processor is to provide support for optimization of application run
time. Configuration of the DSP processor can enhance performance in
three distinct ways.
- Scale: By increasing and decreasing the available resources in
the DSP processor, an application can experience different levels
of performance. Scalability can be in the form of more resources
in a single DSP, or by using multiple DSP cores. However, to take
advantage of such scalability the DSP must be complemented with a
powerful compiler that can make use of additional resources for
performance optimizations. Without a compiler, configurable DSPs
and configurable processors will only contribute to a lengthy
design cycle.
- Location and mix: By changing the organization of the
resources (without any addition of resources) the configurable DSP
can provide different levels of performance for different
applications. For example, register location (to avoid spills) can
have significant influence over inner loop performance. Again, a
strong compiler is a crucial component in using such a
configuration.
- Custom Instructions: Every application has some operations
(mathematical, or otherwise) that are less suitable for a given
DSP. It is not practical for a single DSP to include all types of
such instructions or operations. Nor is it practical for DSP
vendors to design in advance all such operations and instructions.
A configurable DSP gives the user the option of inserting custom
instructions into the DSP for the benefit of the application.
Fig 2: DDCU (with custom instructions) in a Jazz DSP
Performance is but one dimension of optimization for a specific
product or application. In some cases, power consumption or area
(cost) are priority objectives. Configurable DSPs allow trade-offs
between performance, area and power. For a given level of performance,
a configurable DSP can be tuned for the lowest area and power
consumption achievable in a programmable DSP.
Unlike standard DSPs, where higher performance means a higher clock
rate and higher power consumption, the use of custom instructions in
configurable DSPs may actually result in higher performance and lower
power consumption, because fewer clock cycles and less logic are used
to compute the same result.
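To make the cycle-count argument concrete, here is a minimal sketch in C. The sum of absolute differences (SAD), common in video encoding, is an example chosen for illustration; the article does not name it. `sad4_generic` spells the work out as separate subtract, select and add operations per byte, while `sad4_custom` is a software model of a hypothetical custom instruction that would do the same work on four packed bytes in one operation. Both function names are assumptions and are not part of any Improv tool or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic version: every byte costs a subtract, a select and an add. */
uint32_t sad4_generic(const uint8_t a[4], const uint8_t b[4])
{
    assert(a != NULL && b != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < 4; i++) {
        int diff = (int)a[i] - (int)b[i];
        sum += (uint32_t)(diff < 0 ? -diff : diff);
    }
    return sum;
}

/* HYPOTHETICAL: software model of a custom instruction that takes two
 * 32-bit words, each holding four packed bytes, and returns the sum of
 * the four absolute byte differences. In hardware the four lanes would
 * be computed side by side; here they are simply looped over. */
uint32_t sad4_custom(uint32_t packed_a, uint32_t packed_b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (packed_a >> (8 * lane)) & 0xFFu;
        uint32_t y = (packed_b >> (8 * lane)) & 0xFFu;
        sum += (x > y) ? (x - y) : (y - x);
    }
    return sum;
}
```

Both functions return the same value for the same four byte pairs. The difference is only in how many instruction slots and cycles the work would occupy if the second form were backed by dedicated logic.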
The flexibility and productivity afforded by a configurable DSP are
of little use without programmability. Two related capabilities are
significant in considering processor programmability.
- Instruction execution: The DSP executes the application from
an instruction stream (object code). This capability is now
mandated for products and applications that support multiple
standards or formats, since it is impractical to use
fixed-function DSPs to support a large number of formats or formats
that are not yet known. By writing and running a new instruction stream,
programmable DSPs can be used in more situations and cases.
- High level language: C and Java. It is hard to over-emphasize
the importance of high-level language programmability for any
processor. High-level language provides productivity (easier to
write and verify), flexibility (easier to change), and
maintainability (easier to debug and modify). Without high level
language, programmable cores are of reduced value. The key in
gleaning benefits from high-level language support is the
availability of an efficient compiler and optimizer.
The configurable Jazz DSP is a Very Long Instruction Word (VLIW)
programmable DSP. The Jazz DSP is unique among programmable DSPs in
that it is configurable and scalable. It is possible to customize the
Jazz DSP for application specific requirements and optimizations.
Jazz is also unique among VLIW processors because it has an efficient
high-level (C and Java) optimizing compiler.
The Jazz DSP was also designed to scale for the computational needs
of an application. The user can configure Jazz such that it has just
the right number of parallel execution units that fit the performance
requirement of a given application. It is also possible to organize
multiple processors to work together in an array.
Fig 3: Jazz 2020 - high performance, low power DSP core
Because Jazz is a VLIW DSP, it is built from multiple ALU-like
structures called Computational Units (CUs), each controlled by an
instruction slot. When two CUs share an instruction slot they are
said to be overlaid and cannot be used simultaneously. The CUs
provide the mix of instructions available in the Jazz processor, and
the organization of CUs in the instruction slots governs the number
and type of instructions available in each cycle.
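The slot and overlay rule can be expressed as a small data model. The sketch below is an assumption made for illustration: the `cu_t` structure, its fields and both function names are hypothetical and do not describe the actual Jazz instruction encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* HYPOTHETICAL model: each computational unit (CU) is bound to the one
 * instruction slot that controls it. */
typedef struct {
    const char *name;
    int slot;   /* index of the controlling instruction slot */
} cu_t;

/* Two distinct CUs that share a slot are overlaid. */
bool cus_overlaid(const cu_t *a, const cu_t *b)
{
    assert(a != NULL && b != NULL);
    return a != b && a->slot == b->slot;
}

/* Returns true if the n CUs requested for one cycle can issue together,
 * i.e. no two requests are controlled by the same slot. */
bool can_issue_together(const cu_t *const *req, size_t n)
{
    assert(req != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (req[i]->slot == req[j]->slot)
                return false;
    return true;
}
```

With an ALU on slot 0 and a MAC and a shifter overlaid on slot 1, the ALU and the MAC can issue in the same cycle, but the MAC and the shifter cannot.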
Typical CUs are ALU units with arithmetic operations, MAC units,
shift units, counters, and other instruction units. The Jazz DSP
supports 32-bit and 16-bit data paths in fixed-point arithmetic. Most
CUs support SIMD operation on byte and half-word data.
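As an illustration of what SIMD operation on byte and half-word data means, the following sketch emulates a lane-wise add in portable C: one 32-bit word is treated as either two 16-bit lanes or four 8-bit lanes, and each lane wraps around independently. The function name `simd_add` and the wraparound behaviour are assumptions; the article does not specify whether Jazz CUs wrap or saturate.

```c
#include <assert.h>
#include <stdint.h>

/* Adds two 32-bit words lane by lane (lane_bits = 8 or 16).
 * Carries never cross a lane boundary; each lane wraps on overflow. */
uint32_t simd_add(uint32_t a, uint32_t b, unsigned lane_bits)
{
    assert(lane_bits == 8 || lane_bits == 16);

    /* Mask that selects the top bit of every lane. */
    const uint32_t msb = (lane_bits == 8) ? 0x80808080u : 0x80008000u;

    /* Add the lower bits of each lane. Any carry lands in the lane's own
     * cleared top bit and therefore cannot reach the next lane. */
    uint32_t low = (a & ~msb) + (b & ~msb);

    /* Add the top bits modulo 2, discarding the carry out of each lane. */
    return low ^ ((a ^ b) & msb);
}
```

For example, `simd_add(0x0001FFFF, 0x00010001, 16)` gives `0x00020000`: the low lane wraps from `0xFFFF` to `0x0000` without disturbing the high lane.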
In a similar way, Memory Interface Units (MIUs) are also organized
and controlled by instruction slots. MIUs provide for data access and
address generation. Jazz supports a 16-bit address space.
Registers are distributed in the Jazz DSP without the use of a
single register file. This puts computational results close to where
they are generated and consumed. It also allows more flexibility for
the compiler in the process of allocating data to storage
locations.
A data communication block facilitates data movement between
storage locations and inputs to CUs. This block includes several
MUXed buses that are under the control of the compiler.
Applications that require digital signal processing typically
require very high bandwidth of computation. For example, a simple
transformation of 16-bit and 32-bit words of data can take several
hundred separate computations such as multiplication and
summation. That, coupled with large amounts of data, dictates a need
for a very efficient computation platform.
One approach to getting many computations accomplished each second is
to perform as many operations as possible simultaneously. VLIW
processors are designed specifically for this goal.
Since each instruction-cycle word contains multiple slots, it is
possible (for the compiler) to specify multiple actions each cycle,
one for each slot. The result is a processor that can
execute several operations per cycle. VLIW is especially useful for digital
signal processing since DSP operations tend to be regular and
repetitious with little or no control code.
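A typical example of such regular, repetitious code is a fixed-point dot product, the core of FIR filters and many transforms. The sketch below assumes 16-bit samples in Q15 format (a common DSP convention, not something the article specifies); the function name `dot_q15` is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of two Q15 vectors. Each Q15 x Q15 product is a Q30 value,
 * and the products are summed in a 64-bit accumulator, so the result is
 * returned in Q30. */
int64_t dot_q15(const int16_t *x, const int16_t *h, size_t n)
{
    assert((x != NULL && h != NULL) || n == 0);
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Per iteration: two loads, one multiply, one accumulate.
         * There are no branches other than the loop itself. */
        acc += (int32_t)x[i] * (int32_t)h[i];
    }
    return acc;
}
```

Apart from the accumulator, the iterations are independent of one another, so the loads, the multiply and the add of neighbouring iterations are candidates for packing into the same instruction word.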
VLIW for configurable DSP
Jazz is a pure VLIW architecture, where one slot in the
instruction word controls each operation. This type of VLIW
architecture is very easy to control and understand and typically
will yield very high performance. In addition, a pure VLIW
architecture is also very easy to extend and configure. We take full
advantage of this capability in our design of the Jazz DSP.
The most obvious parameter to control in a VLIW is the number of
slots or, in other words, the instruction-level parallelism.
By adding or removing slots in the instruction (and the corresponding
CUs in Jazz) one can increase or decrease the number of operations
that are available in each cycle. Since CUs and MIUs are controlled in
much the same way, it is also possible to add memory interface units.
At the same time, adding CUs to Jazz varies the mix of available
resources in the processor, which improves the compiler's ability to
schedule parallel operations efficiently.
That single capability, to add or remove slots from the
instruction, allows the use of the Jazz DSP in a very large number of
configurations and can result in an optimal processor for a given
application.
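The effect of adding or removing slots can be estimated with a standard resource bound: for each class of unit, the loop body needs at least ceil(operations / units) cycles, and the largest of these values is a lower bound on the schedule length. The sketch below is a simplified model under that assumption; it ignores data dependences and latencies, and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles needed by one resource class: ceil(ops / units). */
static unsigned ceil_div(unsigned ops, unsigned units)
{
    assert(units > 0);
    return (ops + units - 1) / units;
}

/* ops[i]   = operations of class i in the loop body
 * units[i] = units of class i that can issue in one cycle
 * Returns the resource-limited lower bound on cycles. */
unsigned min_cycles(const unsigned *ops, const unsigned *units, size_t classes)
{
    unsigned bound = 0;
    for (size_t i = 0; i < classes; i++) {
        if (ops[i] == 0)
            continue;               /* unused class imposes no limit */
        unsigned c = ceil_div(ops[i], units[i]);
        if (c > bound)
            bound = c;
    }
    return bound;
}
```

For a body with 8 multiply-accumulates, 16 memory accesses and 4 ALU operations, a configuration with 1 MAC, 2 memory and 2 ALU units is bounded at 8 cycles. Doubling only the MAC units leaves the bound at 8, because memory access becomes the limit; doubling the memory units as well brings it down to 4. This is why the mix of resources matters as much as their number.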
For data- and computation-intensive applications, computational
resources are but part of the problem; another is temporary storage.
In the Jazz DSP, it is possible to add, remove and even reorganize
data and address registers in different locations in the architecture
to allow the compiler to make better use of temporary storage.
Since Jazz is a pure VLIW, it is also possible to insert into a
slot a custom CU. Such a Designer Defined Computational Unit (DDCU)
can include custom instructions and can add to the available
vocabulary of the compiler. This capability can further optimize the
DSP for the needs of an application.
Flexibility, parallelism, and configurability come with a cost. In
this case, the cost is complexity of targeting high-level code to the
Jazz DSP. Supporting such flexibility without a good
compiler would be impractical.
The Jazz System Compiler is a VLIW optimizing compiler that can
turn sequential C code into VLIW instructions using as many
operations per cycle as possible. The compiler can also 'understand'
changes to the Jazz DSP configuration (additions or deletions) and
can even generate code for DDCUs inserted by the user for application
specific instructions.
Configuring and optimising
The goal of a configurable, scalable DSP, as described in the
previous sections, is to enable a good match between applications and
compute platform. This goal would be hard to attain without a
complete methodology that supports the user in the quest for
optimizations. This flow is illustrated in Fig 4.
Fig 4: Design flow for configurable DSP
The star of the show, so to speak, is the digital signal
processing application. That application is expressed in C or Java
high-level code and is verified and tested on the host. In addition
to the functional verification, it is critical to have a specific
goal in mind that can drive the process of optimization. The goal may
be oriented towards performance, cost, area, or power consumption. In
many cases the goal is a combination of all of the above.
The application and the end product will also have an impact on
the specific Jazz DSP platform that is used as a starting point. For
example, for a handheld device that requires low power and an
application that requires 500-1000 MIPS, the choice will be a single
processor with a low level of parallelism. For a broadcast video
application with high screen resolution, the selection may be a
high-end, two-processor platform.
In some ways, the starting point will not affect the end result,
but will make the process of convergence faster and easier.
Mapping
The Jazz PSA System compiler is an optimizing VLIW compiler. Given
a specific application and a specific Jazz DSP, the compiler will map
the C code to object code for the specific Jazz DSP. The process of
mapping is global and contains many optimizations and transformations,
all aimed at producing the best fit between the source
code and the Jazz DSP under consideration. The overriding goal is
performance. The compiler will try to minimize the number of cycles
required to execute a given application. Another consideration is
space. The compiler will work to minimize the space required for
instructions (object code) and for data.
As noted before, the most effective way of optimizing a DSP
algorithm is to do as much work as possible in each cycle. Since the
amount of work a given Jazz DSP can do in a cycle is known, the job
of the compiler is to fill each cycle with useful work and to use as
few cycles as possible.
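The idea of filling each cycle with useful work can be sketched as a greedy scheduler. This is a deliberately simplified illustration, not the algorithm used by the Jazz compiler: it assumes every operation takes one cycle, any operation can go in any slot, and operations are listed so that producers precede consumers. The type `op_t` and the function `schedule` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OPS 64

/* One operation with up to two producers (index of an earlier op, or -1). */
typedef struct {
    int dep[2];
} op_t;

/* Places each op in the earliest cycle that follows its producers and
 * still has a free slot. cycle_of[i] receives the cycle chosen for op i.
 * Returns the total number of cycles used. */
unsigned schedule(const op_t *ops, size_t n, unsigned slots, unsigned *cycle_of)
{
    unsigned used[MAX_OPS] = {0};   /* slots already filled per cycle */
    unsigned total = 0;

    assert(n <= MAX_OPS && slots > 0);
    for (size_t i = 0; i < n; i++) {
        unsigned c = 0;
        for (int k = 0; k < 2; k++) {
            int d = ops[i].dep[k];
            assert(d < (int)i);     /* producers must come first */
            if (d >= 0 && cycle_of[d] + 1 > c)
                c = cycle_of[d] + 1;
        }
        while (used[c] == slots)    /* skip cycles that are already full */
            c++;
        used[c]++;
        cycle_of[i] = c;
        if (c + 1 > total)
            total = c + 1;
    }
    return total;
}
```

For four independent multiplies followed by a three-add reduction tree (seven operations), this model needs 7 cycles with one slot, 4 cycles with two slots and 3 cycles with four slots. Beyond that point the dependence chain, not the slot count, sets the limit.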
The compiler can do its best, but sometimes there are parts of the
application that just don't map well to a given Jazz DSP. The process
of configuration starts with providing the user with data to point to
areas that can be improved:
- High percentage code: Portion of code that consumes large
numbers of cycles (dynamically) and thus deserves more
attention.
- Low utilization code: Portion of the code where the compiler
was unable to make good use of the resources of the Jazz DSP.
- Poorly used resources: Resources in the Jazz DSP that are
under used or that could be eliminated. Other feedback may point
to resources that can be organized in different ways to better
take advantage of parallelism or to conserve space and time.
- Missing resources: Resources (registers, instructions) that if
added can make a measurable improvement in the performance of the
application.
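The first two kinds of feedback in the list above reduce to simple ratios. The sketch below shows one plausible way to compute them from per-region counts; the `region_t` structure and both function names are assumptions and do not describe the actual output of the Jazz tools.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL per-region profile record. */
typedef struct {
    const char *name;
    unsigned long cycles;   /* cycles spent in the region at run time */
    unsigned long ops;      /* useful operations issued in the region */
} region_t;

/* "High percentage code": share of total run time, in percent. */
double cycle_share_pct(const region_t *r, unsigned long total_cycles)
{
    assert(r != NULL && total_cycles > 0);
    return 100.0 * (double)r->cycles / (double)total_cycles;
}

/* "Low utilization code": fraction of available issue slots that were
 * filled with useful operations (1.0 means every slot in every cycle). */
double slot_utilization(const region_t *r, unsigned slots)
{
    assert(r != NULL && slots > 0 && r->cycles > 0);
    return (double)r->ops / ((double)r->cycles * (double)slots);
}
```

A region that scores high on the first measure and low on the second is the most promising target: it dominates run time and leaves slots empty.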
Using results from compilation and profiling, the user can then make
changes in the code and configuration. Changes may include simple
loop rewrites, changes to the types used, or other source-level changes
that make the work of the compiler more straightforward.
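As an example of a simple loop rewrite, consider a sum that is accumulated through a pointer. This is a general C idiom shown for illustration; whether and how much the Jazz compiler benefits from it is an assumption, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before: the running sum lives in memory behind *out. Since out may
 * point into x, the compiler has to assume that each store can change
 * a later load, which forces a load and a store on every iteration. */
void sum_before(const int32_t *x, size_t n, int32_t *out)
{
    assert(x != NULL && out != NULL);
    *out = 0;
    for (size_t i = 0; i < n; i++)
        *out += x[i];
}

/* After: the running sum is a local variable that can stay in a
 * register, and memory is written once when the loop has finished. */
void sum_after(const int32_t *x, size_t n, int32_t *out)
{
    assert(x != NULL && out != NULL);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    *out = acc;
}
```

Both versions produce the same result when `out` does not point into `x`, but the second gives the compiler a loop body with one load and one add and no store to schedule around.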
The user can also make changes to the configuration of the Jazz
DSP processor. The user can add or remove resources, can re-organize
the resources in different ways, and can insert new custom
instructions that best fit the needs of the application.
Making trade-offs
The process is by its nature iterative. The user makes changes,
compiles, and profiles, getting closer to the target with each
iteration. The key in this process is to keep front-and-center the goals
for optimization and to understand how changes in one area (e.g. to
improve performance) will affect other areas such as power
consumption. The user is aided in this process by sophisticated
tools that give accurate and early predictions of the impact of a
change in the configuration.
Embedded DSP applications require the best performance at low power
consumption but without loss of flexibility. Programmable,
configurable DSP processors like the Jazz DSP are ideal for this
task. Supported by an advanced optimization methodology and a superb
compiler, the Jazz DSP is the first member in a generation of new
cores that provide flexibility and optimization together with a rapid
design methodology.
Oz Levia is the chief technical officer, senior vice president of
field operations and a founder of Improv Systems. He is the
editor-in-chief of a series of books entitled Current Issues in
Electronic Modeling (Kluwer Academic Publishers, currently 11
volumes) and has published over 30 papers on electronic design, high
level synthesis and architectural modelling.
Published in Embedded Systems (Europe) September 2002