This “Product How-To” article focuses on how to use a certain product in an embedded system and is written by a company representative.
Performance is an issue constantly raised about the Java platform. Java's portability also comes at a price: bytecode must always undergo some form of conversion to run on the native instruction set of the underlying architecture. The feature-rich demands of next-generation Java applications will quickly outstrip the capabilities of current mass-market Java handsets.
Hardware graphics accelerators, increasing processor clock speeds, and fast data transfer rates are all changing the application types that can run on mobile devices. If Java is to keep pace, Java platform performance must improve and a powerful Java Virtual Machine (JVM) must be used.
|Figure 1: To reduce die size and improve performance, Jazelle DBX is implemented in the ARM pipeline as a finite state machine rather than a traditional microcode engine.|
Traditional methods of improving Java execution speed include software solutions – such as optimized JVMs, just-in-time (JIT) or ahead-of-time (AOT) compilers – and hardware solutions – such as dedicated Java processors and Java co-processors.
Depending on the system, high speed levels can be achieved using these methods. However, delivering this performance on an embedded platform has typically come at a cost in power, memory or platform price. JIT and AOT compilers compile code for immediate execution on the target device. An AOT compiler compiles all code after application download, some of which may never even run during execution.
A JIT compiler, meanwhile, compiles code “on sight” – i.e. just prior to execution. On an embedded device, JIT compilation causes a delay between an application's launch and its actual run. Research has likewise shown that dynamically compiled code expands four to six times. So, in addition to slow application start-up with a JIT, extra memory is required for the code compiled by JIT and AOT solutions.
Using a hybrid software solution is one way to address the issues with JIT compilers on embedded systems. This approach is often referred to as a dynamic adaptive compiler (DAC), which combines a JIT compiler with a bytecode interpreter.
Bytecodes are initially executed by interpretation, while the software profiles the code and determines key code sections to be compiled. Once compiled, these key code sections are run as native code.
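As a rough illustration of that interpret-profile-compile cycle, the sketch below counts method invocations and promotes a method to "compiled" once it crosses a hot threshold. This is a minimal model in Java; the class name, threshold value and data structures are invented for the example and do not describe any particular commercial VM.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of a dynamic adaptive compiler's profiling decision.
// HOT_THRESHOLD and all names here are illustrative assumptions.
public class DacProfiler {
    static final int HOT_THRESHOLD = 10;
    static final Map<String, Integer> invocationCounts = new HashMap<>();
    static final Map<String, Boolean> compiledMethods = new HashMap<>();

    // Called each time a method is invoked by the VM's dispatch loop.
    static void invoke(String method) {
        if (compiledMethods.getOrDefault(method, false)) {
            return; // fast path: run previously compiled native code
        }
        // slow path: interpret the bytecode and update the profile
        int count = invocationCounts.merge(method, 1, Integer::sum);
        if (count >= HOT_THRESHOLD) {
            compiledMethods.put(method, true); // compile this key section once
        }
    }

    static boolean isCompiled(String method) {
        return compiledMethods.getOrDefault(method, false);
    }
}
```

A run-once initialisation method never crosses the threshold and so is never compiled, while a frequently called method such as a render loop is promoted to native code – exactly the trade-off a DAC makes.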
Using a DAC may diminish JIT compiler problems, but the maximum speed achievable with a true JIT is compromised, and start-up time and code bloat may still be significant. Until code has been profiled, an application will run in a slow interpreter mode, then pause to generate compiled code.
When an application is launched, many methods are run only once, so they should not be compiled. Compiling them anyway significantly impacts user experience, particularly at application start-up, when a device appears unresponsive for a long period.
Because software interpretation is very slow, most DAC solutions do little profiling and compile almost all methods immediately, gambling that each method will be executed not just once but many more times. This can be a costly gamble – not only is time consumed by the overhead of compiling, but the compiled code also uses up memory for no gain.
Finally, when memory runs low, a DAC must discard previously compiled code and may later have to recompile it or compile new code. This usually leads to efficiency problems, as an application pauses while the DAC is compiling.
This can be seen when a user moves to a new scene in a game, for example. Despite these drawbacks, solutions such as DAC and even partial AOT become attractive as embedded devices increase in capability, particularly in available RAM and ROM.
However, a parallel trend is for more system platforms to be written in Java, more downloaded applications to become Java applications, and multiple Java applications to be able to run concurrently. Thus, available memory for Java is constantly being pushed to the limit.
Hardware solutions for accelerating Java execution usually require additional silicon footprint and power, as well as external memory – and even then they don't maximize speed.
Dedicated Java processors directly execute Java bytecode within the processor. Although they appear to offer acceptable performance, dedicated Java processors represent a significant overhead and add integration and development complexity.
Because they don't support existing applications or established operating systems, they must always operate alongside another processor. Java co-processors translate Java bytecode into the existing core's instructions. This acceleration process often requires a significant hardware and software integration effort; incorporating it into the existing OS is difficult.
|Figure 2: From a technical perspective, integration is very easy. There is only one program counter, and all of the Java state is held in ARM registers.|
Co-processors are expensive to manufacture; they require extra space for the gates and extra power to operate. They also tend to run relatively slowly because they are loosely coupled with the core processor.
To execute Java bytecode directly in the processor core, architectural extensions can be used. These extensions offer optimal performance along with OS and application compatibility, without requiring extra hardware or memory.
By placing an additional instruction set inside the processor, an architectural extension reuses all existing processor resources without the need to re-engineer the architecture or add cost, power or memory.
An extended core can efficiently run both Java and native code, enabling developers to leverage the existing base of applications and OS expertise while balancing the Java portability and native performance of their application.
ARM has introduced a family of architecture extensions called Jazelle. One product in this family, Jazelle DBX, focuses on Java execution on resource-constrained devices. Jazelle RCT, the latest addition, provides support for JIT, AOT and DAC compilation techniques appropriate to many bytecode-like languages, including Java, .NET MSIL, Python and Perl.
Jazelle DBX technology
Traditionally, ARM processors support two instruction sets – the ARM instruction set, in which all instructions are 32 bits long, and the Thumb instruction set, which compresses the most commonly used instructions into a 16-bit format.
The Thumb instruction set typically offers 35 percent to 40 percent code compression compared to ARM code, which may reduce performance slightly. The instruction set supports procedure calls between ARM and Thumb code, so application programmers typically decide at compile time whether parts of the application should be compiled for performance or code density. The Jazelle DBX architectural extension introduces a third instruction set – Java bytecode.
This instruction set creates a new state in which the processor fetches and decodes Java bytecodes and maintains the Java operand stack. To reduce die size and improve performance, Jazelle DBX is implemented in the ARM pipeline as a finite state machine rather than a traditional microcode engine.
It is also implemented on the processor side of the cache – which has benefits in terms of power consumption and performance – unlike a co-processor or dedicated processor solution, which must implement its own caching. Entering and exiting Java applications is simple and can easily be put under the control of any OS.
However, it is important to ensure Jazelle DBX doesn't affect real-time interrupt performance and compatibility with existing ARM-compliant exception code in operating systems.
|Figure 3: Adding Jazelle DBX to a DAC compiler solution means this technology can be used successfully on more resource-constrained platforms by compiling less or not compiling at all.|
There is a single new ARM instruction – “branch-to-Java” – for entering Java state. First, this instruction performs a test on one of the condition codes. If the condition is met, it puts the processor into Java state, branches to a specified target address and begins executing Java bytecodes.
Once in Java state, the ARM PC is extended to 32 bits to address Java bytecode. Bytecodes are fetched and decoded in two stages (compared to a single decode stage when in ARM/Thumb state). Jazelle DBX performs 32-bit fetches to fetch up to four Java bytecodes at once, which has a significant performance benefit.
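To make that fetch width concrete: one 32-bit word can hold four one-byte Java opcodes. The Java sketch below unpacks such a word, assuming the first opcode sits in the most significant byte; it illustrates the packing only, not the hardware datapath.

```java
// Unpack four one-byte Java opcodes from one 32-bit fetch.
// Byte ordering (first opcode in the most significant byte) is an
// assumption made for this illustration.
public class BytecodeFetch {
    static int[] unpack(int word) {
        return new int[] {
            (word >>> 24) & 0xFF,  // first opcode
            (word >>> 16) & 0xFF,
            (word >>> 8)  & 0xFF,
            word          & 0xFF,  // fourth opcode
        };
    }
}
```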
A current program status register (CPSR) bit records the processor state. This is an important feature, as the CPSR is automatically saved and restored when handling interrupts and exceptions. Hence, any interrupt routine that saves machine state on entry and restores it on exit is automatically compatible with Jazelle DBX.
Jazelle DBX's implementation allows all Java instructions to be restartable. An interrupt can be taken in the middle of an executing Java instruction in such a way that interrupt latency is not affected, ensuring real-time interrupt performance.
In Java state, the processor assigns several ARM registers to functions specific to the Java machine (stack pointer, top four elements of the stack, local variable 0, etc.). This hardware reuse is important in contributing to the small size of the additional logic (12K gates) required to implement the Java machine.
It also has the benefit of keeping all of the state required by the Jazelle DBX technology extension in ARM registers. This ensures compatibility with existing operating systems, interrupt handlers and exception code.
Keeping the top four elements of the stack in ARM registers is important for the processor's performance when executing Java. Application profiling has shown that the working stack depth for most applications is very small. Hence, this technique reduces memory accesses to a minimum. Stack spill and underflow are handled automatically by the hardware.
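The register-held stack can be modelled as a small fixed register file backed by a spill area in memory. The Java sketch below is a behavioural toy, not the hardware design: only the four-slot figure comes from the description above, and the spill policy shown is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model: the top of the operand stack lives in a four-slot "register
// file"; older elements spill to memory only when the registers fill up.
public class RegisterStack {
    static final int REG_SLOTS = 4;
    private final int[] regs = new int[REG_SLOTS];
    private int regCount = 0;                        // live register slots
    private final Deque<Integer> memory = new ArrayDeque<>(); // spill area
    private int memoryAccesses = 0;

    void push(int v) {
        if (regCount == REG_SLOTS) {        // spill oldest element to memory
            memory.push(regs[0]);
            System.arraycopy(regs, 1, regs, 0, REG_SLOTS - 1);
            regCount--;
            memoryAccesses++;
        }
        regs[regCount++] = v;
    }

    int pop() {
        if (regCount == 0) {                // underflow: refill from memory
            regs[regCount++] = memory.pop();
            memoryAccesses++;
        }
        return regs[--regCount];
    }

    int memoryAccesses() { return memoryAccesses; }
}
```

As long as the working depth stays at four or fewer elements, push and pop never touch memory, which is the property the profiling result above makes valuable.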
Jazelle DBX typically increases the performance of a highly optimized commercial JVM by between two and four times when running benchmarks or complex MIDP 2.0 applications. In addition, all Java bytecodes are restartable; thus, there is no overhead on real-time performance.

Other factors to consider
When looking at Java on embedded devices, raw speed performance is not the only factor to consider. Power consumption, memory usage (RAM and ROM), ease of integration, system cost and user experience are all equally important, and achieving the right balance among these constraints is essential.
Jazelle DBX divides Java bytecodes into three classes: directly executed, emulated and undefined. The majority of the Java bytecodes (134 on the ARM926EJ-S processor) are executed directly in hardware; the remainder are emulated by short sequences of highly optimized ARM instructions.
This is typically done by removing the interpreter loop from the virtual machine and replacing it with ARM's proprietary support code, called VMZ, which is no larger than the code taken out.
Application profiling has shown that the emulated bytecodes are encountered less than 5 percent of the time in typical application code. This means that only approximately 12K gates are required to implement the ARM Java extensions – much smaller than most dedicated processors or co-processors, which are typically between 60K and 100K gates.
The minimal complexity of the additional logic required to implement the extensions keeps power consumption low and system integration simple, without significantly compromising performance.
Undefined bytecodes are distinct from emulated bytecodes. Encountering any undefined Java bytecode will cause the processor to leave Java state and return to an exception handler written in ARM code, which is normally part of the VMZ. This also provides a mechanism for supporting future extensions of the Java bytecode set; a software patch can implement a new bytecode function.
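The three-way classification can be pictured as a dispatch table. The Java sketch below is purely illustrative – the opcode values and the handling assigned to them are assumptions, not the real Jazelle DBX tables – but it shows the shape of the decision: execute in hardware, fall back to the VMZ support code, or trap to an exception handler.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative dispatch over the three bytecode classes described above.
// The opcodes registered here are placeholders, not the hardware's table.
public class BytecodeDispatcher {
    enum Handling { DIRECT, EMULATED, UNDEFINED }

    static final Map<Integer, Handling> table = new HashMap<>();
    static {
        table.put(0x60, Handling.DIRECT);   // a simple stack op, say
        table.put(0xB6, Handling.EMULATED); // a complex op handled by VMZ
    }

    static String execute(int opcode) {
        switch (table.getOrDefault(opcode, Handling.UNDEFINED)) {
            case DIRECT:   return "hardware";           // stays in Java state
            case EMULATED: return "support-code";       // short ARM sequence
            default:       return "exception-handler";  // leaves Java state
        }
    }
}
```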
From a technical perspective, integration is very easy. There is only one program counter, and all of the Java state is held in ARM registers. Hence, the extensions are consistent and compatible with existing interrupt and exception models.
In the target applications, there are several established platform operating systems – Windows CE, Symbian OS, Palm OS, Linux, and many real-time and proprietary operating systems. In developing the architecture extensions, ARM has worked with many of the OS vendors to ensure that support for the extensions is available.
|Table 1: Start time is significantly reduced by the addition of Jazelle DBX technology to a dynamic compilation solution.|
Table 1 above shows the start-up time of MultiVector and some typical, complex Java applications from Oplayo and Navitime. These run on a commercial VM incorporating a dynamic compiler with Jazelle DBX enabled, with Jazelle DBX disabled, and with ARM's own JVM running on Jazelle DBX.
The data show that start time is significantly reduced by the addition of Jazelle DBX technology to a dynamic compilation solution, making it almost as quick as an interpreter-only solution with Jazelle DBX.
Use of Jazelle DBX by a DAC solution means the compiler can afford to compile less code and interpret more, saving memory without adversely affecting performance.
Jazelle DBX can also be used to improve the speed performance of a DAC solution by holding off compilation. This is due to the following factors:
1) Holding off compilation means methods that are executed very few times will not be compiled, thus saving compilation time.
2) Rather than compile and run compiled code as quickly as possible and disregard the compiled code's quality, a DAC can use Jazelle DBX to interpret more and take longer to compile optimal code, which will execute even faster.
3) Compiling less means using less memory, which in turn means making better use of cache and available memory bandwidth (particularly important, as many systems today still have a 16-bit data bus).
Similarly, adding Jazelle DBX to a DAC compiler solution means this technology can be used successfully on more resource-constrained platforms by compiling less or not compiling at all. This reduces ROM requirements (typically around 33 percent smaller) and RAM requirements, and allows a single Java solution to be used for many more devices.
Chris Porthouse is Execution Environments Product Manager at ARM Ltd.