Code compression under the microscope

ARM, MIPS, IBM, and ARC offer techniques to reduce memory footprint. Here's how code compression works on each architecture.

When it comes to embedded software, smaller is better. Shrinking code size to fit into cost- or space-constrained memory systems is important business these days. It's not uncommon to spend more money on memory than on the microprocessor, so choosing a processor that's thrifty with memory can pay off—big.

Writing tight code is one thing, but a processor's instruction set affects memory footprints as well. No amount of clever tweaking of your C source code will make up for a chip that has lousy code density. If you're concerned about memory usage, it's smart to choose the processor first and then fine-tune the software.

Who's got it, who doesn't
Not every processor has code compression or needs it. Only 32-bit RISC processors need compression, because only they have poor code density to begin with. RISC was designed for general-purpose computers and workstations, with the understanding that memory is cheap. But while memory may be cheap, less memory is cheaper. For makers of cellular telephones and other cost-constrained embedded systems, a $5 difference in RAM or ROM can make a big difference in volume profits. Often, memory size is fixed, and the product's feature set is variable. Tighter object code means more autodial features, better voice recognition, or perhaps clearer images on the screen.

Among 32-bit embedded processors, ARM, MIPS, and PowerPC were the first to develop code-density tricks to reduce their memory footprints. Older 32-bit families, such as Motorola's wheezing 68K and Intel's x86 dynasties, don't need code compression. Indeed, their standard code density is still better than the compressed modes of the RISC chips.

Under my Thumb
We'll start with ARM's code-compression scheme since it's popular, well supported, and typical of what other processors offer. Thumb is pretty straightforward and works well, within limits.

Thumb is really a second, independent instruction set grafted onto ARM's standard RISC instruction set. You toggle between the two instruction sets through a mode-switch instruction in your code. The Thumb instruction set architecture (ISA) is made up of 16-bit instructions, about 36 in all. That's not enough instructions to do much, but Thumb includes the basic add, subtract, branch, and rotate operations. By using these short instructions in place of ARM's normal 32-bit instructions, you can cut the size of some code by perhaps 20 to 30%. But there are a few tricks to it.

First, you can't intermix Thumb code with normal ARM code. Instead, you have to expressly jump between the two modes as if Thumb were a completely different instruction set—which it is. This forces you to segregate all your 16-bit code into independent blocks, separate from your 32-bit code.

Second, because Thumb is a simplified and abbreviated ISA, you can't always do everything you'd like in Thumb mode. Thumb can't handle interrupts, exceptions, long-displacement jumps, atomic memory transactions, or coprocessor operations, for example. Thumb's limited repertoire means it's only useful for basic arithmetic or logical operations. Everything else has to be done using ARM's standard 32-bit instructions.

Thumb limits more than just the instruction set. While in Thumb mode, ARM processors have only eight registers (instead of 16); they can't do conditional execution and they can't shift or rotate operands the way normal ARM code can. Passing parameters between ARM code and Thumb code isn't difficult, as long as they're passed on the stack or through the processor's first eight registers.

Changing in and out of Thumb mode also takes time and, ironically, adds code. A few dozen bytes of preamble and postamble are needed to organize pointers and flush the CPU pipeline. The overhead isn't worth the trouble unless you remain in Thumb mode for several dozen instructions at a time.
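To put rough numbers on that break-even point, here is a back-of-the-envelope sketch in Python. The function name and the sample figures (32 bytes of wrapper code, Thumb needing 1.3 instructions for each ARM instruction) are assumptions chosen for illustration, not measurements from any real toolchain.

```python
import math

def thumb_break_even(overhead_bytes, expansion=1.0):
    """Smallest number of 32-bit ARM instructions a routine must replace
    before switching to Thumb saves any bytes at all.

    overhead_bytes: size of the mode-switch preamble and postamble (assumed)
    expansion: Thumb instructions needed per ARM instruction (assumed, >= 1)
    """
    # Each ARM instruction is 4 bytes; its Thumb replacement costs
    # 2 bytes times the expansion factor.
    saved_per_instruction = 4 - 2 * expansion
    if saved_per_instruction <= 0:
        return None  # Thumb never pays off at this expansion ratio
    return math.floor(overhead_bytes / saved_per_instruction) + 1

print(thumb_break_even(32))       # ideal one-for-one replacement
print(thumb_break_even(32, 1.3))  # Thumb needs 30% more instructions
```

With these assumed figures the routine has to cover roughly two dozen ARM instructions before the switch saves anything, and the threshold climbs quickly as the expansion factor grows.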

Finally, Thumb has odd effects on performance. As a rule of, er, thumb, your code will run about 15% slower by using Thumb code. That's caused mostly by the overhead of switching between 16- and 32-bit modes. Thumb instructions are also less flexible than their 32-bit counterparts, so you'll often need more of them to do the work of existing 32-bit instructions. On the plus side, Thumb makes caches more effective because the instructions are only half as long.

If you can work within these limitations, Thumb is like found money. It's included in every ARM processor, so the capability is there whether you use it or not. Most ARM compilers and assemblers also support the Thumb instruction set (and generate the necessary call-in and call-out wrappers), so experimenting with Thumb can be pretty painless.

MIPS puts a toe in the water
Once you understand Thumb, MIPS16e is no surprise. MIPS Technologies added a second, 16-bit instruction set to some of its processors that works very much like ARM's system. The MIPS16e instruction set includes a bunch of 16-bit shorthand versions of standard MIPS arithmetic, logic, and branch operations. As with Thumb, you have to switch in and out of MIPS16e mode, a process that incurs some overhead penalty in time and code space. You really only want to switch modes if you can spend a significant amount of time in “compressed” mode. And, as with Thumb, the space savings amount to 20 to 30% for most programs.

Neither MIPS16e nor Thumb really compresses code. They just offer alternative opcodes for some operations, and the amount of “compression” you see depends on the ratio of short opcodes to long ones. That, in turn, depends on what your code is doing. System-level code, like operating systems and interrupt handlers, can't use 16-bit instructions at all, so it won't see any benefit. Pedestrian arithmetic, as long as it doesn't use any large values, should compress pretty well. And don't forget that data doesn't compress at all, only code. Your overall memory savings may be tiny if your application includes a lot of static data structures, and the 15% performance hit may not be worth it. On the other hand, MIPS16e and Thumb are both free (assuming your processor includes them), so it costs little to try them out.
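The arithmetic behind that warning is easy to sketch. The function below is a simplified model, not a vendor tool: it assumes every shortened instruction replaces exactly one 32-bit instruction, ignores mode-switch wrappers, and treats data as incompressible. The name image_savings and the sample sizes are made up for illustration.

```python
def image_savings(code_bytes, data_bytes, short_fraction):
    """Fraction of the total memory image saved when `short_fraction` of the
    32-bit instructions are replaced one-for-one by 16-bit instructions.
    Data is assumed not to shrink at all."""
    if not 0.0 <= short_fraction <= 1.0:
        raise ValueError("short_fraction must be between 0 and 1")
    # Each shortened instruction drops from 4 bytes to 2, i.e. saves half.
    compressed_code = code_bytes * (1.0 - short_fraction / 2.0)
    before = code_bytes + data_bytes
    after = compressed_code + data_bytes
    return (before - after) / before

# Same code, with and without a large block of static data.
print(image_savings(100_000, 0, 0.5))
print(image_savings(100_000, 300_000, 0.5))
```

For 100,000 bytes of code with half the instructions shortened, the model gives a 25% saving on its own but only 6.25% once 300,000 bytes of static data share the same memory.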

PowerPC CodePack
IBM's approach is, predictably, the most complex of the bunch. Unlike Thumb and MIPS16e, IBM's CodePack system really does compress executable code. It's like running WinZip on your PowerPC software. CodePack analyzes and compresses entire programs, producing a compressed version that has to be decompressed and executed on the fly. For all its complexity, CodePack delivers about the same 20 to 30% space savings as the others.

CodePack is fascinating technology, though. To use it, you compile your embedded PowerPC code in the normal way, using standard tools. CodePack even works on existing code, with or without source code. Before you burn your code into ROM (or load it on disk), you run it through the CodePack compression utility. This analyzes instruction distributions and produces a pair of keys unique to that program. When you run your compressed program, a CodePack-equipped processor uses the keys to decompress the code on the fly, as it executes. The decompression adds a tiny amount of latency to the processor's pipeline, but its effects are hidden under fetch latency and other natural effects. For most intents and purposes, CodePack's performance effects are negligible.

CodePack has some odd side effects, however. Since every compressed program produces a different set of compression keys, CodePack is essentially an encryption system as well as a compression system. Without the keys, you can't run your own program—but neither can anyone else. If you lose (or simply withhold) the keys, your program is useless gibberish. This also means that compressed PowerPC programs are not binary compatible. You can't just casually exchange compressed programs with other systems unless you also include their decompression keys. This could make software distribution a bit tricky for embedded systems in the field.

By the way, CodePack produces two keys for every program because the upper and lower 16 bits of each instruction are compressed separately. IBM engineers discovered that the upper half of each PowerPC instruction (which holds the opcode bits) has a different frequency distribution than the lower half (which typically holds constants, displacements, or masks). Using two different compression algorithms produced better results than any single algorithm, so that's what CodePack does.
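The idea of treating the two halves as separate streams can be illustrated with a toy measurement. This is not IBM's algorithm: it only splits 32-bit words into upper and lower halfwords and computes the Shannon entropy of each stream, which indicates how well a per-stream code table could do. The sample words are made up for the demonstration.

```python
from collections import Counter
import math

def split_halves(words):
    """Split 32-bit instruction words into upper and lower 16-bit streams."""
    upper = [(w >> 16) & 0xFFFF for w in words]
    lower = [w & 0xFFFF for w in words]
    return upper, lower

def entropy_bits(symbols):
    """Shannon entropy in bits per symbol: a lower bound on the average
    code length of any symbol-by-symbol coder for this stream."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Made-up sample: the upper halves repeat often, the lower halves vary more.
words = [0x38630001, 0x38630004, 0x38630010, 0x38630020,
         0x90610008, 0x9061000C, 0x80610008, 0x80610010]
upper, lower = split_halves(words)
print(entropy_bits(upper), entropy_bits(lower))
```

In this toy sample the two streams have clearly different statistics, which is the kind of difference that would justify giving each half its own table rather than sharing one.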

ARCompact—old and new
ARC International (my one-time employer) took yet another approach to code compression. Because the ARCtangent processor has a user-definable instruction set, ARC (and its customers) can make any changes to the instruction set they like. In the case of ARCompact, the company decided to add a handful of 16-bit instructions to improve code density.

Where ARCompact differs from Thumb and MIPS16e, however, is that it can freely intermix 16-bit and 32-bit instructions. With no mode switching, there's no overhead and no penalty for tossing in a few 16-bit instructions here and there. ARC's compilers now emit 16-bit operations by default whenever possible. (You'd want to turn this feature off to force 32-bit alignment of code or for compatibility with older processors.)

ARC can get away with mixing instruction sizes because its ISA is newer than that of either ARM or MIPS. Those RISC architectures (along with PowerPC) have no bit in the instruction word to determine size. Newer pseudo-RISC architectures like ARC's and Tensilica's, and older architectures like x86 and 68K, have such bits. Whether through foresight or accident, variable-length instructions are now paying dividends through tighter code.

Thumb-2 improves on Thumb
Just recently ARM overhauled its code-compression system and released Thumb-2. Despite its name, Thumb-2 is not an upgrade of Thumb; it's a complete do-over and could make Thumb and the original ARM instruction set obsolete. Thumb-2 operates somewhat like ARCompact or Motorola's 68K in that you can mix 16-bit and 32-bit instructions without mode switching. Overall, Thumb-2 offers a little less compression than Thumb but with less of a performance hit.

To accomplish this sleight of hand, ARM needed a hole in its opcode map. It found one in BL (branch and link), the instruction that switches between Thumb and ARM modes. Conveniently, BL has some unused opcode bits; those previously undefined bit patterns now provide an escape hatch into a whole new instruction set. The encoding isn't pretty, but it works.
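A minimal sketch of how a decoder can tell the two sizes apart without a mode switch, assuming the rule given in ARM's architecture documentation: a halfword whose top five bits are 0b11101, 0b11110, or 0b11111 is the first half of a 32-bit instruction, and anything else is a complete 16-bit instruction. The function name and the sample halfwords are illustrative only.

```python
def thumb2_instruction_lengths(halfwords):
    """Walk a list of 16-bit halfwords and return each instruction's
    length in bytes, using only the top five bits of its first halfword."""
    lengths = []
    i = 0
    while i < len(halfwords):
        top5 = (halfwords[i] >> 11) & 0x1F
        if top5 in (0b11101, 0b11110, 0b11111):
            # First half of a 32-bit instruction: consume two halfwords.
            if i + 1 >= len(halfwords):
                raise ValueError("truncated 32-bit instruction")
            lengths.append(4)
            i += 2
        else:
            # Ordinary 16-bit instruction.
            lengths.append(2)
            i += 1
    return lengths

# Sample stream: 16-bit, then a 32-bit pair, then 16-bit again.
print(thumb2_instruction_lengths([0x4608, 0xF000, 0xF800, 0x4770]))
```

Because the size is decided one instruction at a time, short and long instructions can sit next to each other in the same routine, which is what removes the wrapper overhead of the original Thumb.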

The biggest advantage of Thumb-2 is that it's a complete ISA. Programs need never switch back to “normal” 32-bit ARM mode. Gone are the limitations of the original Thumb mode; programs can now handle interrupts, set up MMUs, manage caches, and generally behave like real microprocessors.

Thumb-2 still exacts a toll in performance. Even though there's no mode-switching overhead, it still takes more Thumb-2 instructions to perform certain tasks compared with standard ARM code. Those extra instructions (and extra cycles) add up to a 15 to 25% speed penalty, according to ARM.

Future ARM processors may eventually run nothing but Thumb-2 code. Since it effectively replaces both the ARM and Thumb instruction sets with a single, more compact instruction set, why wouldn't it? The question then becomes, what will happen to ARM's software compatibility? Up until now, all ARM processors (with the exception of Intel's XScale) have been binary compatible. Although new processors with Thumb-2 will be able to run older ARM and Thumb code, the reverse is not true. When Thumb-2 becomes widespread it will create a separate but equal software base.

Compress it
We all like options, and every processor's code-compression system is optional. Whether it's for PowerPC, MIPS, or another processor, you can choose whether or not to enable code compression. All three major brands trade off some performance in return for smaller code, but with today's fast processors that's a fair exchange. In the case of MIPS16e and Thumb, you can pick and choose which parts of your code to compact and which to leave alone; CodePack is more of an all-or-nothing play.

In all cases, you'll be happiest experimenting with code compression before you commit to using it. Most compression schemes have odd effects on cache and memory performance, and some programs squash better than others. If you can budget yourself some time to experiment, you may get a good return by saving memory costs.

Jim Turley is an independent analyst, columnist, and speaker specializing in microprocessors and semiconductor intellectual property. He is a past editor of Microprocessor Report and Embedded Processor Watch.
