What can you build with an integer? - Embedded.com

What can you build with an integer?

Click here for response to this article.

In praise of the lowly integer, this article shows how to make code faster and smaller by using simple counting numbers instead of floating-point math, strings, or arrays. These techniques should serve as a guide for every programmer.

A modern software engineer uses a huge variety of data structures and types: objects and classes, tables and record sets, strings and floating-point numbers. At the same time, simplicity is a virtue in creating software that's economical, efficient, and reliable. In this article I suggest that one simple data type, the integer, can be used to create a large variety of useful data types and some data structures. This article says nothing that most readers don't already know; consider it a reminder of how much you can accomplish with simple tools, how much you can do with an integer.

These techniques can enhance storage and processing efficiency in the computational part of your program. You shouldn't focus on these techniques for improving global program efficiency, however, because disk I/O, network I/O, database accesses, GUI operations, and operating system calls will usually be your main sources of bottlenecks and deserve your initial attention in overall program optimization.

Varieties of integers
Four sizes and two kinds of integers are directly supported by many processors and programming languages, producing eight integer data types that you should understand, all listed in Table 1.

Table 1: The eight common integer data types (language-independent)

Description Size Range
byte ordinal 1 byte, 8 bits 0 to 255
short ordinal 2 bytes, 16 bits 0 to 65,535
ordinal 4 bytes, 32 bits 0 to 4,294,967,295
long ordinal 8 bytes, 64 bits 0 to 18,446,744,073,709,551,615
byte integer 1 byte, 8 bits -128 to 127
short integer 2 bytes, 16 bits -32,768 to 32,767
integer 4 bytes, 32 bits -2,147,483,648 to 2,147,483,647
long integer 8 bytes, 64 bits -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Table 1's list of types describes unsigned integer types, which don't allow negative numbers, as ordinal types. The C and C++ languages describe such types as unsigned .

A signed integer type on any modern computer uses a two's complement representation for negative numbers: represent –n as (~ n ) + 1, where ~ is the bitwise-not operator. A two's complement representation is slightly asymmetric around zero; the smallest negative number has a magnitude of one more than the largest positive number. In the history of computing, two other representations have been used for negative numbers: one's complement (symmetric with both a positive and negative zero) and signed magnitude , where an integer's sign and magnitude are stored in separate fields.

An n -bit ordinal type can represent integers from 0 to 2n -1.

An n-bit signed integer type can represent integers from -2(n -1 ) to 2(n -1) -1 using a two's complement representation.

As hinted by the names in Table 1, the most common size of integer used in programming today is 32 bits, corresponding to the widespread use of 32-bit registers and address spaces by today's programs and processors. The personal computer industry made the transition from 16 to 32 bits many years ago. Some high-end computer systems are now using 64-bit processors, but 32-bit address spaces should continue to be sufficient for running most applications. The old term word size is less useful today, when even a 32-bit processor may read and write 32-byte (256-bit) cache lines and access its external DRAM in 64-bit or larger bursts.

Some programming languages provide only some of these integer types. Java, for example, has all four signed integer types but only 16-bit unsigned integers. In standard C and C++, the sizes of standard integer types are historically implementation-dependent, though with properties that can be accessed via standard header files such as . With a little effort, you can define ordinal or integer types of your desired sizes and alias them to the supported types for your compiler or environment.

A more pernicious problem with integer operations in C or C++ is that overflows and other errors in fundamental arithmetic operations often go undetected, especially with unsigned types. You can implement your own classes for “safe” ordinal or integer arithmetic, but even inline implementations will have significant performance penalties.

Counting numbers
Humans (and other animals to a limited extent) first discovered integers for counting. An extensive description of that process can be found in Georges Ifrah's comprehensive book, The Universal History of Numbers: From Prehistory to the Invention of the Computer (New York, NY: John Wiley & Sons, 2000).

Counting requires only ordinal numbers. Even a byte value can count the members of many useful sets. For instance, the number of dependents you declare for tax purposes, how many telephone numbers you have, how many motor vehicles you own, and so on.

Software normally counts by initializing a count variable to zero and then incrementing it each time a member of the set being counted is encountered.

Encoding sets
Suppose your business software application deals with several benefit types: “retirement,” “disability,” or “severance.” The software can use character strings and variables in many places throughout the program, comparing those string values when executing conditional logic that depends on benefit types.

As an alternative, your program can encode a benefit type as an integer. A single constant array then provides a lookup table when the corresponding string value is needed for outputs to users. Modern programming languages provide enumeration types for this purpose, though obtaining the equivalent string values is not supported in every programming language. Software-defined encoding allows you to use the same codes in databases and in programs, to add additional enumerated values to categories at run-time, and to support multiple natural languages.

Using integers instead of strings to distinguish among a small set of values improves both storage and execution efficiency by one or two orders of magnitude. The improvement in execution efficiency results from avoiding dynamic memory allocation and function calls for such operations as creating, assigning, and discarding strings.

Character and operation codes
Two other ubiquitous applications for encoding with integers are character sets and operation codes. An n -bit ordinal can encode 2n possible values. Character codes are typically either 8-bit codes that are supersets of ASCII or the newer Unicode characters that can encode any of the world's major scripts, along with many minor written alphabets.

An instruction set, whether for a real processor or a virtual machine, is typically encoded by assigning an ordinal opcode to each possible operation, such as ADD, MOVE, and so on. Other ordinal codes identify machine-supported data types, addressing modes, registers, or literals. Most instruction sets are so densely encoded that several variant instruction formats containing multiple bit fields are used to define the set.

Member identifiers
Consider the automobile license plate. The value of a license plate, say, X2A449, is in its uniqueness, its presence, and its relative stability. No other car on the road (within the same jurisdiction) will have the same license plate value; you can't drive your car without a license plate (it's intended to be a not null field); and you can't change the license plate on your own without updating your jurisdiction's vehicle-registration database. A license plate is a good identifier, which it's designed to be.

When generating a row identifier or a key value in a database, integer fields are normally used. Some database products provide special support for identifiers that generate a new unique integer value when a new row is inserted. Using the typical 32-bit signed integer data type and only using identifier values greater than zero allows more than two billion identifiers.

When you use integers as identifiers, your program normally shouldn't use those values for arithmetic because the values don't have arithmetic meanings. If you numbered your children 1, 2, and 3, you shouldn't assume that adding Child 1 to Child 2 would produce Child 3.

Integer addressing
A computer's address space is an array of storage locations, each containing an unsigned integer and each addressed by a distinct unsigned integer. Most modern processors use byte-addressed address spaces. Larger units, each normally 2n bytes for some ordinal n > 0, are often used to read and write memory or manage paged virtual memory. Aligning these larger units on their natural 2n byte boundaries normally provides better performance—and is required by some processors.

Addressing, addressing modes, and address arithmetic, all using integers, are fundamental to most data structures, such as indirection for managing pointers and links, scaled indexes for accessing arrays, and offsets for accessing structure elements. Associative access to data, based on the values stored, is the main form of access that has little directly to do with addressing but more with the design of hash tables or indexes.

Integers are used for addresses, pointers (the union of a subrange of addresses and a null value), for indexes into array dimensions, and for computed offsets to structure members. When addressing within a byte or ordinal, integers indicate bit positions.

Booleans and bit sets
A Boolean value (for example, a truth or logical value) is normally represented using the integer 0 for false and 1 for true. In some languages, such as Microsoft Visual Basic 6.0, true is represented as -1. A Boolean interpretation can be applied to other integer values; in C and C++ an arbitrary integer is treated as true if it's nonzero and false if it is zero. Some other programming languages do this as well.

A drawback to using a built-in Boolean type can be storage inefficiency when a collection of Boolean values is needed. For example, a C++ compiler may represent the standard bool type as a 32-bit integer. If you need an array of many Booleans, you can reduce the memory required by a factor of 32 by using one bit for each Boolean value and packing 32 (or 16 or 64, depending on your processor) Booleans into each storage unit manipulated. The C++ language provides a bitset template and a vector specialized template that normally will provide such packing for you.

When a set is defined over a type with a small number of values in the type, bit sets contained in ordinals are the most efficient representation for storing and manipulating set values. For example, if you have Fred's schedule, Holly's schedule, and Michael's schedule for next month represented as a bit set indicating their days at work, you can schedule a meeting with all three employees by ANDing the three values and seeing which days are in the resulting set.

Money or decimal types
At first glance, the value of $3.23 + $1.07 = $4.30 is a non-integer value. But in units of cents instead of dollars the calculation becomes 323 + 107 = 430.

There is a tremendous advantage to using integers for calculating with money amounts: integers are exact. Using binary floating-point numbers to carry out exact fractional decimal calculations produces small errors that must be rounded off and eliminated with great care, either by the application programmer or preferably (though with some overhead) within the implementation of a class that provides the desired function.

Microsoft Visual Basic 6.0 provides a 64-bit currency type that provides four fractional decimal digits, with an underlying long integer representation. Four fractional digits provides flexibility in representing other currencies that may not use only two fractional decimals, or in representing the occasional dollar value that includes mills as well as dollars and cents (as in $2.139 per gallon).

For calculations with U.S. dollars and cents, even 32-bit integers can be used to implement many complete applications. A low-end personal finance program can limit its users to amounts of no more than $20,000,000 without significantly impairing its usability.

To implement a fixed decimal type with n fractional digits, use a scale factor of 10n .

Dates and times
If you're not using a programming language that provides the built-in date or time types that you need, you may need to implement such classes or libraries yourself.

A 32-bit ordinal can easily contain any Gregorian date in the range 1/1/100 to 12/31/9999, allocating 16 bits for the year, 8 bits for the month, and 8 bits for the day of the month. If you want to avoid the “Y10K bug” and use bit fields within a 32-bit integer, you can allocate 23 bits for the year, 4 bits for the month, and 5 bits for the day of the month. The latter representation allows years from about 4 million B.C.E. to 4 million years in the future.

A 32-bit ordinal used for time-of-day information can contain 7 bits to indicate a time zone, 5 bits for the hour, 6 bits for the minute, 6 bits for the second, and 7 bits for hundredths of a second, leaving one bit unused.

The combination of the date and time of day can store any time, throughout the entire available date range and including time zone information, in 64 bits.

Computer graphics
A 32-bit ordinal can hold a pixel value for the greatest color depths in widespread use: 24-bit color with 8 bits each of red, green, and blue, along with an 8-bit alpha channel that can carry transparency information for blending images in different layers.

A 32-bit ordinal can also hold pixel coordinates in the x-y plane, 16 bits for an x coordinate and 16 bits for a y coordinate: the 65,536 possible coordinate values in each dimension are ample compared to today's typical screen resolutions (640 by 480 up to 1,600 by 1,200).

Many graphics libraries and graphics cards use single-precision floating-point numbers for coordinates; you probably don't want to choose representations for graphics data that aren't compatible with the representations used by lower-level software and hardware.

Cryptographic keys
Modern cryptographic systems that use public keys are based on large unsigned integers, two large prime numbers forming a key pair. The size of integer used can range from 64 to 1,024 bits.

Once a secure communication session is established using public-key cryptography, symmetric block ciphers are often used for performance reasons. Such ciphers manipulate blocks ranging in size from 64 to 256 bits. A block can be viewed as a bit set manipulated using bit operations on ordinals.

While cryptographic operations are based on integers, the integers used are so large that this particular application of integers doesn't have the same simplicity and efficiency as many of the other applications described in this article.

Packed structures
When you lay out a structure (struct), record, or class in a high-level language, each member is normally allocated as a separate storage unit, ranging in size from one to dozens of bytes or more. As already shown for dates, times, and graphics information, multiple data elements can easily be packed into one integer. The C and C++ languages support bit fields for such packing. For example, the C++ struct definition shown in Listing 1 packs 10 useful pieces of information about a guest into 32 bits.

Listing 1: C++ struct definition collecting 10 useful pieces of information about a guest

struct GuestPreferencesType {

unsigned int handedness : 2; // 0 not known
// 1 right-handed
// 2 left-handed
// 3 ambidextrous
unsigned int smokingPref : 2; // 0 not known
// 1 non-smoker
// 2 cigarettes or cigars
// 3 pipes
unsigned int caffeinePref : 3; // 0 not known
// 1 no caffeine
// 2 tea
// 3 coffee
// 4 regular cola
// 5 diet cola
unsigned int caffeineAdtnPref : 3; // 0 not known
// 4 black
// 5 cream
// 6 sugar
// 7 cream and sugar
unsigned int mornEvePerson : 3; // 0 not known
// 4 normal
// 5 gets up early
// 6 stays up late
// 7 gets up early and stays up late
unsigned int sabbathPref : 3; // 0 not known
// 1 Moody Blues concerts
// 2 no sabbath
// 3 Friday
// 4 Saturday
// 5 Sunday
// 6 Bahai (every nineteen days)
unsigned int gender : 2; // 0 not known.
// 1 female
// 2 male
unsigned int birthYr : 8; // ( – 1880 rep.)
// zero if not known.
unsigned int politicalPref : 4; // 0 not known
// 1 independent
// 2 Democrat
// 3 Republican
// 4 Green
// 5 socialist
// 6 Marxist
// 7 Libertarian
// 8 Reform
// 9 fascist
// 10 technocrat
// 11 Georgist
// 12 anarcho-syndicalist
// 13 anarcho-capitalist
// 14 Likud
// 15 Islamic Jihad
unsigned int maritalSts : 2; // 0 not known
// 1 single
// 2 married
// 3 divorced or widowed

};

Switch to integers
By using integers instead of strings you can both improve performance and reduce storage by one to two orders of magnitude. By using bits or bit fields instead of full-sized integers, you can sometimes save another order of magnitude in storage, with likely no savings and some cost in processing.

While many modern processors have floating-point units and some have special string instructions, general-purpose processors are first and foremost designed to operate on integers. By writing integer-oriented code, you'll write code that can fetch, compute, and store its data in nanoseconds.

The low-level techniques described here don't contradict abstract and object-oriented programming, but improve such programming. The encapsulation that classes and objects provide gives you more, not less, freedom to use efficient data representations because only the code within a class needs to change if representations change.

I recently bought and installed an early version of the game Doom onto a new PC. To my surprise and delight, this complex “first person shooter” with intrica

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.