# Floating-point data in embedded software

Although many embedded applications can be implemented using integer arithmetic, there are times when the ability to deal with floating point (real) numbers is required. This article looks at the details of floating point operations, when floating point should and should not be used, some of the pitfalls of its use and how its use may sometimes be avoided.

**Floating Point and Integers**

Nowadays, most embedded systems are built using 32-bit CPUs. These devices give plenty of scope for performing the arithmetical processing required for various applications. Calculations can be performed on signed or unsigned integers and 32 bits gives a good range of values: +/- 2 billion or up to 4 billion respectively. Extending to 64 bits is reasonably straightforward.

If you need to stray outside of these ranges of values or perform more sophisticated operations, then you need to think in terms of floating point and this presents a selection of new challenges.

The concept of a floating point number is simple enough – the value is stored as two integers: the mantissa and the exponent. The number represented is the mantissa multiplied by 2 to the power of the exponent. Typically, these two integers are stored in bit fields in a 32-bit word, but higher precision variants are also available. The most common format is IEEE 754-1985.

The clear benefit of using floating point is the wide range of values that may be represented, but this comes at a cost of extra care when coding and some trade-offs:

**Performance**. Floating point operations take a lot of time compared with integer operations. If the processing is done in software, the execution time can be very long indeed. Hardware floating point units speed up operations to a reasonable extent.

**Precision**. Because of the way that values are represented in floating point, a value may not be exactly what you expect. For example, you may anticipate a variable having the value 5.0, but it is actually 4.999999. This need not be a problem, but care is needed in coding with floating point.

**Coding with Floating Point**

Because of the intrinsic lack of absolute precision in floating point operations, code like this would clearly be foolish:

if (x == 3.0) ...

as **x** may never be precisely 3.0.
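One common alternative is to test whether the value is close enough to the target. The following is a minimal sketch only: the helper name `nearly_equal` and the idea of passing an absolute tolerance are assumptions for illustration, and a suitable tolerance depends entirely on the application.

```c
#include <stdbool.h>

/* Hypothetical helper: true when a and b differ by no more than
   tolerance. The tolerance is application-specific (an assumption
   of this sketch, not a universal constant). */
bool nearly_equal(float a, float b, float tolerance)
{
    float diff = a - b;

    if (diff < 0.0f) {
        diff = -diff;           /* absolute value without <math.h> */
    }
    return diff <= tolerance;
}
```

The earlier test could then be written as `if (nearly_equal(x, 3.0f, 0.0001f)) ...`. An absolute tolerance like this suits values of known magnitude; data spanning a wide range may need a tolerance scaled to the operands.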

Similarly, coding a loop like this might produce unexpected results:

for (x=0.0; x<5.0; x++) ...

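You would expect the loop to be performed 5 times, for **x** values 0.0, 1.0, 2.0, 3.0 and 4.0. With a step of exactly 1.0 that is what happens, because small whole numbers are represented exactly. With a fractional step such as 0.1, however, rounding errors accumulate, and the running value may fall just short of the limit (0.999999 rather than 1.0, say), causing an extra iteration.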

The solution is to use an integer loop counter:

for (i=0,x=0.0; i<5; i++,x++) ...
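The same idea carries over to fractional steps. In the sketch below, the function name `fill_ramp` and its parameters are illustrative assumptions. The integer counter alone decides how many iterations run, and each value is computed from the counter rather than accumulated, so rounding errors do not build up from one pass to the next.

```c
#include <stddef.h>

/* Fill out[0..count-1] with start, start + step, start + 2*step, ...
   The integer i controls the iteration count; each value is derived
   from i, so it carries only the rounding of one multiply and add. */
void fill_ramp(float *out, size_t count, float start, float step)
{
    for (size_t i = 0; i < count; i++) {
        out[i] = start + (float)i * step;
    }
}
```

Calling `fill_ramp(buffer, 10, 0.0f, 0.1f)` always writes exactly ten values, whatever the rounding behavior of 0.1.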

**Binary Floating Point**

Most embedded developers understand the binary representation of numbers, even if many are less than 100% comfortable with its use on a daily basis. Binary integers are easy enough: the digits, going from right to left, represent 2^0, 2^1, 2^2, 2^3 and so on. So, the number 10011 is 1 + 2 + 16 = 19.

It is less common to see floating point numbers represented in binary, but just as useful to understand how they work. The left of the binary point (not the decimal point now!), the whole part of the number, is represented like a binary integer. That is obvious. The more confusing part is to the right of the binary point, the fractional part of the number. Here the digits represent 2^-1, 2^-2, 2^-3, 2^-4 (1/2, 1/4, 1/8, 1/16) and so on. So, the number 0.11011 is 0.5 + 0.25 + 0.0625 + 0.03125 = 0.84375.

If you want to play with such numbers, here is a simple C function to display a float (which is less than 1) in binary:

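    #include <stdio.h>

    /* Display a float in the range 0.0 <= x < 1.0 as a binary fraction. */
    void showbinaryfp(float x)
    {
        float y = 0.5;          /* weight of the current binary digit */

        if (x >= 1.0)
            return;

        printf("0.");
        while (x > 0.0)
        {
            if (x >= y)
            {
                printf("1");
                x -= y;
            }
            else
            {
                printf("0");
            }
            y /= 2.0;           /* move one digit to the right */
        }
        printf("\n");
    }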
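Going the other way, the digit weights described above can be summed to turn a string of binary fraction digits back into a value. This is a small sketch; the function name `binary_fraction` is an assumption for illustration.

```c
/* Convert the digits after a binary point (e.g. "11011") to their
   value. Any character other than '0' or '1' ends the conversion. */
double binary_fraction(const char *digits)
{
    double value = 0.0;
    double weight = 0.5;        /* 2^-1 for the first digit */

    while (*digits == '0' || *digits == '1') {
        if (*digits == '1') {
            value += weight;
        }
        weight /= 2.0;          /* next digit is worth half as much */
        digits++;
    }
    return value;
}
```

With the example from the text, `binary_fraction("11011")` gives 0.84375.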

**IEEE 754-1985 Format**

There are countless ways that floating point numbers might be represented in a computer. For example, the binary point could be located at an arbitrary point in a 32-bit word and the binary pattern of digits interpreted accordingly. So, if the bottom 8 bits were to be designated the fraction, the value **0x00000280** would represent 2.5 (in decimal). Here it is in binary (I have included the binary point):

**0000 0000 0000 0000 0000 0010 . 1000 0000**
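To make this fixed binary point layout concrete, here is a minimal sketch in C. It assumes an unsigned word with 24 whole bits and 8 fraction bits as in the example above; the macro and function names are illustrative, not part of any standard library.

```c
#include <stdint.h>

#define FRACTION_BITS 8     /* bottom 8 bits hold the fraction */

/* Interpret a raw 32-bit word as an unsigned 24.8 value. */
double fixed_to_double(uint32_t raw)
{
    return (double)raw / (double)(1u << FRACTION_BITS);
}

/* Addition is plain integer addition: the binary points line up. */
uint32_t fixed_add(uint32_t a, uint32_t b)
{
    return a + b;
}

/* The raw product has 16 fraction bits, so shift right by 8 to get
   back to 24.8. The 64-bit intermediate keeps the product from
   overflowing before the shift. */
uint32_t fixed_mul(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> FRACTION_BITS);
}
```

Here `fixed_to_double(0x00000280)` yields 2.5, and `fixed_mul(0x280, 0x200)` (2.5 times 2.0) yields `0x500`, which is 5.0 in the same format.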

However, this approach, whilst superficially simple and straightforward, is rather inflexible and limits the range of values that may be represented. Mathematicians and scientists use a format for floating point numbers that is commonly called “Scientific Format”, where a value greater than or equal to 1.0 but less than 10.0 (the mantissa) is multiplied by a power of 10 (the exponent). So, 1234 would be shown as 1.234 x 10^3, commonly written 1.234e3. The same approach is normally used for computer representation of floating point, except that the base of the exponent is 2 and the mantissa is normalized to a fixed binary range: in IEEE 754 it is at least 1.0 but less than 2.0, although some older formats kept it below 1.0.

Historically, when all floating point operations were performed by software, a wide variety of format variations were in use, which were defined by computer manufacturers and compiler developers. Hardware floating point followed the same path initially, with each manufacturer offering their own variant. Nowadays, however, floating point format is standardized and IEEE 754-1985 is used almost universally. Again, it is useful to understand the principles of floating point formats, even if this knowledge is not exploited every day.

The standard describes both single-precision (32-bit) and double-precision (64-bit) variants. The discussion here will be confined to single-precision; double-precision uses exactly the same ideas.

A floating point value is represented by three fields: sign (1-bit), exponent (8-bit) and the mantissa (23-bit).

To formulate a number, the fields are employed as follows:

- The mantissa field is set to a value so that there is always a 1 before the binary point, so this is omitted, thus gaining an extra bit of precision. (The value 0 is an exception.)
- The exponent is set to a value which is the necessary exponent plus a bias of 127.
- The sign is set to 0 for positive numbers or 1 for negative.

For example, 14.25 (decimal) is 1110.01 (binary). This can be rewritten 1.11001 x 2^3 (binary). So, the floating point fields are assigned:

- The mantissa is .11001 after removing the leading 1.
- The exponent is 3 + the bias of 127, which is 130 (10000010 in binary).
- The sign is 0 as the number is positive.

The resulting floating point representation looks like this:

**0 | 1000 0010 | 110 0100 0000 0000 0000 0000**

If this 32-bit value were displayed in hex, it would be **0x41640000** .
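The worked example can be checked on a real machine. The sketch below assumes that `float` is a 32-bit IEEE 754 single-precision value, which is usual but not guaranteed by the C standard; the structure and function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a single-precision value. */
struct float_fields {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t mantissa;  /* 23 bits, leading 1 omitted */
};

/* Copy the bytes of a float into an integer. memcpy avoids the
   aliasing problems of casting a float pointer to an integer
   pointer. Assumes a 32-bit IEEE 754 float. */
uint32_t float_bits(float x)
{
    uint32_t bits;

    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Separate the raw bits into sign, exponent and mantissa fields. */
struct float_fields split_float(float x)
{
    uint32_t bits = float_bits(x);
    struct float_fields f;

    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

Under that assumption, `float_bits(14.25f)` returns `0x41640000`, and `split_float(14.25f)` gives a sign of 0, an exponent of 130 and a mantissa field of `0x640000`.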

**Conclusions**

Broadly speaking, floating point should only be used if it is essential and only after every creative way to do the calculations using integers has been investigated and eliminated.
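As one illustration of doing a calculation with integers, consider converting an ADC reading to a voltage. The sketch below is hypothetical throughout: the 12-bit converter, the 3.3 V reference and the function name are assumptions made for the example, not details of any particular device. By working in millivolts rather than volts, the fractional part is carried in an integer.

```c
#include <stdint.h>

#define ADC_FULL_SCALE  4095u   /* assumed 12-bit converter */
#define VREF_MILLIVOLTS 3300u   /* assumed 3.3 V reference */

/* Convert a raw ADC count to millivolts using only integer
   arithmetic. Adding half the divisor before dividing rounds the
   result to the nearest millivolt instead of truncating. */
uint32_t adc_to_millivolts(uint32_t raw)
{
    return (raw * VREF_MILLIVOLTS + ADC_FULL_SCALE / 2u) / ADC_FULL_SCALE;
}
```

For counts within the assumed 12-bit range the intermediate product fits comfortably in 32 bits, so no wider type is needed.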

**Colin Walls** has over thirty years' experience in the electronics industry, largely dedicated to embedded software. A frequent presenter at conferences and seminars and author of numerous technical articles and two books on embedded software, Colin is an embedded software technologist with Mentor Embedded [the Mentor Graphics Embedded Software Division], and is based in the UK. His regular blog is located at: http://blogs.mentor.com/colinwalls. He may be reached by email at

“That's what caused the Patriot Missile bug which killed 28 people. http://www.ima.umn.edu/~arnold//disasters/patriot.html”

“I had forgotten about that disaster. Excellent example to illustrate my point. Thanks.”

“You may know Fixed Point Types (http://www.ada-auth.org/standards/12rm/html/RM-3-5-9.html). You can leave the work to the compiler.”

“Nice article. I hope I'm not being too picky, but just a couple of minor technical details (maybe highlighting the whole point about being careful with floating point): * The programming example was slightly unfortunate: for (x=0.0; x<5.0; x++

“@LTrammel – I concur with all your points.”

“If you are writing floating point code for embedded systems, be /very/ careful about your floating point literals. If you write 5.0, then you have a double-precision “double”. If you want a single-precision “float”, then write 5.0f. On a microcon