As Precise as Possible

Dan Saks

April 01, 2002

Both the C and C++ Standards leave gray areas in the language. If you don't know why, those gray areas can look like black holes.

"Everything should be made as simple as possible, but not any simpler."

--Albert Einstein

The C Standard is reasonably precise in specifying the syntax and semantics of C programs. However, it doesn't pin down the exact meaning for every single construct. Every so often, the Standard describes the behavior of some construct as "undefined," "implementation-defined," or "unspecified." (The C++ Standard uses these terms in essentially the same way as the C Standard, so any further mention of "the Standard" applies to C++ as well as C.)

Compilers and linkers catch many erroneous language constructs and reject offending programs before they ever run. Static analyzers, such as lint, catch even more errors at translation time. However, some errors just can't be caught until run time. Unfortunately, run-time error detection often imposes a burden on programs, adding extra code and slowing execution.

To avoid imposing these costs on all programs, the Standard allows that certain erroneous program constructs simply have undefined behavior. For instance, undefined behavior is what you get when your program de-references (applies the unary * operator to) a pointer whose value is invalid or null. It's also the consequence of dividing any integer by zero. Undefined behavior is effectively a license for each compiler to cope with the error as it sees fit. If a compiler generates code that tries to ignore the error, that's fine. Most compilers do that most of the time. If the compiler generates code that intercepts the error at run time, that's fine, too. Or, if it can detect the error at compile time and issue an error message, that's laudable indeed.
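For instance, here's a minimal sketch (my own illustration, not a construct from the Standard) of two such constructs. It compiles cleanly, but what happens when it runs is entirely up to the implementation:

#include <stddef.h>

int main(void)
{
    int *p = NULL;
    int zero = 0;

    int a = *p;         // undefined behavior: dereferencing a null pointer
    int b = 10 / zero;  // undefined behavior: integer division by zero

    return a + b;       // if execution even gets here, the value is meaningless
}

A compiler may generate code that crashes on either statement, traps to a handler, or quietly yields garbage, and it remains conforming in every case.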

In contrast to undefined behavior, implementation-defined and unspecified behaviors are not the consequences of errors. Rather, they are behaviors for well-formed programs that can vary from compiler to compiler. For example, the number of bits in a char is implementation-defined. The result of converting an integer to a pointer or a pointer to an integer is also implementation-defined.
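Here's a small sketch (again my own) that exercises the pointer conversions. It assumes, as many platforms do, that a long is at least as wide as a pointer; the Standard requires only that the implementation define and document what these casts yield:

#include <stdio.h>

int main(void)
{
    int x = 42;
    long n = (long)&x;   // implementation-defined result
    int *p = (int *)n;   // implementation-defined result

    // On many platforms the round trip preserves the address,
    // but each implementation defines its own mapping.
    printf("round trip %s\n", p == &x ? "preserved" : "changed");
    return 0;
}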

The difference between implementation-defined behavior and unspecified behavior is simply that each implementation must document its implementation-defined behaviors but not its unspecified behaviors. In other words, implementation-defined behavior is the compiler's license to translate a valid construct as it sees fit (usually within limits imposed by the Standard) as long as it tells you what it did. Unspecified behavior liberates the compiler from even telling you what it did.

The notions of implementation-defined and unspecified behaviors are the Standard's way of granting each compiler some latitude in translating certain language constructs. That latitude allows each compiler to translate these constructs into code that's most efficient on the target platform. It does not include the license to reject valid constructs.

Why does the Standard make the distinction between implementation-defined and unspecified behavior? Why does it let compilers get away with not documenting certain platform-specific behaviors? I've never found a great explanation of why this is so, but the rationale document drafted by the ANSI C Standards committee gives us a hint. It says, "Behaviors designated as implementation-defined are generally those in which a user could make meaningful coding decisions based on the implementation definition."[1]

Documenting every aspect of a compiler's code generation strategy in a form that would be useful to the average programmer on the street is an immense amount of work. Apparently, the committee wisely decided not to burden compiler writers with the entire chore, and let them focus their attention on aspects of their products that might do more good. The Standard requires documentation only for those platform-specific behaviors that the committee judged to be useful to programmers and reasonably easy to document.

Sometimes, the documentation for implementation-defined behavior takes the form of macro symbols defined in standard headers. For example, every C and C++ implementation documents the number of bits in an object of type char as a macro called CHAR_BIT, defined in the standard headers <limits.h> (for C and C++) and <climits> (for C++ only). These headers define other useful symbols that you can use in combination to determine other implementation-defined behaviors, as in the following example.

C and C++ define three distinct character types: signed char, unsigned char, and plain char. An object of any one of these types must occupy the same amount of storage and have the same alignment requirements as an object of any of the other types. Although it is distinct from the other two types, plain char has the same representation and behavior as either signed char or unsigned char, but that choice is implementation-defined.

The headers <limits.h> and <climits> define, among other things, macros whose values represent the maximum value for each of the character types:

  • SCHAR_MAX, the maximum value for an object of type signed char
  • UCHAR_MAX, the maximum value for an object of type unsigned char
  • CHAR_MAX, the maximum value for an object of type char

Once in a while, you may find yourself writing code that depends on whether plain char is signed or unsigned. You can query these symbols in conditional compilation statements to produce code that will work either way:


#include <limits.h>  // for CHAR_MAX and SCHAR_MAX

#if CHAR_MAX == SCHAR_MAX
// the code here assumes that
// plain char is signed
#else
// the code here assumes that
// plain char is unsigned
#endif
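As a concrete (and purely illustrative) fill-in for that skeleton, here's a small function, call it high_bit_set, whose correct implementation depends on the choice. It reports whether the high-order bit of a char is set, a test that reads differently depending on whether plain char can be negative:

#include <limits.h>

int high_bit_set(char c)
{
#if CHAR_MAX == SCHAR_MAX
    return c < 0;           // plain char is signed: high bit makes it negative
#else
    return c > SCHAR_MAX;   // plain char is unsigned: high bit pushes it past SCHAR_MAX
#endif
}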

Unfortunately, the standard headers don't provide symbols for all implementation-defined behaviors. You just have to read the compiler manual to find out how the compiler fills in the blanks. Of course, once you know what the compiler does, you can define your own symbols to describe particular behaviors. Here's a situation where you might want to define such symbols.

A compiler may pack consecutive bit-fields of a class, struct, or union into a single addressable storage unit. However, the order of allocation of those bit-fields within that storage unit is implementation-defined. For example, given:


struct foo
{
    unsigned int f : 2;
    unsigned int g : 3;
    unsigned int h : 1;
};

a compiler can place f in the most significant bits of the allocated storage, or in the least significant bits.

Suppose you'd like to compile this struct definition on multiple platforms and guarantee that f is always in the most significant bits of the allocated storage. You can create a header file, say platform.h, with definitions for platform-specific characteristics that you can redefine for each platform. For a target platform that places the first bit-field in the most significant bits, platform.h might contain:

#define FIRST_BITFIELD_IN_MSB 1

For a target platform that places the first bit-field in the least significant bits, platform.h might contain:

#define FIRST_BITFIELD_IN_LSB 1

Then you can write a portable definition for the struct as:


struct foo
{
#if defined FIRST_BITFIELD_IN_MSB
    unsigned int f : 2;
    unsigned int g : 3;
    unsigned int h : 1;
#elif defined FIRST_BITFIELD_IN_LSB
    unsigned int h : 1;
    unsigned int g : 3;
    unsigned int f : 2;
#else
#error unknown bitfield allocation policy
#endif
};

The Standard doesn't require compilers to allocate the first bit-field at one end or the other. It simply says that the allocation order is implementation-defined. Other allocation strategies are possible. Thus, this code tests for each known strategy and issues a compile-time error (via an #error directive) if the target doesn't use any of them.
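If the manual doesn't say which policy a compiler uses, you can probe for it. The following sketch (once more, my own) overlays the bit-fields with an unsigned int and looks at where the first field lands. The probe itself relies on implementation-defined layout, and on reading a union member other than the one last written, so treat it as a one-time test to run on each target, not as portable production code:

#include <stdio.h>

union probe
{
    struct
    {
        unsigned int f : 2;
        unsigned int g : 3;
        unsigned int h : 1;
    } bits;
    unsigned int word;
};

int main(void)
{
    union probe p;

    p.word = 0;
    p.bits.f = 3;   // set both bits of the first bit-field

    if (p.word & 3)
        printf("#define FIRST_BITFIELD_IN_LSB 1\n");
    else
        printf("#define FIRST_BITFIELD_IN_MSB 1\n");
    return 0;
}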

In researching this particular item, I discovered a small but interesting disparity between C and C++ regarding bit-fields. Both the C and C++ Standards agree that the allocation order for bit-fields is implementation-defined. But, while the C++ Standard says the alignment of the storage unit holding the bit-fields is also implementation-defined, the C Standard says it's unspecified.

Once again, implementation-defined and unspecified behaviors offer each compiler some latitude in translating programs. The upside of this freedom is that it allows each compiler to translate these constructs into code that's most efficient on the target platform. The downside is that it can create portability problems. Code that yields correct results on one platform may produce surprisingly different results when compiled and executed on other platforms. I'll look at some of the portability problems in the coming months.

Dan Saks is the president of Saks & Associates, a C/C++ training and consulting company. You can write to him at dsaks@wittenberg.edu.

References

1. Hudson, Randy, ed. Rationale for Draft Proposed American National Standard for Information Processing Systems, Programming Language C, 1988.
