Everything you ever wanted to know about C types, Part 3: Implementation details

Enter C99, stage left

The C type system has changed a lot since the 1970s. Part 3 in the Everything you ever wanted to know about C types series reviews some of the quirks that particular implementations have had and discusses the changes the C99 language revision introduced.


Peter Seebach, Freelance author, Plethora.net

Peter Seebach joined the ISO C committee as a hobby some years ago. His favorite type is int. He has never had any of his own code fail to run on 64-bit systems. He would love to hear about errors or omissions in these articles; contact him at developerworks@seebs.plethora.net.



19 March 2006

The previous parts in this series have focused primarily on more general considerations of C types -- things that are portable from one system to another. This article goes deeper into the question of implementation specifics and then discusses the C99 standard and some of the changes it made.

Implementation magic

Many platforms offer extensions to the basic C language. For instance, compilers for PowerPC® processors might feature vector processing extensions (AltiVec support) and have a range of vector types, such as vector unsigned short, which map onto processor-specific functionality. The degree to which such operations are made explicit in the type system can vary quite widely. The C language's type system has been built around the common ground between implementations, but many implementations have features for which the standard C type system offers no direct representation. Another example, found in many DSP processors, is fixed-point math.

Native and emulated types

In early implementations, a C type always corresponded to some native capacity of the hardware. The int type was typically the most convenient native data type for integer math. However, even early on, some types might not represent hardware capabilities. Some 16-bit CPUs did not have a native 32-bit type, so they emulated the long data type with small specialized libraries to perform 32-bit math. Systems with no 64-bit type have used similar code to handle operations on long long. Until comparatively recently, this applied to nearly all systems; meanwhile, some compilers now provide support for 128-bit integers, which are once again being implemented in software.

The performance difference between native and emulated types is usually substantial, so it's worth finding out which types are native on your system. In some cases, a type might be mostly native, but need special handling for a few instructions; for instance, a processor which rounds towards negative infinity on division will need special checks after any division or modulus operation that could potentially involve negative numbers. Other types might be entirely emulated. On some systems, a type you're used to having might require a great deal of compiler overhead. At least one compiler for a 64-bit word-addressable system went to a great deal of work to allow pointers to eight-bit values, which had to be implemented as a pointer plus additional bits, and which were dereferenced by grabbing the word in question and playing bit-shifting games. On such a system, a "clever" algorithm which used character types to access individual bytes within larger words might be dramatically slower than a simpler algorithm operating on whole words using bit shifting and masks.
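As a rough illustration of the "simpler algorithm" idea, here is a sketch that pulls a byte out of a word using only shifts and masks. The name byte_of is made up for this example, and it assumes you want 8-bit pieces of a value that fits in 32 bits; whether it is actually faster than a character pointer depends entirely on the target.

```c
#include <stdint.h>

/* Hypothetical helper: return byte number `index` (0 = least significant)
 * of a 32-bit value, using only whole-word operations. No character
 * pointers are involved, so the result does not depend on byte order or
 * on how the machine addresses individual bytes. */
unsigned byte_of(uint_least32_t word, unsigned index)
{
    return (unsigned)((word >> (8u * index)) & 0xFFu);
}
```

With this sketch, byte_of(0x12345678, 0) picks out the low-order byte and byte_of(0x12345678, 3) the high-order one, on any conforming implementation.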

If you really need a type, and your compiler provides it -- even emulated -- go ahead and use it. Trying to implement it yourself on top of the native types is probably crazy. In particular, the chances that you will do a better job of it than your compiler vendor are very small.

On some systems, the more usual types are the emulated ones. For instance, the SPE processing elements in the Cell Broadband Engine™ (Cell BE) processor have only 128-bit registers, used as vectors. Access to smaller objects requires extra work, either from the developer or the compiler. (For some examples of how a compiler might deal with this, see the developerWorks tutorial, "An introduction to compiling for the Cell Broadband Engine architecture, Part 2: Optimizing for the SPE.")

Historical implementations

The first implementation of C was for the PDP-11, followed by the Honeywell 635 and the IBM 360/370. The VAX 11/780, while one of the most influential early C platforms, was only targeted after the language had matured quite a bit. On early systems, there were few guarantees about types: the char, short, and long types reflected the native word architecture of the machine, and int was whatever the processor found most convenient.

Ports to other architectures, such as the 68000 and 80x86, tended to adopt the same conventions as these early systems. The VAX 11/780 port, in particular, was very influential. Although the first C compilers supported 16-bit ints, many early C programmers (probably influenced by the VAX) carelessly assumed that the int type was always interchangeable with the long type, despite dire and well-considered warnings from C luminaries such as Henry Spencer, whose 10 Commandments for C Programmers say:

10 Thou shalt foreswear, renounce, and abjure the vile heresy which claimeth that "All the world's a VAX," and have no commerce with the benighted heathens who cling to this barbarous belief, that the days of thy program may be long even though the days of thy current machine be short.
This particular heresy bids fair to be replaced by "All the world's a Sun" or "All the world's a 386" (this latter being a particularly revolting invention of Satan), but the words apply to all such without limitation. Beware, in particular, of the subtle and terrible "All the world's a 32-bit machine," which is almost true today but shall cease to be so before thy resume grows too much longer.

The perils of the "All the world's a VAX" assumptions came home to roost with the rise in popularity of the Intel 8086 and its successors, which at first used 16-bit ints. But in fact, early 386 programming environments could be incompatible, not only with each other, but with themselves; some compilers offered the option of choosing whether to use 16-bit or 32-bit values for int variables. Porting code written by careless programmers on 32-bit systems could be nightmarish; I once spent roughly a week fixing a SPARC-native implementation of RPC that had been hacked into running on Microsoft® Windows®; it needed to run on both 16-bit and 32-bit Windows. If the code had been written to the standard, instead of to a particular processor, it would have taken an afternoon at most. (The entire afternoon would have been spent finding the compiler flag for "generate code which can be called from outside this library.")

Trap representations

C89 had a brief reference in the description of undefined behavior to indeterminately valued objects. In C99, this wording was clarified and expanded into what are now called trap representations. A trap representation is a set of bits which, when interpreted as a value of a specific type, causes undefined behavior. Trap representations are most commonly seen on floating point and pointer values, but in theory, almost any type could have trap representations. An uninitialized object might hold a trap representation. This gives the same behavior as the old rule: access to uninitialized objects produces undefined behavior.

The only guarantees the standard gives about accessing uninitialized data are that the unsigned char type has no trap representations, and that padding has no trap representations.
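The unsigned char guarantee is what makes it safe to look at the raw bytes of any object. A minimal sketch follows; the helper name count_nonzero_bytes is hypothetical, and the example deliberately inspects initialized data only.

```c
#include <stddef.h>

/* Hypothetical helper: count the nonzero bytes in any object.
 * Reading an object's storage through unsigned char is always permitted,
 * because that type has no trap representations. */
size_t count_nonzero_bytes(const void *object, size_t size)
{
    const unsigned char *bytes = object;
    size_t count = 0;

    for (size_t i = 0; i < size; i++) {
        if (bytes[i] != 0)
            count++;
    }
    return count;
}
```

The same loop written with a pointer to float or to a pointer type would not carry that guarantee, since some bit patterns might not be valid values of those types.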

Pointers which refer to freed memory (or to automatic variables which have gone out of scope) become indeterminate, and might become trap representations. This is to accommodate processors on which some amount of validation of addresses occurs when an address register is loaded. The indeterminacy of pointers to freed memory is also, however, very useful for programs which try to provide some level of checking for possibly buggy code. Any reference to an indeterminate value is a bug. Having it caught is better than having it silently ignored. According to comp.lang.c regulars, on some implementations, a pointer to memory obtained by malloc() and subsequently released by free() might compare equal to a null pointer. This might seem impossible, but the standard allows for it. (To the best of my knowledge, it only happens on segmented architectures, where the compiler can mark an entire segment as "freed space," and thus necessarily invalid, and check for this when comparing pointers.)

Some compilers, or development tools, provide pointer implementations which check for buffer overflows, access to freed memory, and other errors. Some of their checks, such as checks for access to uninitialized values, conform with the standard because of the trap-representations rule. Some programs go further and warn about access to indeterminate values even when they are accessed as unsigned char objects, which have no trap representations. Such access isn't undefined behavior, but it's still nice to get the warning.

Endianness

Many people are aware that, in general, systems might be "big-endian" or "little-endian." These terms denote the way in which consecutive bytes of data storage are arranged in memory. On a big-endian system, the first byte of a word will be the most significant one, and the last will be the least significant. On a little-endian system, it's the other way around.

Programmers accustomed to one variety or the other might develop bad habits. Big-endian ordering is the canonical byte order for network data, such as TCP/IP packets. Some users on big-endian systems do not remember to convert values to network byte order before putting them in a packet -- which seems harmless until the code is tried on another system. On the other hand, little-endian users sometimes get in the habit of treating a pointer to a word as a pointer to a single byte, to extract the low-order bits. Users on both types of systems often write binary data in machine order without considering the problem of reading the resulting files later.
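One way to avoid all three habits is to never copy a word's bytes directly, and instead build the external representation with shifts. The sketch below assumes a 32-bit value stored in network (big-endian) order; the names put_be32 and get_be32 are invented for this example.

```c
#include <stdint.h>

/* Hypothetical helper: store a 32-bit value in network (big-endian)
 * byte order, most significant byte first. */
void put_be32(unsigned char out[4], uint_least32_t value)
{
    out[0] = (unsigned char)((value >> 24) & 0xFFu);
    out[1] = (unsigned char)((value >> 16) & 0xFFu);
    out[2] = (unsigned char)((value >> 8) & 0xFFu);
    out[3] = (unsigned char)(value & 0xFFu);
}

/* Hypothetical helper: read the value back, whatever the host order is. */
uint_least32_t get_be32(const unsigned char in[4])
{
    return ((uint_least32_t)in[0] << 24) |
           ((uint_least32_t)in[1] << 16) |
           ((uint_least32_t)in[2] << 8)  |
            (uint_least32_t)in[3];
}
```

Because the code only talks about values, never about how the host lays them out in memory, it behaves the same on big-endian, little-endian, and stranger machines.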

Some implementations have been "middle-endian" -- where the bytes of a two-byte word were in little-endian order, but the words of a double-word were in big-endian order. Some implementations even support switching modes; for instance, most PowerPC systems can do either little-endian or big-endian math. This is a feature you can't access from portable C, but a library vendor might take advantage of it. Such an architecture is sometimes called "bi-endian" or "open-endian."

Weird stuff

Of course, the diversity of type implementations in C wasn't all arguments about whether int was exactly the same as short or exactly the same as long. Some systems had word sizes that were not precise multiples of 32 bits. C worked fine on them, but a lot of code written in C didn't. Among the more interesting are 60-bit systems (CDC Cyber) which are only addressable by full words; on such a system, char is 60 bits.

As the computer market has tended towards commodity hardware and more similar designs, the importance of flexibility in the type system has diminished. The C89 standard made fewer guarantees about types, in the interests of protecting compiler writers targeting unusual platforms. The C99 standard gives many more guarantees.


C99 type revisions

C99 made a number of changes to the C type system. The widespread use of long long was codified, in a way better integrated into the type system than it had been in existing practice. However, C99 also introduced a number of types designed to give users the ability to specify requirements for types a little better.

First, C99 introduces new types specifically for common user requirements. The type system problems that came up during the life of C89, such as disagreements about whether long was guaranteed to be the largest type, or to be able to hold a pointer, were addressed by adding types specifically for these purposes -- for instance, an integer large enough to hold a pointer (intptr_t), or an integer at least as large as any other integer (intmax_t). These types are declared in the new <stdint.h> header, which the new <inttypes.h> header includes.

These headers also provide definitions of a number of types which might or might not be provided by all implementations, such as exact-width integer types. In general, the name intN_t indicates an object with exactly N bits; however, not every system provides every possible bit width.
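Here is a small sketch of how these types tend to get used. The function names are invented for this example; the types, limit macros, and the PRIdMAX format macro are the standard C99 ones.

```c
#include <inttypes.h>   /* includes <stdint.h>, adds format macros */
#include <stdio.h>

/* Hypothetical helper: print any signed integer value, whatever its
 * original type, by widening it to intmax_t. Returns what snprintf does. */
int format_intmax(char *buf, size_t size, intmax_t value)
{
    return snprintf(buf, size, "%" PRIdMAX, value);
}

/* int_least16_t exists on every C99 implementation; int16_t exists only
 * where an exact 16-bit type is available. Hypothetical helper that
 * clamps a long into the guaranteed type. */
int_least16_t clamp_to_least16(long value)
{
    if (value > INT_LEAST16_MAX)
        return INT_LEAST16_MAX;
    if (value < INT_LEAST16_MIN)
        return INT_LEAST16_MIN;
    return (int_least16_t)value;
}
```

The point of the second function is that it never needs to know how wide int_least16_t really is; the limit macros carry that information.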

Stronger specifications

C99 generally provides more complete specifications than C89 did. One particular example of the greater amount of specification is the handling of negative operands to the modulus and division operators. In C89, it was up to the compiler to decide whether -3/5 was -1 with a remainder of +2, or 0 with a remainder of -3. This allowed implementations to use their native division and remainder or modulus operators (whichever way they worked), but imposed substantial additional work on developers, who had to use extra code to ensure that they got the particular result their algorithm required. C99 was more willing than C89 to specify such things, and in C99, the result of a division operator truncates towards zero, so -3/5 is 0 with a remainder of -3.

Two major factors made it practical to pin this behavior down: processors are faster (and also more likely, it turns out, to use the behaviors C now specifies), and compiler technology has improved. A modern compiler, even compiling for a system where division doesn't work as specified, is more likely to be able to optimize out the extra tests in cases where they aren't really necessary.
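To make the C99 rule concrete: -3 / 5 is 0 and -3 % 5 is -3 on every conforming C99 implementation. An algorithm that wants the other convention can build it on top. The sketch below uses a made-up helper name, floor_mod, and assumes the divisor is positive.

```c
/* Hypothetical helper: a non-negative remainder for a positive divisor.
 * Relies on the C99 guarantee that a % b has the sign of a (or is zero),
 * so one adjustment is enough. Assumes b > 0. */
int floor_mod(int a, int b)
{
    int r = a % b;
    return (r < 0) ? r + b : r;
}
```

Under C89, the same helper would still give the right answer, but only because the test also happens to cover implementations that rounded the other way; in C99 you know exactly which branch does the work.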

Complex math

C99 adds support for complex numbers, using three new types, all in the _Complex type family (float _Complex, double _Complex, and long double _Complex), as well as a complete set of operations on them. When converting a real type to complex, the imaginary part is zero (positive zero, if it matters), and when converting a complex type to real, the imaginary part is discarded. The standard floating point arithmetic operations work on complex numbers.

Complex types must be supported in hosted implementations, but may be omitted in freestanding implementations. Pure-imaginary types are also described, but are optional on all platforms.
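A brief sketch of the conversion rules and ordinary arithmetic, assuming a hosted implementation that provides <complex.h>. The function names are hypothetical; complex and I are the standard macros from that header.

```c
#include <complex.h>

/* Converting a real value to complex gives a zero imaginary part. */
double complex from_real(double x)
{
    return x;               /* implicit conversion: x + 0i */
}

/* Converting complex to real discards the imaginary part. */
double to_real(double complex z)
{
    return (double)z;
}

/* The ordinary arithmetic operators accept complex operands. */
double complex square(double complex z)
{
    return z * z;
}
```

For example, squaring 1.0 + 2.0 * I with this sketch yields -3.0 + 4.0 * I, and passing that result to to_real leaves just -3.0.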

Hosted and freestanding

The C standard defines two types of implementations. A hosted implementation presumes services of the sort normally provided by an operating system, while a freestanding implementation does not. Much of the standard library is optional on a freestanding implementation, but required on a hosted implementation. Freestanding implementations are typically intended for more bare-bones, embedded environments, and in some cases are given dispensations to omit parts of the more complete type system.

Boolean types

C99 introduces a boolean type, _Bool. The boolean type has exactly two values: 0 and 1, known also as false and true, respectively. When a value of any other type is converted to boolean type, the resulting value is 0 if the other value compared equal to 0, and 1 otherwise. Note that pointers may be converted to the boolean type without a cast; the result is 1 if the pointer was not a null pointer, and 0 if the pointer was a null pointer. The values yielded by the comparison and logical operators still have type int rather than the boolean type, in contrast to C++; the committee reasoned that changing this would potentially affect too much existing code.
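The conversion rule is easiest to see next to an ordinary integer conversion. This is a minimal sketch with invented function names; bool, true, and false come from <stdbool.h>.

```c
#include <stdbool.h>

/* Any value that compares unequal to 0 converts to 1. Contrast this
 * with (int)0.5, which truncates to 0. */
bool to_bool(double x)
{
    return x;
}

/* A pointer converts without a cast: a null pointer gives false,
 * any other pointer gives true. */
bool is_set(const void *p)
{
    return p;
}
```

So to_bool(0.5) is true even though 0.5 converted to int would be 0. Meanwhile, an expression such as 1 < 2 still has type int, so sizeof(1 < 2) equals sizeof(int).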

Optional types

Not all systems provide all the new C99 types. If the implementation does not provide a type, the implementation must not define the corresponding macros specifying its range, called limit macros. If the implementation does provide a type, it must also provide the corresponding limit macros. Because these macros exist only when a given feature is available, they are called feature test macros. For instance, if the intptr_t type is available on an implementation, the macro INTPTR_MAX must also be defined. If the implementation doesn't provide that type (for instance, if it has pointers that are too large to be represented in any supported integer type), the macro must not be defined.

The exact-width types (intN_t) are optional, although most systems can obviously provide the exact-width 8, 16, 32, and 64-bit types. On the other hand, the need for exact-width types is rare. In most code, the use of the "at least" types is more appropriate; in practice, this means using short, int, and long.

Note that the types enumerated in the standard are not necessarily an exhaustive list. An implementation is free to provide int_least128_t if it wishes, or even int37_t, if it can. The requirement is that the feature test macros be defined only if the given type actually works analogously to the standard types; so for instance, uint11_t must wrap around from 2047 to 0 again, if it is provided at all. The intN_t types are required to use a two's complement representation.
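In practice, the feature test is just an #ifdef on the limit macro. The sketch below uses hypothetical names (word32, have_intptr); the macros being tested are the standard ones.

```c
#include <stdint.h>

/* Use the exact-width type where it exists, and fall back to the
 * always-available "at least" type where it does not. */
#ifdef INT32_MAX
typedef int32_t word32;
#else
typedef int_least32_t word32;
#endif

/* Returns 1 if this implementation provides intptr_t, 0 otherwise. */
int have_intptr(void)
{
#ifdef INTPTR_MAX
    return 1;
#else
    return 0;
#endif
}
```

Code that only needs "at least 32 bits" could skip the test entirely and use int_least32_t (or plain long) everywhere.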

Design goals of C99

The C89 type system was pretty good. Unfortunately, programmers who had written code which depended on the knowledge that int and long were the same type, or that long was exactly 32 bits, made it economically necessary to introduce a new type for 64-bit integers on some platforms, and indeed, support for it quickly showed up on nearly all platforms, even those where the type was emulated.

As a result, the new long long type was introduced. This created a problem. In the C89 standard as written, long was the largest integer type. Any value in any (signed) integer type could be stored in an object of type long. The long long type introduced an exception. This was not the worst part, though.

In C89, the type promotion rules for constants were that, if an integer constant was not specifically labeled as being unsigned, it would be signed if possible. If the value could be represented in an int, it was of type int. If it was too large for int, but could be represented as a long, it became a long. If it could be represented only as unsigned long, then it was an unsigned long. But what should be done with larger values? On a system with a 32-bit long, a constant 3 billion (3000000000) was unambiguously an unsigned long in C89, but might well be a signed long long in C99. Making large constants suddenly signed is confusing, but having a region of unsigned constants sandwiched between two regions of signed constants (0-2 billion and 4 billion and up, roughly) would be pretty bad, too. Either decision would be surprising to at least some people.

In C99, the decision was made to standardize the rules so that values promote in a more predictable way. Decimal constants which don't have a suffix indicating an unsigned type are always signed; octal or hexadecimal constants can be either signed or unsigned, and will be of the smallest type large enough to hold the value.
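The difference shows up as soon as you do arithmetic on the constants. The function names below are invented, and the hexadecimal case is guarded because it only applies where unsigned int is exactly 32 bits wide; this sketch assumes a C99 or later compiler.

```c
#include <limits.h>

/* In C99 an unsuffixed decimal constant is always signed, so this
 * subtraction yields -1 instead of wrapping to a huge unsigned value. */
int decimal_constant_is_signed(void)
{
    return (3000000000 - 3000000001) < 0;
}

/* A hexadecimal constant too big for int may be unsigned. Where unsigned
 * int is 32 bits, 0xFFFFFFFF has type unsigned int, so adding 1 wraps
 * around to 0. Returns -1 where that assumption does not hold. */
int hex_constant_wraps(void)
{
#if UINT_MAX == 0xFFFFFFFF && INT_MAX < 0xFFFFFFFF
    return (0xFFFFFFFF + 1) == 0;
#else
    return -1;
#endif
}
```

Writing the same number in decimal (4294967295 + 1) gives 4294967296 instead, because the decimal constant takes a signed type wide enough to hold it.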

The new rules, especially the formal definition of what makes a type "wider," are designed to give some insurance against future confusions. The new intmax_t type offers a type guaranteed to be large enough to hold any integer value the implementation can represent. This eliminates the overlap where long was both the only type which was definitely at least 32 bits, and was also to be the largest type, able to hold any integer value. Providing a separate type whose only function is to be the largest type allows implementations to offer extensions, even beyond the size of long long, while leaving users a standardized way to store even these numbers.

The intptr_t type is an attempt at the best possible compromise between programmers' occasional need (or at least desire) to store pointers in integer variables, and the practical difficulties of doing so portably. Some APIs have used such a feature extensively, generally in innovatively nonportable ways.

Even when an implementation provides it, intptr_t might not be able to store function pointers properly. This is because, on some platforms, function pointers are more elaborate than standard object pointers, and there might be no integer type with enough storage to hold one.
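For object pointers, the one thing you can rely on is the round trip. A sketch, with a made-up function name, guarded by the feature test macro since the type is optional:

```c
#include <stdint.h>

#ifdef INTPTR_MAX
/* Hypothetical helper: pass an object pointer through intptr_t.
 * C99 guarantees that a valid pointer to void converted to intptr_t and
 * back compares equal to the original. It makes no such promise for
 * function pointers. */
void *round_trip(void *p)
{
    intptr_t as_integer = (intptr_t)p;
    return (void *)as_integer;
}
#endif
```

Anything you do to the integer in between, such as arithmetic or stashing flag bits in it, is outside that guarantee.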

The unusual names of the new _Bool and _Complex types are a reaction to the very real possibility that existing code has defined its own Boolean and complex types, perhaps using the obvious typedef names bool and complex. Therefore, the new type names _Bool and _Complex are selected from the namespace reserved for implementation use, and should thus not clash with names in existing code. Only when the corresponding new header files <stdbool.h> and <complex.h> are included does the implementation supply the more obvious bool and complex spellings, as macros that expand to the reserved names. Thus, the name clash only takes effect when you start including a new C99 header, so existing code isn't broken.

C99 also added variable-length arrays (VLAs) and support for structures whose last member is an array of unspecified length (flexible array members). Both are performance and convenience features based on widely implemented extensions in existing compilers; Part 2 discussed them in more detail.
