The previous parts in this series have focused primarily on more general considerations of C types -- things that are portable from one system to another. This article goes deeper into the question of implementation specifics and then discusses the C99 standard and some of the changes it made.
Many platforms offer extensions to the basic C language. For instance,
compilers for PowerPC® processors might feature vector processing extensions
(AltiVec support) and have a range of vector types, such as vector unsigned short, which map
onto processor-specific functionality. The degree to which such
operations are made explicit in the type system can vary quite widely.
The C language's type system has been built around the common ground
between implementations, but many implementations have features for which
the standard C type system offers no direct representation. Another example,
found in many DSP processors, is fixed-point math.
In early implementations, a C type generally corresponded to some native
capacity of the hardware. The int type was typically the most
convenient native data type for integer math. However, even early on,
some types might not represent hardware capabilities. Some 16-bit CPUs
did not have a native 32-bit type, so they emulated the long data
type with small specialized libraries to perform 32-bit math. Systems with
no 64-bit type have used similar code to handle operations on long long. Until comparatively recently, this applied
to nearly all systems; meanwhile, some compilers are providing support for
128-bit integers, which are once again being implemented in software.
The performance difference between native and emulated types is usually substantial, so it's worth finding out which types are native on your system. In some cases, a type might be mostly native, but need special handling for a few instructions; for instance, a processor which rounds towards negative infinity on division will need special checks after any division or modulus operation that could potentially involve negative numbers. Other types might be entirely emulated. On some systems, a type you're used to having might require a great deal of compiler overhead. At least one compiler for a 64-bit word-addressable system went to a great deal of work to allow pointers to eight-bit values, which had to be implemented as a pointer plus additional bits, and which were dereferenced by grabbing the word in question and playing bit-shifting games. On such a system, a "clever" algorithm which used character types to access individual bytes within larger words might be dramatically slower than a simpler algorithm operating on whole words using bit shifting and masks.
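As a rough illustration of the whole-word approach, the sketch below reads and replaces individual 8-bit fields of a word using only shifts and masks, never a character pointer. The names get_octet and set_octet are hypothetical, and the code assumes that n is small enough for the field to lie within the 32 bits that unsigned long is guaranteed to have.

```c
/* Hypothetical helper: return the n-th 8-bit field of word, counting
 * from the least significant end (n = 0). Only shifts and masks are
 * used, so no pointer into the middle of a word is ever formed. */
unsigned long get_octet(unsigned long word, unsigned n)
{
    return (word >> (8 * n)) & 0xFFUL;
}

/* Hypothetical helper: return a copy of word in which the n-th 8-bit
 * field has been replaced by the low 8 bits of value. */
unsigned long set_octet(unsigned long word, unsigned n, unsigned long value)
{
    unsigned long mask = 0xFFUL << (8 * n);

    return (word & ~mask) | ((value & 0xFFUL) << (8 * n));
}
```

Whether this is actually faster than character access depends entirely on the platform; on most byte-addressable systems the compiler generates comparable code either way.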
If you really need a type, and your compiler provides it -- even emulated -- go ahead and use it. Trying to implement it yourself on top of the native types is probably crazy. In particular, the chances that you will do a better job of it than your compiler vendor are very small.
On some systems, the more usual types are the emulated ones. For instance, the SPE processing elements in the Cell Broadband Engine™ (Cell BE) processor have only 128-bit registers, used as vectors. Access to smaller objects requires extra work, either from the developer or the compiler. (For some examples of how a compiler might deal with this, see the developerWorks tutorial, "An introduction to compiling for the Cell Broadband Engine architecture, Part 2: Optimizing for the SPE.")
The first implementation of C was for the PDP-11, followed by the
Honeywell 635 and the IBM 360/370. The VAX 11/780, while one of
the most influential early C platforms, was only targeted after
the language had matured quite a bit. On early systems, there were
few guarantees about types: the char,
short, and long
types reflected the native word architecture of the machine, and int was whatever the processor found most
convenient.
Ports to other architectures, such as the 68000 and 80x86, tended to adopt
the same conventions as these early systems. The VAX 11/780 port, in
particular, was very influential.
Although the first C compilers supported 16-bit ints, many early C programmers
(probably influenced by the VAX)
carelessly assumed that the int type was always
interchangeable with the long type,
despite dire and well-considered warnings from C luminaries such as Henry
Spencer, whose 10 Commandments for C Programmers say:
10 Thou shalt foreswear, renounce, and abjure the vile heresy which claimeth that "All the world's a VAX," and have no commerce with the benighted heathens who cling to this barbarous belief, that the days of thy program may be long even though the days of thy current machine be short.
This particular heresy bids fair to be replaced by "All the world's a Sun" or "All the world's a 386" (this latter being a particularly revolting invention of Satan), but the words apply to all such without limitation. Beware, in particular, of the subtle and terrible "All the world's a 32-bit machine," which is almost true today but shall cease to be so before thy resume grows too much longer.
The perils of the "All the world's a VAX" assumptions came
home to roost with the rise in popularity of the Intel 8086 and
its successors, which at first used 16-bit ints. But
in fact, early 386 programming environments could be incompatible, not
only with each other, but with themselves; some compilers offered
the option of choosing whether to use 16-bit or 32-bit values for int
variables. Porting code written by careless programmers on 32-bit systems
could be nightmarish; I once spent roughly a week fixing a SPARC-native
implementation of RPC that had been hacked into running on Microsoft® Windows®; it needed
to run on both 16-bit and 32-bit Windows. If the code had been written
to the standard, instead of to a particular processor, it would have taken an
afternoon at most. (The entire afternoon would have been spent finding the
compiler flag for "generate code which can be called from outside this
library.")
C89 had a brief reference in the description of undefined behavior to indeterminately valued objects. In C99, this wording was clarified and expanded into what are now called trap representations. A trap representation is a set of bits which, when interpreted as a value of a specific type, causes undefined behavior. Trap representations are most commonly seen on floating point and pointer values, but in theory, almost any type could have trap representations. An uninitialized object might hold a trap representation. This gives the same behavior as the old rule: access to uninitialized objects produces undefined behavior.
The only guarantees
the standard gives about accessing uninitialized data are that the unsigned char type has no trap representations, and
that padding has no trap representations.
Pointers which refer to freed memory (or to automatic variables which have
gone out of scope) become indeterminate, and might become trap
representations. This is to accommodate processors on which some amount of
validation of addresses occurs when an address register is loaded. The
indeterminacy of pointers to freed memory is
also, however, very useful for programs which try to provide some level of
checking for possibly buggy code. Any reference to an indeterminate value
is a bug. Having it caught is better than having it silently ignored.
According to comp.lang.c regulars, on some implementations, a pointer to
memory obtained by malloc() and subsequently released by free() might
compare equal to a null pointer. This might seem impossible, but the standard
allows for it. (To the best of my knowledge, it only happens on segmented
architectures, where the compiler can mark an entire segment as "freed space,"
and thus necessarily invalid, and check for this when comparing pointers.)
Some compilers, or development tools, provide pointer implementations which
check for buffer overflows, access to freed memory, and other errors. Some
of their checks, such as checks for access to uninitialized values, conform
with the standard because of the trap-representations rule. Some programs
go further and warn about access to indeterminate values even when they
are accessed as unsigned char objects, which have
no trap representations. Such access isn't undefined behavior, but it's
still nice to get the warning.
Many people are aware that, in general, systems might be "big-endian" or "little-endian." These terms denote the way in which consecutive bytes of data storage are arranged in memory. On a big-endian system, the first byte of a word will be the most significant one, and the last will be the least significant. On a little-endian system, it's the other way around.
Programmers accustomed to one variety or the other might develop bad habits. Big-endian ordering is the canonical byte order for network data, such as TCP/IP packets. Some users on big-endian systems do not remember to convert values to network byte order before putting them in a packet -- which seems harmless until the code is tried on another system. On the other hand, little-endian users sometimes get in the habit of treating a pointer to a word as a pointer to a single byte, to extract the low-order bits. Users on both types of systems often write binary data in machine order without considering the problem of reading the resulting files later.
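One way to sidestep these habits is to build the external representation one byte at a time with shifts, so the code never depends on how the host stores a word. The following is a minimal sketch; put_be32 and get_be32 are hypothetical names, and the code assumes that only the low 32 bits of the value matter.

```c
/* Hypothetical helper: store the low 32 bits of value into buf[0..3]
 * in big-endian (network) order, whatever the host byte order is. */
void put_be32(unsigned char *buf, unsigned long value)
{
    buf[0] = (unsigned char)((value >> 24) & 0xFF);
    buf[1] = (unsigned char)((value >> 16) & 0xFF);
    buf[2] = (unsigned char)((value >> 8) & 0xFF);
    buf[3] = (unsigned char)(value & 0xFF);
}

/* Hypothetical helper: read back a value written by put_be32. */
unsigned long get_be32(const unsigned char *buf)
{
    return ((unsigned long)buf[0] << 24) |
           ((unsigned long)buf[1] << 16) |
           ((unsigned long)buf[2] << 8) |
            (unsigned long)buf[3];
}
```

Because the byte positions are computed arithmetically, the same source produces the same file or packet contents on big-endian and little-endian hosts alike.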
Some implementations have been "middle-endian" -- where the bytes of a two-byte word were in little-endian order, but the words of a double-word were in big-endian order. Some implementations even support switching modes; for instance, most PowerPC systems can do either little-endian or big-endian math. This is a feature you can't access from portable C, but a library vendor might take advantage of it. Such an architecture is sometimes called "bi-endian" or "open-endian."
Of course, the diversity of type implementations in C wasn't
all arguments about whether int was exactly the
same as short or exactly the same as long. Some systems
had word sizes that were not precise multiples of 32 bits. C worked
fine on them, but a lot of code written in C didn't. Among the more
interesting are 60-bit systems (CDC Cyber) which are only addressable by
full words; on such a system, char is 60 bits.
As the computer market has tended towards commodity hardware and more similar designs, the importance of flexibility in the type system has diminished. The C89 standard made fewer guarantees about types, in the interests of protecting compiler writers targeting unusual platforms. The C99 standard gives many more guarantees.
C99 made a number of changes to the C type system. The widespread use of
long long was codified, in a way better integrated
into the type system than it had been in existing practice. However, C99 also
introduced a number of types designed to give users the ability to specify
requirements for types a little better.
First, C99 introduces new types specifically for common user requirements.
The type system problems that came up during the life of
C89, such as disagreements about whether long was guaranteed to
be the largest type, or to be able to hold a pointer, were addressed by
adding types specifically for these purposes -- for instance, an integer large enough to hold a pointer (intptr_t),
or an integer at least as large as any other integer
(intmax_t). These types are declared in the new
<stdint.h> header, which the new <inttypes.h> header includes.
The <stdint.h> header also provides definitions
of a number of types which might or might not be provided by all implementations,
such as exact-width integer types. In general, the name intN_t
indicates an object with exactly N bits; however, not every system
provides every possible bit width.
C99 generally provides more complete specifications than C89 did. One particular example of the greater amount of specification is the handling of negative operands to the modulus and division operators. In C89, it was up to the compiler to decide whether -3/5 was -1 with a remainder of +2, or 0 with a remainder of -3. This allowed implementations to use their native division and remainder or modulus operators (whichever way they worked), but imposed substantial additional work on developers, who had to use extra code to ensure that they got the particular result their algorithm required. C99 was more willing than C89 to specify such things, and in C99, the result of a division operator truncates towards zero, so -3/5 is 0 with a remainder of -3.
Two major factors made this kind of specification practical: processors are faster (and also more likely, it turns out, to use the behaviors C now specifies), and compiler technology has improved. A modern compiler, even compiling for a system where division doesn't work as specified, is more likely to be able to optimize out the extra tests in cases where they aren't really necessary.
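To make the two conventions concrete, here is a small sketch that layers the "round towards negative infinity" behavior on top of C99's truncating operators. The function names floor_div and floor_mod are hypothetical; the code assumes a C99 compiler, a nonzero divisor, and operands for which a / b does not overflow.

```c
/* Hypothetical helper: quotient rounded towards negative infinity,
 * built from C99's truncating / and % operators. */
int floor_div(int a, int b)
{
    int q = a / b;   /* C99: truncates towards zero */
    int r = a % b;   /* C99: takes the sign of the dividend */

    /* Truncation rounded the wrong way only when there is a nonzero
     * remainder whose sign differs from the divisor's. */
    if (r != 0 && ((r < 0) != (b < 0)))
        q -= 1;
    return q;
}

/* Hypothetical helper: remainder matching floor_div, so that
 * floor_div(a, b) * b + floor_mod(a, b) == a. */
int floor_mod(int a, int b)
{
    int r = a % b;

    if (r != 0 && ((r < 0) != (b < 0)))
        r += b;
    return r;
}
```

With C99 semantics, -3 / 5 is 0 and -3 % 5 is -3, while floor_div(-3, 5) is -1 and floor_mod(-3, 5) is 2, which are exactly the two answers a C89 compiler was free to choose between.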
C99 adds support for complex numbers, using three new types, all in the
_Complex type family
(float _Complex, double _Complex, and long double _Complex),
as well as a complete set of operations on them.
When converting a real type to complex, the imaginary part is zero
(positive zero, if it matters), and when converting a complex type to
real, the imaginary part is discarded. The standard floating point arithmetic
operations work on complex numbers.
Complex types must be supported in hosted implementations, but may be omitted in freestanding implementations. Pure-imaginary types are also described, but are optional on all platforms.
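The conversions and arithmetic described above can be sketched as follows. This assumes a hosted C99 implementation that provides <complex.h>; the function names are hypothetical and exist only for illustration.

```c
#include <complex.h>

/* Real to complex: the imaginary part of the result is zero. */
double complex widen(double x)
{
    double complex z = x;

    return z;
}

/* Complex to real: the imaginary part is discarded. */
double narrow(double complex z)
{
    double x = z;

    return x;
}

/* The ordinary arithmetic operators accept complex operands. */
double complex square(double complex z)
{
    return z * z;
}
```

For example, square(1.0 + 2.0 * I) has real part -3.0 and imaginary part 4.0, which creal() and cimag() extract.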
C99 introduces a boolean type, _Bool. The
boolean type has exactly two
values: 0 and 1, known also as false and true, respectively.
When a value of any other type is converted to boolean
type, the resulting value is 0 if the other value compared equal to 0, and
1 otherwise. Note that pointers may be converted to the boolean type
without a cast; the result is 1 if the pointer was not a null pointer, and
0 if the pointer was a null pointer. The values yielded by the relational
and logical operators are not of the boolean type (they remain int), in contrast to C++; the committee
reasoned that changing this would potentially affect too much existing code.
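A short sketch of these conversion rules follows. The function names are hypothetical, and each returns the converted value as an int so the result is easy to inspect.

```c
/* Any nonzero integer converts to 1. */
int bool_from_int(int v)
{
    _Bool b = v;

    return b;
}

/* Conversion tests "compares equal to 0"; it does not truncate first,
 * so 0.5 converts to 1 even though (int)0.5 is 0. */
int bool_from_double(double d)
{
    _Bool b = d;

    return b;
}

/* Pointers convert without a cast: 0 for a null pointer, 1 otherwise. */
int bool_from_pointer(const void *p)
{
    _Bool b = p;

    return b;
}

/* A comparison still yields an int, not a _Bool. */
int comparison_has_int_size(void)
{
    return sizeof(1 < 2) == sizeof(int);
}
```

The floating-point case is the one most likely to surprise: converting through int first would give a different answer than converting directly to _Bool.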
Not all systems provide all the new C99 types. If the implementation does not provide a type, the implementation must not define the
corresponding macros specifying its range, called limit macros.
If the implementation does provide a type, it must also provide the corresponding limit macros. Because these macros
exist only when a given feature is available, they are called feature
test macros. For instance, if the intptr_t type is available on an
implementation, the value INTPTR_MAX must also be defined. If the
implementation doesn't provide that type (for instance, if it has pointers
that are too large to be represented in any supported integer type), the
macro must not be defined.
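In practice, this means a program can test for an optional type with the preprocessor. The sketch below uses INT32_MAX to choose between int32_t and int_least32_t; the same pattern applies to INTPTR_MAX and intptr_t. The names sample_t, SAMPLE_IS_EXACT, and sample_storage_bits are hypothetical.

```c
#include <limits.h>
#include <stdint.h>

#ifdef INT32_MAX
/* The limit macro exists, so the exact-width type exists too. */
typedef int32_t sample_t;
#define SAMPLE_IS_EXACT 1
#else
/* Fall back to the required "at least 32 bits" type. */
typedef int_least32_t sample_t;
#define SAMPLE_IS_EXACT 0
#endif

/* Number of bits of storage occupied by a sample_t object. */
unsigned sample_storage_bits(void)
{
    return (unsigned)(sizeof(sample_t) * CHAR_BIT);
}
```

Code written this way compiles on implementations with and without the exact-width type, and can consult SAMPLE_IS_EXACT wherever the difference matters.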
The exact-width types (intN_t) are optional,
although most systems can obviously provide the
exact-width 8, 16, 32, and 64-bit types. On the other hand, the need
for exact-width types is rare. In most code, the use of the "at least" types
is more appropriate; in practice, this means using
short, int, and long.
Note that the types enumerated in the standard are not necessarily an
exhaustive list. An implementation is free to provide int_least128_t if
it wishes, or even int37_t, if it can. The requirement is that the
feature test macros be defined only if the given type actually works
analogously to the standard types; so
for instance, uint11_t must wrap around from 2047 to 0 again,
if it is
provided at all. The intN_t types are required to use a two's complement
representation.
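The wraparound requirement is easy to demonstrate with a type that most implementations do provide. This sketch assumes uint16_t is available (it refuses to compile otherwise), and next_u16 is a hypothetical name.

```c
#include <stdint.h>

#ifndef UINT16_MAX
#error "this sketch needs an implementation that provides uint16_t"
#endif

/* Increment with the modulo-65536 wraparound that an exact-width
 * unsigned type is required to have. */
uint16_t next_u16(uint16_t v)
{
    return (uint16_t)(v + 1u);
}
```

Here next_u16(65535) is 0, in the same way that a hypothetical uint11_t would have to wrap from 2047 to 0.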
The C89 type system was pretty good. Unfortunately, programmers who had
written code which depended on the knowledge that int and long were
the same type, or that long was exactly 32 bits, made it economically
necessary to introduce a new type for 64-bit integers on some platforms, and
indeed, support for it quickly showed up on nearly all platforms, even those
where the type was emulated.
As a result, the new long long type was introduced. This created a
problem. In the C89 standard as written, long was the largest integer
type. Any value in any (signed) integer type could be stored in an object
of type long. The long long type introduced an exception. This was
not the worst part, though.
In C89, the rules for the type of an integer constant were that, if the
constant was not specifically labeled as being unsigned, it would be
signed if possible. If the value could be represented in an int, it was
of type int. If it was too large for int, but could be represented as
a long, it became a long. If it could be represented only as
unsigned long, then it was an unsigned long. But what should be done
with larger values? On a system with a 32-bit long, a constant 3 billion (3000000000)
was unambiguously an unsigned long in C89, but
might well be a signed long long in C99. Making
large constants suddenly signed is confusing, but having a region of unsigned
constants sandwiched between two regions of signed constants (0-2 billion and
4 billion and up, roughly) would be pretty bad, too.
Either decision would be surprising to at least some
people.
In C99, the decision was made to standardize the rules so that values promote in a more predictable way. Decimal constants which don't have a suffix indicating an unsigned type are always signed; octal or hexadecimal constants can be either signed or unsigned, and will be of the smallest type large enough to hold the value.
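The effect of the C99 rules can be observed by negating a constant and checking the sign of the result. This is a sketch under two stated assumptions: the compiler follows the C99 (or later) rules, and int is 32 bits wide, which is what makes 0xFFFFFFFF an unsigned int. The function names are hypothetical.

```c
/* An unsuffixed decimal constant is always signed in C99, so negating
 * 3000000000 gives a negative long or long long. Under C89 with a
 * 32-bit long it was an unsigned long, and this would return 0. */
int decimal_constant_is_signed(void)
{
    return -3000000000 < 0;
}

/* Assuming a 32-bit int: 0xFFFFFFFF does not fit in int but does fit
 * in unsigned int, so it is unsigned, and its negation wraps to 1. */
int hex_constant_is_unsigned(void)
{
    return -0xFFFFFFFF > 0;
}
```

Both functions return 1 under those assumptions. On an implementation with a wider int, the hexadecimal constant would fit in int and the second function would return 0.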
The new rules, especially the formal definition of what makes a type
"wider," are designed to give some insurance against future confusions.
The new intmax_t type offers a type guaranteed to be large enough to
hold any integer value the implementation can represent. This eliminates
the overlap where long was both the only type which
was definitely at least 32
bits, and was also supposed to be the largest type, able to hold any integer value.
Providing a separate type whose only function is to be the largest
type allows implementations to offer extensions, even beyond the size of
long long,
while leaving users a standardized way to store even these numbers.
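One common use of intmax_t is as a single parameter type for code that must accept any signed integer. The following sketch assumes a C99 library, where the j length modifier in "%jd" denotes intmax_t; format_signed is a hypothetical name.

```c
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical helper: format any signed integer value as decimal text.
 * The caller's argument is converted to intmax_t, which can hold every
 * value of every signed integer type the implementation offers.
 * Returns what snprintf returns: the length the text needs. */
int format_signed(char *buf, size_t size, intmax_t value)
{
    return snprintf(buf, size, "%jd", value);
}
```

A call such as format_signed(buf, sizeof buf, -5000000000LL) works the same way whether the argument started out as a short, a long long, or an implementation-specific extended type.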
The intptr_t type is an attempt at the best possible compromise between
programmers' occasional need (or at least desire) to store pointers in
integer variables, and the practical difficulties of doing so portably. Some
APIs have used such a feature extensively, generally in innovatively
nonportable ways.
Even when an implementation provides it, intptr_t
might not be able to store function pointers properly. This is because, on some platforms,
function pointers are more elaborate than standard object pointers, and
there might be no integer type with enough storage to hold one.
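The one guarantee intptr_t does make is that a valid pointer to void survives a round trip through it. Here is a minimal sketch of that round trip, with hypothetical function names, which refuses to compile where the type is missing.

```c
#include <stdint.h>

#ifndef INTPTR_MAX
#error "this sketch needs an implementation that provides intptr_t"
#endif

/* Hypothetical helper: pack an object pointer into an integer handle. */
intptr_t handle_from_pointer(void *p)
{
    return (intptr_t)p;
}

/* Hypothetical helper: recover the pointer. The result compares equal
 * to the original pointer to void. */
void *pointer_from_handle(intptr_t h)
{
    return (void *)h;
}
```

Note that only object pointers are covered; nothing here makes it safe to pass a function pointer through the same pair of functions.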
The unusual names of the new _Bool and _Complex types are a
reaction to the very real possibility that existing code has defined
its own Boolean and complex types, perhaps using the obvious typedef
names bool and complex. Therefore, the new type names _Bool and
_Complex are selected from the namespace reserved for implementation
use, and should thus not clash with names in existing code.
Only when the corresponding new header files <stdbool.h> and <complex.h>
are included does the implementation supply macros providing the more obvious
bool and complex spellings.
Thus, the name clash only takes effect when you start including a new C99
header, so existing code isn't broken.
C99 also added variable-length arrays (VLAs) and support for structures whose last member was an array of variable length. These features are both performance and convenience features based on widely implemented features of existing compilers; Part 2 discussed them in more detail.
Learn
- See the initial proposal to add "long long" to the standard.
- This article represents everything you ever wanted to know about C types. For a gentle introduction to C types, see Types by P.J. Plauger and Jim Brodie.
- A much more detailed history of C was written by dmr, who ought to know.
- The comp.lang.c FAQ is full of useful information about C, including the type system, common pitfalls, and nearly everything else. The book version, which has additional content, is particularly valuable.
- Andrew Koenig's paper, C Traps and Pitfalls, was later expanded into an excellent book of the same name.
- Henry Spencer's Ten commandments for C programmers remain topical and relevant today.
- Learn more of the ins and outs of C programming in the comp.lang.c FAQ (Frequently Asked Questions) and the comp.lang.c IAQ (Infrequently Asked Questions).
- See all four articles in this series.
- Take the tutorial, "An introduction to compiling for the Cell Broadband Engine architecture, Part 2: Optimizing for the SPE."

Peter Seebach joined the ISO C committee as a hobby some years ago. His
favorite type is int. He has never had any of his own code fail to run
on 64-bit systems. He would love to hear about errors or omissions in
these articles; contact him at developerworks@seebs.plethora.net.



