Level: Introductory
Peter Seebach (developerworks@seebs.plethora.net), Freelance author, Plethora.net
03 Jan 2006
This article, the first in a four-part series, introduces the basics of the C type system, with an overview of what it means to talk about type and a detailed discussion of the basic types (integer and floating point).
The C language provides a fairly elaborate system for data types.
This series provides an overview of the C type system, as well as some
detailed discussions of some of the musty corners people don't normally
know about. It is worth learning. As Henry Spencer wrote:
A programmer should understand the type structure of his language,
lest great misfortune befall him. Contrary to the heresies espoused by
some of the dwellers on the Western Shore, `int' and `long' are not
the same type. The moment of their equivalence in size and
representation is short, and the agony that awaits believers in their
interchangeability shall last forever and ever once 64-bit machines
become common.
The C type system has been adapted to a great number of architectures.
As C was adapted to new systems, decisions had to be made. Should the
int type be the same size on every new system,
or should it be the most convenient size on every system, even if this
meant it wasn't always the same size?
If you learn nothing else from these articles, learn this: On any C
language implementation compliant with any C standard ever written,
sizeof(char) is exactly one, whether char is eight bits, 16, 60, or 64.
If you use the GNU autoconf test for sizeof(char), you might as well
tattoo "I don't know what sizeof means" on your forehead.
The first article of the series introduces the type system itself,
explaining the basic types and the system of type qualifiers and
storage-class specifiers.
What do we mean by a type?
The type of an object in C describes the way in which a chunk of memory
is associated with a value. In many cases, a type describes a computer's
native ways of representing values. For instance, on a typical UNIX®
system, declaring a variable as int will reserve enough storage to hold
a single processor register. This variable will be manipulated by
processor-native instructions, and will have the semantics native to the
processor.
C is generally considered a strongly typed language (although some
dispute the exact terminology). C compilers can catch many common or
likely errors based on the types of variables and expressions. Types
include both data representation and operations; for instance, division
isn't the same on integer and floating-point values.
One of the most fundamental aspects of a type is its size. In C, size is
measured in units of unsigned char, and returned as a value of type
size_t, which is some unsigned integer type; the size of a type is the
number of unsigned char objects it would take to hold all the bits used
to store the object. The built-in sizeof operator yields this size.
This is often referred to, quite correctly, as the "size in bytes," but it
is important to know that, in C, a /byte/ is "an object of type unsigned
char." On a system where unsigned char is larger than eight bits,
sizeof(char) is still always one, and everything else is counted in terms
of the actual number of bits in a char, not in terms of octets.
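To make the distinction between bytes and octets concrete, here is a minimal sketch. The helper name bits_in is an assumption made up for illustration; CHAR_BIT is the standard macro from <limits.h> giving the number of bits in a char.

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* size_t */

/* Hypothetical helper: the number of bits in an object whose sizeof
   is nbytes. sizeof counts in chars, and each char has CHAR_BIT bits,
   which is at least eight but may be more. */
size_t bits_in(size_t nbytes)
{
    return nbytes * CHAR_BIT;
}
```

Calling bits_in(sizeof(int)) gives 32 on many current systems, but the only portable guarantee is that it is at least 16, and bits_in(sizeof(char)) is always exactly CHAR_BIT.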
There is more to a type than just a number of bits. There are multiple
possible data representations, although the C standard mandates a /binary/
representation for integers.
A binary representation is one in which a number consists of a series of
bits, each of which is either on or off, whose values are powers of two,
and are added together. For positive values, this is easy enough to
figure out; there are bits representing the values one, two, four,
eight... up to the largest bit that represents a positive value.
The question of how to represent negative numbers is more complicated.
A conforming C implementation can represent negative numbers in three
ways: two's complement, ones' complement, and sign-magnitude.
All three are permitted, but the vast majority of modern systems use two's
complement. (The punctuation of the complement types may seem odd; the
ones' complement is a complement of plural ones, while the two's
complement involves a single power of two being flipped.) More arcane
options, such as Gray code, are not allowed. (An implementer wishing to
target such a machine would have to go to some lengths to hide the
underlying bit representation.)
In sign-magnitude, the top bit represents sign, and everything else is
unchanged. A bit pattern that represents four always represents a number
with a magnitude of four; the sign bit determines whether it's positive or
negative. In ones' complement, negative numbers are handled differently: a
given negative value is represented by inverting all the bits from the
positive value. All ones represents negative zero.
The two's complement system is different. One way it is often explained
is to say that you invert everything, then add 1. So, with an eight-bit
value, if you want to write -127, you start by writing 127 (01111111),
invert everything (10000000), then add 1 (10000001). Another way to think
of it is to simply see the top bit as having a negative value, and all the
other bits as still having the same values they always had. So, for
instance, 10000001 in two's complement is -128 + 1, or -127. This is
particularly intuitive if you're already used to performing bitwise
operations. On a signed two's complement 32-bit number, the top bit has
the value -2,147,483,648.
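Both ways of thinking about two's complement can be sketched in a few lines. This example assumes the implementation provides the optional uint8_t type from <stdint.h>; the function names are made up for illustration. The work is done on unsigned bit patterns so that the result does not depend on how the host machine represents negative numbers.

```c
#include <stdint.h>

/* "Invert everything, then add 1": flip all eight bits, add one,
   and keep only the low eight bits of the result. */
uint8_t twos_complement_negate(uint8_t bits)
{
    return (uint8_t)((bits ^ 0xFFu) + 1u);
}

/* "The top bit has a negative value": the top bit is worth -128,
   and the other seven bits keep their usual positive values. */
int twos_complement_value(uint8_t bits)
{
    return (bits & 0x7F) - ((bits & 0x80) ? 128 : 0);
}
```

Starting from 01111111 (127), twos_complement_negate yields 10000001, and twos_complement_value reads that pattern back as -128 + 1, or -127.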
The main thing that affects users directly is that, on a two's complement
system, the negative range is one larger than the positive range. A
signed 16-bit number can represent the range from -32,768 to +32,767. On
ones' complement or sign-magnitude systems, the range would be +/-32,767,
and there would be two representable 0 values. This is why the
standard-imposed minimum ranges for signed types are symmetrical, even
though the actual limits on many systems are asymmetrical.
As a side note, the asymmetry makes for a common pitfall in attempts to
read numbers in. If you handle negative numbers by calculating their
magnitude, then negating, your code may behave surprisingly on the most
negative number the type can represent, because the magnitude of that
number does not fit in the type.
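One way around the pitfall is to accumulate the value as a negative number, since every representable magnitude fits on the negative side. The following is a sketch only, assuming C99 division semantics; parse_int is a hypothetical helper, not a standard function.

```c
#include <limits.h>

/* Hypothetical helper: parse an optionally signed decimal string.
   The running value is kept negative so that INT_MIN can be reached
   without ever needing its (unrepresentable) positive magnitude.
   Returns 1 on success, 0 on overflow or malformed input. */
int parse_int(const char *s, int *out)
{
    int negative = 0;
    int acc = 0;                      /* always <= 0 */

    if (*s == '-' || *s == '+') {
        negative = (*s == '-');
        s++;
    }
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++) {
        int digit;
        if (*s < '0' || *s > '9')
            return 0;
        digit = *s - '0';
        /* Check first, so that acc * 10 - digit cannot overflow. */
        if (acc < (INT_MIN + digit) / 10)
            return 0;
        acc = acc * 10 - digit;
    }
    if (!negative) {
        if (acc < -INT_MAX)           /* too large to negate safely */
            return 0;
        acc = -acc;
    }
    *out = acc;
    return 1;
}
```

The only negation happens at the end, on the positive path, and only after checking that the result fits.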
Complimentary signs
If you want to know how the various systems actually represent values,
here's what the range of values would look like on a three-bit system
(SM is sign-magnitude, 1C is ones' complement, 2C is two's complement).

Raw bits   Value (SM)   Value (1C)   Value (2C)
000         0            0            0
001         1            1            1
010         2            2            2
011         3            3            3
100        -0           -3           -4
101        -1           -2           -3
110        -2           -1           -2
111        -3           -0           -1
In addition to the question of storing values, there's also the question
of other properties. For instance, the behavior on overflow (when a
calculation exceeds the range of values a type can represent) can differ
from one system to another. C mandates predictable behavior for overflow
on unsigned types, but not on signed types.
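The difference between the two cases can be sketched as follows. The function names are illustrative assumptions; the point is that unsigned arithmetic may simply be performed, while signed arithmetic has to be checked before the operation because the overflow itself is undefined.

```c
#include <limits.h>

/* Unsigned arithmetic is defined to wrap around modulo UINT_MAX + 1,
   so this is well defined for all inputs. */
unsigned int wrapping_add(unsigned int a, unsigned int b)
{
    return a + b;
}

/* Signed overflow is undefined, so test before adding.
   Returns 1 and stores the sum if it fits in an int, 0 otherwise. */
int checked_add(int a, int b, int *sum)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return 0;
    *sum = a + b;
    return 1;
}
```

With these, wrapping_add(UINT_MAX, 1u) is 0 on every conforming implementation, while checked_add(INT_MAX, 1, &sum) reports failure instead of invoking undefined behavior.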
Another avenue of potential variance is handling of division. The rules
for division and remainder on positive numbers are fairly well defined.
C89's requirements for division and remainder on negative numbers were
simply that (x/y)*y+(x%y)==x. This leaves open an ambiguity. It's
obvious that -5/5 is -1. What's not obvious is how to handle -3/5.
Should it be -1, with a remainder of 2, or should it be 0, with a
remainder of -3? Both results have been generated by real-world
processors. In C99, the decision was made to require truncation towards
0; -3/5 is 0, with a remainder of -3. This imposes additional work on
some processors. (Part 3 of this series comes back to this topic.)
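For code that wants the other answer (a quotient of -1 with a remainder of 2), the rounding has to be done by hand. The sketch below assumes a C99 compiler, where / truncates towards 0; floor_div and floor_mod are hypothetical helpers, not library functions.

```c
/* Hypothetical helpers giving the "round down" interpretation that
   C89 also allowed. y must not be zero, and the usual overflow case
   (INT_MIN divided by -1) is not handled here. */
int floor_div(int x, int y)
{
    int q = x / y;                    /* C99: truncates towards 0 */

    /* If there is a remainder and the signs differ, the truncated
       quotient is one too large. */
    if (x % y != 0 && ((x < 0) != (y < 0)))
        q--;
    return q;
}

int floor_mod(int x, int y)
{
    return x - floor_div(x, y) * y;
}
```

Both conventions satisfy (x/y)*y+(x%y)==x; they differ only in which way the quotient is rounded when the signs differ.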
In many cases, it is possible to convert a value from one type to
another; this is called conversion. To request a conversion explicitly in
C, you put a type name in parentheses before the expression; this syntax
is called a /cast/. For instance, (float) 1 converts the integer constant
1 to type float. Not all types can be converted freely, although most
numeric values can be converted from one type to another.
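A short sketch of where a cast changes the outcome; the function names are assumptions chosen for illustration.

```c
/* Both operands are int, so this is integer division. */
int int_quotient(int a, int b)
{
    return a / b;
}

/* The cast converts a to double first; b is then converted to match,
   and the division is carried out in floating point. */
double real_quotient(int a, int b)
{
    return (double)a / b;
}

/* Converting a floating point value to int discards the fraction. */
int truncate_to_int(double x)
{
    return (int)x;
}
```

So int_quotient(1, 2) is 0, real_quotient(1, 2) is 0.5, and truncate_to_int(2.9) is 2.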
Floating point math
C provides both integer types and floating point types. The floating
point types are particularly variable in range and behavior. In general,
floating point types have greater range than integers of the same size,
and can represent non-integer values. Generally, this is accomplished by
having a mantissa (a value) and an exponent. Floating point numbers are
expressed as some integer multiplied by some power of two: a positive
power of two for large numbers, and a negative power of two for small
numbers.
As a result, not all numbers in the range of a floating point type can be
represented exactly; for instance, 1/3 cannot be represented exactly, and
precision gets worse as numbers get larger: a typical 32-bit floating
point type, with 24 bits of mantissa, can represent 8,388,607.5 but not
8,388,608.5. (Why? Well, because 8,388,608 is 2^23, and from there on no
mantissa bits are left over for a fraction.) Most users are bitten at
least once by the inability to represent 1/10 exactly in binary floating
point, and many have run into the range of numbers (for a 32-bit type,
starting around 16.7 million) where even data to the left of the decimal
point can't be represented.
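These limits are easy to observe. The sketch below assumes that float is the 32-bit IEEE 754 single-precision format, which is common but not required by the standard; round_to_float is a hypothetical helper.

```c
/* Hypothetical helper: round a double to the nearest float. Storing
   through a volatile float makes sure the value really is rounded to
   float, even if the compiler evaluates expressions in a wider format. */
float round_to_float(double x)
{
    volatile float f = (float)x;
    return f;
}
```

Under that assumption, round_to_float(8388607.5) returns exactly 8388607.5, but round_to_float(8388608.5) returns 8388608.0, round_to_float(16777217.0) returns 16777216.0, and round_to_float(0.1) is not equal to the double value 0.1.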
Floating point mathematics introduces additional considerations that
don't apply to integer arithmetic. For instance, because floating point
arithmetic is approximate, the distributive and associative laws don't
always apply. (The same is true in some cases of pure integer math as
well.) There are more subtle differences. For instance, the standard
permits floating point operations to be performed with greater precision
or range than the types used for the operands would allow. This can, in
some cases, change the results of an operation from what they would be if
it were really performed entirely in the specified type.
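The failure of the associative law can be shown with three additions. This is a sketch, with made-up function names; the volatile intermediates force each partial sum to be rounded to double, so that extra precision in the evaluation does not hide the effect.

```c
/* (a + b) + c, with each step rounded to double. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    volatile double r = t + c;
    return r;
}

/* a + (b + c), with each step rounded to double. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    volatile double r = a + t;
    return r;
}
```

With a = 1e20, b = -1e20, and c = 1.0, sum_left gives 1.0, because the two large values cancel first. sum_right gives 0.0, because 1.0 is too small to change -1e20 and is lost before the cancellation happens.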
Floating point types come in a variety of sizes and representations. The
C99 standard introduced a number of new features for floating point
mathematics, including support for the IEEE 754 specification for
floating point values and operations. Part 3 discusses these features in
greater detail.
Minimum maximums
The C standard does not specify a number of bits for a type, but a
required range. The ranges are specified in terms of what the smallest
acceptable magnitude for the limits is. For instance, the int data type
must be able to represent numbers as large as 32,767, and as small as
-32,767. As it happens, this requires 16 bits of storage at a minimum.
On an actual machine, the int data type may be larger than 16 bits, but
not smaller. The unsigned char data type must be able to represent
values from 0 to 255, so it must be at least eight bits. Finally, the long
type handles ranges of at least +/-2,147,483,647, and the long long type
handles ranges of at least +/-9,223,372,036,854,775,807.
Because C allows for machines that don't use two's complement arithmetic,
the limits required for signed types are one smaller than the limit that
would be native on a two's complement system.
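The actual limits of an implementation are published in <limits.h>, so the required minimums can be checked directly. The function names below are illustrative assumptions; the macros are the standard ones (the long long macros require C99).

```c
#include <limits.h>

/* Returns 1 if this implementation's limits meet the minimum
   magnitudes quoted above, which any conforming one should. */
int limits_meet_minimums(void)
{
    return CHAR_BIT >= 8
        && UCHAR_MAX >= 255
        && INT_MAX >= 32767 && INT_MIN <= -32767
        && LONG_MAX >= 2147483647L && LONG_MIN <= -2147483647L
        && LLONG_MAX >= 9223372036854775807LL
        && LLONG_MIN <= -9223372036854775807LL;
}

/* Returns 1 if int has one more negative value than positive values,
   as it does on a two's complement system. */
int int_range_is_asymmetric(void)
{
    return INT_MIN < -INT_MAX;
}
```

Portable code can rely on limits_meet_minimums() being true, but should not rely on the result of int_range_is_asymmetric().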
Compatible types
Sometimes two or more seemingly distinct types will have the same
underlying representation. For instance, char always has exactly the
same representation and storage as one of either signed char or
unsigned char, but it is still a distinct type. Similarly, on a very
large number of implementations (though not all), int has the same
representation as one of either short or long. That was typically the
case on 16-bit and 32-bit architectures, but it would generally be a poor
implementation choice on 64-bit architectures. This, in fact, is why the
types are still considered distinct. A compiler which fails to warn you
that you've mistakenly mixed int and long, because it knows that they
happen to share a storage format on this particular platform, is not doing
you any favors.
The char types are particularly strange in this respect. Some early
compilers had plain char signed by default, because that's how other
types worked, while others made it unsigned by default, because this was
more convenient to developers in many cases. The debate over whether
plain char should be signed or unsigned ended in a draw. If you want to
talk about strings, use plain char. If you want to treat memory as an
array of bytes, use unsigned char. The signed char type is useful
only when you want to use char objects as tiny, signed integers.
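A brief sketch of both points: how to find out which way plain char went on a given implementation, and how to look at memory as bytes. The function names are assumptions for illustration; CHAR_MIN is the standard macro from <limits.h>.

```c
#include <limits.h>
#include <stddef.h>

/* Plain char is signed on this implementation exactly when its
   minimum value is negative. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Treat any object as an array of bytes by reading it through a
   pointer to unsigned char, so every byte is in 0..UCHAR_MAX. */
unsigned long byte_sum(const void *object, size_t size)
{
    const unsigned char *bytes = object;
    unsigned long total = 0;
    size_t i;

    for (i = 0; i < size; i++)
        total += bytes[i];
    return total;
}
```

Had byte_sum read through a plain char pointer instead, a byte such as 0xFF would contribute 255 on some systems and -1 on others.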
Storage-class specifiers
Storage-class specifiers tell the compiler where or how a given object is
stored. The storage-class specifiers are auto, extern, register,
and static.
The auto storage-class specifier requests storage that is local to a
function's invocation (instantiated when the function is called and
deleted when it returns), and duplicated (as necessary) during recursive
calls. auto is the default storage class for objects declared within a
function, and it cannot be used on objects declared outside of a function, so
it is never actually necessary, and most programmers consider the auto
keyword obsolete and never use it explicitly at all.
As the name implies, the register storage-class specifier is a
suggestion to a compiler that the object in question might best be stored in a processor register. Depending on register availability, this
assignment cannot be guaranteed, of course, but the compiler is supposed
to take it as a very strong hint. Since the object might be stored
in a register, and thus have no address, a programmer must not attempt to
take the address of an object which was declared with the register
storage class. Like auto, register cannot be used outside a function.
The extern storage class indicates that an object is being declared,
but not defined; the object itself will come from some other module, and
the extern declaration merely gives the compiler enough information to
generate code which refers to that object. An object declared within a
block, with the extern storage-class specifier, is a reference to a
global variable.
The static storage-class specifier has a variety of meanings which may
seem bewildering to the typical user. For the most part, it really does
only one, compound, thing: It defines an object to have permanent
storage, but not be visible outside the scope of its declaration.
For an object declared at file scope (outside of any function), the
permanent storage is the typical behavior, so the only effect static has
is to cause the object to be hidden outside the particular file it was
declared in. For an object declared within a function, being invisible
outside its scope is the typical behavior, so the only effect static has
is to reserve permanent storage for the object.
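The two effects can be seen side by side in a small sketch; the names here are made up for illustration.

```c
/* File scope: storage is already permanent, so static only hides
   calls_made from other source files. */
static int calls_made;

int next_id(void)
{
    /* Block scope: the name is already local, so static only gives
       next permanent storage; it keeps its value between calls. */
    static int next = 0;

    calls_made++;
    return ++next;
}

int call_count(void)
{
    return calls_made;
}
```

Successive calls to next_id() return 1, 2, 3, and so on, and another source file cannot refer to calls_made by name even with an extern declaration.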
However, it's not a new C standard without a new meaning for the keyword
static. The new meaning for static in C99 is that an array argument
to a function may be declared with a size modified by the static
keyword.
This is an optimization hint: the compiler is promised that every call to
the function will provide an array with at least that many elements. For
instance, the following function declaration specifies a function which
must be called with an array of 16 or more elements.
Listing 1. A new meaning for static
void foo(int a[static 16]);
This is aimed at compilers doing vectorization and similar work. Unlike
some optimization hints, this one ends up allowing the compiler to
generate code which can fail in impressive ways if the constraint (that
the array has at least that many members) is not met at runtime. The
syntax here was invalid in C89, and only some compilers support it now.
Although it is not a storage class specifier, the inline specifier
(which can only be applied to function declarations) has similar effects.
The inline specifier encourages the compiler to replace calls to a
function with code having the same effect; as with register, inline is
a hint. It is possible to provide an inline definition of a function
which also has an external definition; it is up to the implementation to
decide which one to use. Thus, if the semantics of the two functions are
different, such that using one of them would be a bug, the program is
buggy. The inline specifier does not guarantee that a function will
actually be converted to inline instructions; it merely suggests to the
compiler that speed is more important than space-efficiency for a given
function.
Although it is syntactically a storage-class specifier, the typedef
keyword does not specify a storage class; Part 4 of
this series explains it.
Type qualifiers
Type qualifiers do not modify the underlying storage or representation of
an object, but tell the compiler something about how that object will be
used. There are three type qualifiers: const, restrict, and
volatile.
The const qualifier is a promise that the programmer won't try to
modify something through a given value. If an object is declared as
const, the programmer is committed to never modify it; if a pointer to
const-qualified type is declared, the programmer is merely committing
not to modify it through that pointer. The const qualifier is often
misunderstood, not least because it's used differently in C++, which uses
it to declare constants.
The const qualifier doesn't prohibit the value from being changed, either through some
other access to the same chunk of storage, or through the actions of the
machine. For instance, a timer register might well be declared const,
because it can't be written to, but it would still change frequently.
Similarly, a function which takes a pointer to a const-qualified value
is promising that it won't modify that value, but we cannot assume that
the value in question can't be changed by some other code.
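A sketch of a value changing behind a pointer to const; all names here are assumptions made up for the example.

```c
static int shared = 1;

/* This function promises not to modify the int through view. It
   makes no promise that the int won't change some other way. */
int observe_change(const int *view)
{
    int before = *view;

    shared = before + 1;      /* modifies shared by its own name */
    return *view - before;    /* nonzero if view points at shared */
}

int observe_shared(void)
{
    return observe_change(&shared);
}
```

observe_shared() returns 1: the value seen through the const-qualified pointer changed even though nothing was ever written through that pointer.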
The volatile qualifier instructs the compiler to ensure that all
accesses to the qualified object actually occur. This is primarily used
for objects which represent hardware, and essentially prohibits
optimizations that would remove, combine, or cache those accesses.
Consider what an optimizer might otherwise do to code such as:
Listing 2. A loop that does nothing
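extern int timer;

/* wait_ten_ticks is an illustrative name for the enclosing function */
void wait_ten_ticks(void)
{
    int x = timer + 10;
    while (timer < x)
        ;   /* nothing in the loop changes timer */
}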
The optimizer can immediately see that nothing in the loop changes timer,
so the condition can never become false; it is free to read timer once
and generate an endless loop. However, if the memory location accessed by
the variable timer is actually a special timer register, this impression
is wrong. Therefore, if the declaration instead reads extern volatile int
timer, the compiler must generate code that rereads timer on every
iteration.
One common application is threaded code, where one thread can modify a
value that another thread watches. Threads are outside the scope of the
C99 standard, but many implementations provide them. The formal semantics
of the volatile qualifier end up making access to shared variables work
well enough most of the time, even without specific intent to generate
thread-safe code. However, volatile guarantees neither atomicity nor
ordering between threads, so it is not a substitute for the
synchronization primitives a threading library provides.
The volatile qualifier is also sometimes used for objects to be
modified by signal handlers, for the same reason.
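The signal handler case can be sketched with nothing but standard library calls. The names signal_seen, on_signal, and raise_and_check are assumptions for illustration; signal(), raise(), and the sig_atomic_t type come from <signal.h>.

```c
#include <signal.h>

/* volatile makes the compiler reread the flag every time it is
   examined; sig_atomic_t is the integer type that can safely be
   written from a signal handler. */
static volatile sig_atomic_t signal_seen = 0;

static void on_signal(int signo)
{
    (void)signo;              /* unused */
    signal_seen = 1;
}

/* Installs the handler, raises SIGINT in this process, and reports
   whether the flag was set. Returns -1 if the handler could not be
   installed. */
int raise_and_check(void)
{
    signal_seen = 0;
    if (signal(SIGINT, on_signal) == SIG_ERR)
        return -1;
    raise(SIGINT);
    return signal_seen;
}
```

Without volatile, a compiler would be entitled to assume that signal_seen is still 0 after the assignment at the top of raise_and_check.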
The restrict qualifier essentially indicates that, within the scope of
the declaration it's in, no other names will point to the same object.
For instance, the declaration of memcpy() in C99 uses restrict
qualifiers on the pointer arguments; this is because it is undefined
behavior for the memory regions given to memcpy() to overlap. By
contrast, memmove(), which can copy overlapping regions, does not have
any restrict qualifiers. The formal definition is complicated, but the
intent is simple.
The restrict qualifier is used entirely to hint to compilers about
possible optimizations. It is similar in a few ways to register. If a
conforming program contains register or restrict keywords, removing
them will not change its behavior.
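A minimal sketch of restrict in a function of your own, assuming a C99 compiler; scale_into is a hypothetical function.

```c
#include <stddef.h>

/* The restrict qualifiers promise that dst and src do not overlap,
   so the compiler may load several src elements before storing to
   dst. Calling this with overlapping arrays is undefined behavior. */
void scale_into(size_t n, double *restrict dst,
                const double *restrict src, double factor)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i] * factor;
}
```

As long as callers keep the promise, deleting both restrict keywords leaves the results unchanged; only the freedom given to the optimizer differs.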
Incomplete types
Normally, a type indicates the amount of storage used for an object, and
the representation used to store values. However, not all declarations
are complete. For example, it is possible to declare an array of unknown
size. As long as the array is actually defined somewhere else, the
declaration which doesn't specify the size of the array can still be used
to access the array's members.
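In a sketch, with made-up names, and with the definition placed in the same file only to keep the example self-contained:

```c
/* Incomplete type: the size is not known at this point, so
   sizeof primes would be an error here, but indexing works. */
extern int primes[];

int nth_prime(int n)
{
    return primes[n];
}

/* The definition, which would normally live in another source file,
   completes the type. */
int primes[] = { 2, 3, 5, 7, 11 };
```

Code that sees only the extern declaration has no way to learn the array's length, so the length has to be communicated some other way.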
It is also possible to provide an incomplete definition of a structure or
union type, providing the tag name but not the contents. Such an
incomplete type cannot be directly used, but pointers to such a type can
be used (but not dereferenced). You can use this to provide a level of
encapsulation many people don't realize is possible in C.
Listing 3. Encapsulation in C.
/* foo.h */
struct foo_hidden;
typedef struct foo_hidden *foo;
extern foo new_foo(void);
extern void delete_foo(foo);

/* foo.c */
#include "foo.h"

struct foo_hidden {
    /* contents go here */
};
Notice that the header foo.h declares the struct tag foo_hidden without
describing the contents of this structure, and that it also defines a
typedef which is a pointer to instances of this "unknown" structure.
Code including the foo.h header above can manipulate objects of type
foo only through pointers and only using the provided functions. Code using this header cannot get access to the actual declaration
of the foo_hidden structure. In fact, if foo.c is compiled into a
library and distributed that way, the programmer can't even cheat and read
the code.
This offers some insurance against the crazy things people do to try to
bypass the encapsulation offered by APIs. The standard C FILE
type could be implemented this way; the <stdio.h> API never
manipulates a FILE object directly, only pointers to FILE objects.
In fact, such systems are entirely conforming. The humorous "Your stdio
doesn't appear very std." message that Perl's Configure script emits when
it can't figure out how to poke around inside a private structure is
simply false.
The void type is in effect always an incomplete type; you can use
pointers to void, but they cannot be dereferenced. In standard C, these
pointers also cannot be subjected to pointer arithmetic.
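When you do need to step through memory behind a pointer to void, the portable approach is to convert to a character pointer first. A sketch, with a hypothetical helper name:

```c
#include <stddef.h>

/* Arithmetic on void * is not defined in standard C, so do the
   arithmetic on an unsigned char * and convert back on return. */
void *advance_bytes(void *p, size_t n)
{
    return (unsigned char *)p + n;
}
```

For an array a of int, advance_bytes(a, sizeof(int)) compares equal to &a[1].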
Tune in next week
Now that you've been introduced to some of the basics of C types, the next
article will go into more advanced topics, such as interactions
between types, derived types, and padding bits.
Resources
Learn
- This article represents everything you ever wanted to know about C
  types. For a gentle introduction to C types, see Types by P.J. Plauger
  and Jim Brodie.
- A much more detailed history of C was written by dmr, who ought to
  know.
- The comp.lang.c FAQ is full of useful information about C, including
  the type system, common pitfalls, and nearly everything else. The book
  version, which has additional content, is particularly valuable.
- Andrew Koenig's paper, C Traps and Pitfalls, was later expanded into an
  excellent book of the same name.
- Henry Spencer's Ten commandments for C programmers remain topical and
  relevant today.
- Learn more of the ins and outs of C programming in the comp.lang.c FAQ
  (Frequently Asked Questions) and the comp.lang.c IAC (Infrequently
  Asked Questions).
About the author
Peter Seebach joined the ISO C committee as a hobby some years ago. His
favorite type is int. He has never had any of his own code fail to run
on 64-bit systems. He would love to hear about errors or omissions in
these articles; contact him at developerworks@seebs.plethora.net.