Everything you ever wanted to know about C types, Part 1: What's in a type?

A 32-bit big-endian integer by any other name

This article, first in a four-part series, introduces the basics of the C type system, with an overview of what it means to talk about type and a discussion of the basic types (integer and floating point) in some detail.

Peter Seebach, Freelance author, Plethora.net

Peter Seebach joined the ISO C committee as a hobby some years ago. His favorite type is int. He has never had any of his own code fail to run on 64-bit systems. He would love to hear about errors or omissions in these articles; contact him at developerworks@seebs.plethora.net.



03 January 2006

The C language provides a fairly elaborate system for data types.

This series provides an overview of the C type system, as well as some detailed discussions of some of the musty corners people don't normally know about. It is worth learning. As Henry Spencer wrote:

A programmer should understand the type structure of his language, lest great misfortune befall him. Contrary to the heresies espoused by some of the dwellers on the Western Shore, `int' and `long' are not the same type. The moment of their equivalence in size and representation is short, and the agony that awaits believers in their interchangeability shall last forever and ever once 64-bit machines become common.

The C type system has been adapted to a great number of architectures. As C was adapted to new systems, decisions had to be made. Should the int type be the same size on every new system, or should it be the most convenient size on every system, even if this meant it wasn't always the same size?

If you learn nothing else from these articles, learn this: On any C language implementation compliant with any C standard ever written, sizeof(char) is exactly one, whether char is eight bits, 16, 60, or 64. If you use the GNU autoconf test for sizeof(char), you might as well tattoo "I don't know what sizeof means" on your forehead.

The first article of the series introduces the type system itself, explaining the basic types and the system of type qualifiers and storage-class specifiers.

What do we mean by a type?

The type of an object in C describes the way in which a chunk of memory is associated with a value. In many cases, a type describes a computer's native ways of representing values. For instance, on a typical UNIX® system, declaring a variable as int will reserve enough storage to hold a single processor register. This variable will be manipulated by processor-native instructions, and will have the semantics native to the processor.

C is generally considered a strongly typed language (although some dispute the exact terminology). C compilers can catch many common or likely errors based on the types of variables and expressions. Types include both data representation and operations; for instance, division isn't the same on integer and floating-point values.

One of the most fundamental aspects of a type is its size. In C, size is measured in units of unsigned char, and returned as a value of type size_t, which is some unsigned integer type; the size of a type is the number of unsigned char objects it would take to hold all the bits used to store the object. The built-in sizeof operator yields this size. This is often referred to, quite correctly, as the "size in bytes," but it is important to know that, in C, a /byte/ is "an object of type unsigned char." On a system where unsigned char is larger than eight bits, sizeof(char) is still always one, and everything else is counted in terms of the actual number of bits in a char, not in terms of octets.
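
As a minimal sketch of the point above, the following converts a sizeof result into a bit count using CHAR_BIT from <limits.h>. The helper names storage_bits and bits_in_char are hypothetical, chosen only for this illustration.

```c
#include <limits.h>   /* CHAR_BIT: number of bits in a byte (a char) */
#include <stddef.h>   /* size_t */

/* Hypothetical helper: convert a sizeof result (in bytes) to bits. */
size_t storage_bits(size_t size_in_bytes)
{
    return size_in_bytes * CHAR_BIT;
}

/* sizeof(char) is 1 by definition, so this returns CHAR_BIT everywhere. */
size_t bits_in_char(void)
{
    return storage_bits(sizeof(char));
}
```

Note that storage_bits(sizeof(int)) counts storage bits, which is what sizeof measures; it says nothing about octets unless CHAR_BIT happens to be eight.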

There is more to a type than just a number of bits. There are multiple possible data representations, although the C standard mandates a /binary/ representation for integers.

A binary representation is one in which a number consists of a series of bits, each of which is either on or off, whose values are powers of two, and are added together. For positive values, this is easy enough to figure out; there are bits representing the values one, two, four, eight... up to the largest bit that represents a positive value.

The question of how to represent negative numbers is more complicated. A conforming C implementation can represent negative numbers in three ways: two's complement, ones' complement, and sign-magnitude. All three are permitted, but the vast majority of modern systems use two's complement. (The punctuation of the complement types may seem odd; the ones' complement is a complement of plural ones, while the two's complement involves a single power of two being flipped.) More arcane options, such as Gray code, are not allowed. (An implementer wishing to target such a machine would have to go to some lengths to hide the underlying bit representation.)

In sign-magnitude, the top bit represents sign, and everything else is unchanged. A bit pattern that represents four always represents a number with a magnitude of four; the sign bit determines whether it's positive or negative. In ones' complement, negative numbers are handled differently: a given negative value is represented by inverting all the bits from the positive value. All ones represents negative zero.

The two's complement system is different. One way it is often explained is to say that you invert everything, then add 1. So, with an eight-bit value, if you want to write -127, you start by writing 127 (01111111), invert everything (10000000), then add 1 (10000001). Another way to think of it is to simply see the top bit as having a negative value, and all the other bits as still having the same values they always had. So, for instance, 10000001 in two's complement is -128 + 1, or -127. This is particularly intuitive if you're already used to performing bitwise operations. On a signed two's complement 32-bit number, the top bit has the value -2,147,483,648.
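
Both explanations can be sketched in a few lines. This is an illustration only: it assumes the optional uint8_t type exists (it does wherever a byte is eight bits), and the function names are hypothetical.

```c
#include <stdint.h>

/* "Invert everything, then add 1": the 8-bit pattern for -magnitude. */
uint8_t negate_bits8(uint8_t magnitude)
{
    return (uint8_t)((magnitude ^ 0xFFu) + 1u);
}

/* Top bit weighs -128; the remaining bits keep their usual weights. */
int value_of_bits8(uint8_t bits)
{
    int value = bits & 0x7F;
    if (bits & 0x80)
        value -= 128;
    return value;
}
```

Feeding the result of negate_bits8(127) into value_of_bits8 gives -127, so the two descriptions agree on the example in the text.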

The main thing that affects users directly is that, on a two's complement system, the negative range is one larger than the positive range. A signed 16-bit number can represent the range from -32,768 to +32,767. On ones' complement or sign-magnitude systems, the range would be +/-32,767, and there would be two representable 0 values. This is why the standard-imposed minimum ranges for signed types are symmetrical, even though the actual limits on many systems are asymmetrical.

As a side note, the asymmetry makes for a common pitfall in attempts to read numbers in. If you handle negative numbers by calculating their magnitude, then negating, your code may behave surprisingly on the largest negative number it can represent.
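
One way around this pitfall is to accumulate the value as a negative number, since the negative range is never smaller than the positive one. The sketch below is one possible approach, not the only one; parse_int is a hypothetical helper, and it relies on C99's truncating division for its overflow check.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: parse an optionally signed decimal string into *out.
 * Returns false on empty input, a non-digit character, or overflow. */
bool parse_int(const char *s, int *out)
{
    bool negative = false;
    int value = 0;                      /* accumulated as a non-positive number */

    if (*s == '-' || *s == '+') {
        negative = (*s == '-');
        s++;
    }
    if (*s == '\0')
        return false;

    for (; *s != '\0'; s++) {
        if (*s < '0' || *s > '9')
            return false;
        int digit = *s - '0';
        /* Reject if value * 10 - digit would drop below INT_MIN. */
        if (value < (INT_MIN + digit) / 10)
            return false;
        value = value * 10 - digit;
    }

    if (!negative) {
        if (value < -INT_MAX)
            return false;               /* magnitude has no positive counterpart */
        value = -value;
    }
    *out = value;
    return true;
}
```

Because the magnitude is never formed as a positive number, the most negative value parses correctly, and its positive counterpart is rejected instead of overflowing.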

Complimentary signs

If you want to know how the various systems actually represent values, here's what the range of values would look like on a three-bit system.

Raw bits   Sign-magnitude (SM)   Ones' complement (1C)   Two's complement (2C)
000         0                     0                       0
001         1                     1                       1
010         2                     2                       2
011         3                     3                       3
100        -0                    -3                      -4
101        -1                    -2                      -3
110        -2                    -1                      -2
111        -3                    -0                      -1
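
The three columns of the table above can be reproduced with a small decoder for each representation. This is only a sketch; the function names are hypothetical, and both zeros come out as plain 0 because C's int has no way to show "negative zero" here.

```c
/* Each function decodes the low three bits of `bits` (0..7). */

int decode_sign_magnitude3(unsigned bits)
{
    int magnitude = (int)(bits & 3u);
    return (bits & 4u) ? -magnitude : magnitude;
}

int decode_ones_complement3(unsigned bits)
{
    if (bits & 4u)
        return -(int)(~bits & 3u);   /* invert the low bits to get the magnitude */
    return (int)(bits & 3u);
}

int decode_twos_complement3(unsigned bits)
{
    /* The top bit has the value -4; the others keep their usual values. */
    return (int)(bits & 3u) - ((bits & 4u) ? 4 : 0);
}
```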

In addition to the question of storing values, there's also the question of other properties. For instance, the behavior on overflow (when a calculation exceeds the range of values a type can represent) can differ from one system to another. C mandates predictable behavior for overflow on unsigned types, but not on signed types.
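
The difference matters in practice: unsigned arithmetic can be allowed to wrap, while signed arithmetic has to be checked before the operation rather than after. A minimal sketch, with hypothetical function names:

```c
#include <limits.h>
#include <stdbool.h>

/* Unsigned arithmetic is defined to wrap modulo UINT_MAX + 1. */
unsigned wrapping_increment(unsigned u)
{
    return u + 1u;
}

/* Signed overflow is undefined, so test before adding rather than after. */
bool add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;
    return a < INT_MIN - b;
}
```

The check in add_would_overflow is arranged so that the subtraction it performs can never itself overflow.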

Another avenue of potential variance is handling of division. The rules for division and remainder on positive numbers are fairly well defined. C89's requirements for division and remainder on negative numbers were simply that (x/y)*y+(x%y)==x. This leaves open an ambiguity. It's obvious that -5/5 is -1. What's not obvious is how to handle -3/5. Should it be -1, with a remainder of 2, or should it be 0, with a remainder of -3? Both results have been generated by real-world processors. In C99, the decision was made to require truncation towards 0; -3/5 is 0, with a remainder of -3. This imposes additional work on some processors. (Part 3 of this series comes back to this topic.)
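
When the other convention (rounding toward negative infinity) is what you want, it can be built on top of the C99 behavior. The sketch below assumes a C99 or later compiler; floor_div and floor_mod are hypothetical names, the divisor must be nonzero, and the INT_MIN / -1 case is deliberately not handled.

```c
/* Assumes C99 or later, where / truncates toward zero. */
int floor_div(int x, int y)
{
    int q = x / y;
    /* Truncation rounded up if the result is inexact and negative. */
    if ((x % y != 0) && ((x < 0) != (y < 0)))
        q--;
    return q;
}

/* Remainder that matches floor_div: it takes the sign of the divisor. */
int floor_mod(int x, int y)
{
    return x - floor_div(x, y) * y;
}
```

With these, floor_div(-3, 5) is -1 with a floor_mod of 2, while the built-in operators give 0 and -3; both pairs satisfy (x/y)*y+(x%y)==x.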

In many cases, it is possible to convert a value from one type to another; doing so explicitly is called casting. To convert a value in C, you put a type name in parentheses before the expression; this syntax is called a /cast/. For instance, (float) 1 converts the integer constant 1 to type float. Not all types can be converted freely, although most numeric values can be converted from one type to another.

Floating point math

C provides both integer types and floating point types. The floating point types are particularly variable in range and behavior. In general, floating point types have greater range than integers of the same size, and can represent non-integer values. Generally, this is accomplished by having a mantissa (a value) and an exponent. Floating point numbers are expressed as some integer multiplied by some power of two: a positive power of two for large numbers, and a negative power of two for small numbers.

As a result, not all numbers in the range of a floating point type can be represented exactly; for instance, 1/3 cannot be represented exactly, and precision gets worse as numbers get larger: a 32-bit floating point type may be able to represent 8,388,607.5 but not 8,388,608.5. (Why? Because 8,388,608 is 2^23, and from that point on a typical 32-bit format has no bits left over for a fraction.) Most users are bitten at least once by the inability to represent 1/10 exactly in binary floating point, and many have run into the range of numbers (with a 32-bit type, starting around 16.7 million) where even data to the left of the decimal point can't be represented.
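
A quick way to see where whole numbers stop fitting is to round-trip them through float. The sketch below assumes float has a 24-bit significand, as in the common IEEE 754 single format (FLT_MANT_DIG == 24); float_holds_exactly is a hypothetical helper, meant only for values of modest size.

```c
#include <float.h>     /* FLT_MANT_DIG: bits of precision in a float */
#include <stdbool.h>

/* Returns true if n survives a round trip through float unchanged.
 * Intended for modest values; n must stay well within the range of long
 * after rounding. */
bool float_holds_exactly(long n)
{
    volatile float f = (float)n;   /* volatile forces a real float store */
    return (long)f == n;
}

/* True when float has the 24-bit precision this example assumes. */
bool float_is_24_bit(void)
{
    return FLT_MANT_DIG == 24;
}
```

Under that assumption, 16,777,216 (2^24) round-trips but 16,777,217 does not, which is the "around 16.7 million" boundary mentioned above.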

Floating point mathematics introduces additional considerations that don't apply to integer arithmetic. For instance, because floating point arithmetic is approximate, the distributive and associative laws don't always apply. (The same is true in some cases of pure integer math as well.) There are more subtle differences. For instance, the standard permits floating point operations to be performed with greater precision or range than the types used for the operands would allow. This can, in some cases, change the results of an operation from what they would be if it were really performed entirely in the specified type.
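
The failure of associativity is easy to provoke. The following sketch adds the same three numbers in two different groupings; the function names are hypothetical, and the volatile temporaries are there to keep the compiler from carrying extra precision between steps.

```c
/* volatile forces each intermediate result to be stored as a double,
 * so extra precision in registers cannot hide the effect. */
double sum_left_first(double a, double b, double c)
{
    volatile double t = a + b;
    volatile double r = t + c;
    return r;
}

double sum_right_first(double a, double b, double c)
{
    volatile double t = b + c;
    volatile double r = a + t;
    return r;
}
```

Assuming an IEEE 754 double, calling both with (1e20, -1e20, 1.0) gives 1.0 from the first and 0.0 from the second, because 1.0 is far too small to register when added to -1e20.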

Floating point types can come in a variety of types, sizes, and representations. The C99 standard introduced a number of new features for floating point mathematics, including support for the IEEE 754 specification for floating point values and operations. Part 3 discusses these features in greater detail.

Minimum maximums

The C standard does not specify a number of bits for a type, but a required range. The ranges are specified in terms of what the smallest acceptable magnitude for the limits is. For instance, the int data type must be able to represent numbers as large as 32,767, and as small as -32,767. As it happens, this requires 16 bits of storage at a minimum. On an actual machine, the int data type may be larger than 16 bits, but not smaller. The unsigned char data type must be able to represent values from 0 to 255, so it must be at least eight bits. Finally, the long type handles ranges of at least +/-2,147,483,647, and the long long type handles ranges of at least +/-9,223,372,036,854,775,807.

Because C allows for machines that don't use two's complement arithmetic, the limits required for signed types are one smaller than the limit that would be native on a two's complement system.
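
In code, the difference between "what this machine offers" and "what every machine must offer" might be sketched like this. Both function names are hypothetical; the limits come from <limits.h>.

```c
#include <limits.h>
#include <stdbool.h>

/* True if value fits in an int on this particular implementation. */
bool fits_in_int(long value)
{
    return value >= INT_MIN && value <= INT_MAX;
}

/* True if value fits in an int on every conforming implementation,
 * using only the guaranteed minimum range of -32,767 to +32,767. */
bool fits_in_portable_int(long value)
{
    return value >= -32767 && value <= 32767;
}
```

Code that needs to hold values outside the second range should use long (or a wider type) rather than hoping that int is large enough.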

Compatible types

Sometimes two or more seemingly distinct types will have the same underlying representation. For instance, char always has exactly the same representation and storage as one of either signed char or unsigned char -- but it is still a distinct type. Similarly, on a very large number of implementations, int has the same representation as one of either short or long -- on a large number of implementations, but not on all. The assumption that this was always the case was typically correct on 16-bit and 32-bit architectures, but would generally be a poor implementation choice on 64-bit architectures. This, in fact, is why the types are still considered distinct. A compiler which fails to warn you that you've mistakenly mixed int and long, because it knows that they happen to share a storage format on this particular platform, is not doing you any favors.

The char types are particularly strange in this respect. Some early compilers had plain char signed by default, because that's how other types worked, while others made it unsigned by default, because this was more convenient to developers in many cases. The debate over whether plain char should be signed or unsigned ended in a draw. If you want to talk about strings, use plain char. If you want to treat memory as an array of bytes, use unsigned char. The signed char type is useful only when you want to use char objects as tiny, signed integers.
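
A short sketch of what this means in practice: CHAR_MIN tells you which way plain char went on your implementation, and converting to unsigned char before calling the <ctype.h> functions keeps the code correct either way. The function names are hypothetical.

```c
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Plain char is signed exactly when its minimum value is negative. */
bool plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Strings are plain char, but the <ctype.h> functions expect a value
 * representable as unsigned char (or EOF), so convert before calling. */
size_t count_upper(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (isupper((unsigned char)*s))
            n++;
    }
    return n;
}
```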

Storage-class specifiers

Storage-class specifiers tell the compiler where or how a given object is stored. The storage-class specifiers discussed here are auto, extern, register, and static.

The auto storage-class specifier requests storage that is local to a function's invocation (instantiated when the function is called and deleted when it returns), and duplicated (as necessary) during recursive calls. auto is the default storage class for objects declared within a function, and it cannot be used on objects declared outside of a function, so it is never actually necessary, and most programmers consider the auto keyword obsolete and never use it explicitly at all.

As the name implies, the register storage-class specifier is a suggestion to a compiler that the object in question might best be stored in a processor register. Depending on register availability, this assignment cannot be guaranteed, of course, but the compiler is supposed to take it as a very strong hint. Since the object might be stored in a register, and thus have no address, a programmer must not attempt to take the address of an object which was declared with the register storage class. Like auto, register cannot be used outside a function.

The extern storage class indicates that an object is being declared, but not defined; the object itself will come from some other module, and the extern declaration merely gives the compiler enough information to generate code which refers to that object. An object declared within a block, with the extern storage-class specifier, is a reference to a global variable.

The static storage-class specifier has a variety of meanings which may seem bewildering to the typical user. For the most part, it really does only one, compound, thing: It defines an object to have permanent storage, but not be visible outside the scope of its declaration.

For an object declared at file scope (outside of any function), the permanent storage is the typical behavior, so the only effect static has is to cause the object to be hidden outside the particular file it was declared in. For an object declared within a function, being invisible outside its scope is the typical behavior, so the only effect static has is to reserve permanent storage for the object.
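
Both uses can be seen side by side in a small sketch. The names are hypothetical; the point is only which property static adds in each position.

```c
/* File scope: storage is already permanent; static hides the name
 * from other source files. */
static int calls_made;

int next_id(void)
{
    /* Block scope: the name is already invisible outside the function;
     * static makes the value persist from one call to the next. */
    static int next = 1;

    calls_made++;
    return next++;
}

int ids_issued(void)
{
    return calls_made;
}
```

Each call to next_id returns a value one higher than the last, which would not happen if next were an ordinary (auto) variable.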

However, it's not a new C standard without a new meaning for the keyword static. The new meaning for static in C99 is that an array argument to a function may be declared with a size modified by the static keyword.

This is an optimization hint: the compiler is promised that every call to the function will provide an array with at least that many elements. For instance, the following function declaration specifies a function which must be called with an array of 16 or more elements.

Listing 1. A new meaning for static
	void foo(int a[static 16]);

This is aimed at compilers doing vectorization and similar work. Unlike some optimization hints, this one ends up allowing the compiler to generate code which can fail in impressive ways if the constraint (that the array has at least that many members) is not met at runtime. The syntax here was invalid in C89, and only some compilers support it now.
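
Here is a sketch of a function that uses the hint. It assumes a compiler that accepts the C99 syntax, and sum_first_16 is a hypothetical name.

```c
/* C99 syntax: every caller promises an array of at least 16 elements,
 * so the function may read a[0] through a[15] unconditionally. */
int sum_first_16(const int a[static 16])
{
    int total = 0;
    for (int i = 0; i < 16; i++)
        total += a[i];
    return total;
}
```

Passing a shorter array (or a null pointer) breaks the promise, and the behavior is then undefined; nothing checks it for you at runtime.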

Although it is not a storage-class specifier, the inline specifier (which can only be applied to function declarations) has similar effects. The inline specifier encourages the compiler to replace calls to a function with code having the same effect; as with register, inline is a hint. It is possible to provide an inline definition of a function which also has an external definition; it is up to the implementation to decide which one to use. Thus, if the semantics of the two functions are different, such that using one of them would be a bug, the program is buggy. The inline specifier does not guarantee that a function will actually be converted to inline instructions; it merely suggests to the compiler that speed is more important than space-efficiency for a given function.

Although it is syntactically a storage-class specifier, the typedef keyword does not specify a storage class; Part 4 of this series explains it.

Type qualifiers

Type qualifiers do not modify the underlying storage or representation of an object, but tell the compiler something about how that object will be used. There are three type qualifiers: const, restrict, and volatile.

The const qualifier is a promise that the programmer won't try to modify something through a given name. If an object is declared as const, the programmer commits to never modifying it; if a pointer to a const-qualified type is declared, the programmer merely commits not to modify the pointed-to object through that pointer. The const qualifier is often misunderstood, not least because it's used differently in C++, which uses it to declare constants.

It doesn't prohibit the value from being changed, either through some other access to the same chunk of storage, or through the actions of the machine. For instance, a timer register might well be declared const, because it can't be written to, but it would still change frequently. Similarly, a function which takes a pointer to a const-qualified value is promising that it won't modify that value, but we cannot assume that the value in question can't be changed by some other code.
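
A minimal sketch of that last point: the object below is reachable through a pointer to const, yet it changes anyway, because the promise covers only accesses made through that pointer. The function name is hypothetical.

```c
int observe_change(void)
{
    int counter = 1;
    const int *view = &counter;  /* promise: no writes through view */

    counter = 2;                 /* still legal: counter itself is not const */
    /* *view = 3; would be rejected by the compiler */

    return *view;                /* reads the updated value */
}
```

The value read through view is the new one, so a function holding a pointer to const cannot assume the data behind it stays put.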

The volatile qualifier instructs the compiler to ensure that all accesses to the qualified object actually occur. This is primarily used for objects which represent hardware, and essentially prohibits optimizations.

Consider what an optimizer might otherwise do to code such as:

Listing 2. A loop that does nothing
	extern int timer;

	/* wait_ten_ticks is a hypothetical wrapper around the loop */
	void wait_ten_ticks(void)
	{
		int x = timer + 10;
		while (timer < x)
			;
	}

If nothing else can change timer, the optimizer can immediately see that the loop condition can never become false, and it is free to read timer once and treat the loop as one that never ends. However, if the memory location accessed by the variable timer is actually a special timer register, this impression is wrong. Therefore, if the declaration instead reads extern volatile int timer, the optimizer is expected to generate code that rereads timer on every pass through the loop.

One common application is threaded code, where one thread can modify a value that another thread watches. Threads are outside the scope of C99 and earlier C standards (C11 later added them), but many implementations provide them. The formal semantics of the volatile qualifier end up making access to shared variables work well enough most of the time, even without specific intent to generate thread-safe code. Be aware, though, that volatile guarantees neither atomic access nor any ordering between threads, so it is not a substitute for real synchronization.

The volatile qualifier is also sometimes used for objects to be modified by signal handlers, for the same reason.
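
The signal-handler case can be sketched entirely in standard C, using signal() and raise() from <signal.h> and a flag of type volatile sig_atomic_t. The names got_signal, on_signal, and signal_was_seen are hypothetical, and SIGINT is used only because it is one of the signals the standard guarantees to exist.

```c
#include <signal.h>

/* Written by the handler, read by ordinary code: volatile keeps the
 * compiler from assuming the value cannot change behind its back. */
static volatile sig_atomic_t got_signal;

static void on_signal(int signum)
{
    (void)signum;
    got_signal = 1;
}

/* Installs the handler, raises SIGINT in this process, restores the
 * default handling, and reports whether the flag was set. */
int signal_was_seen(void)
{
    got_signal = 0;
    if (signal(SIGINT, on_signal) == SIG_ERR)
        return 0;
    raise(SIGINT);
    signal(SIGINT, SIG_DFL);
    return got_signal;
}
```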

The restrict qualifier essentially indicates that, within the scope of the declaration it's in, no other names will point to the same object. For instance, the declaration of memcpy() in C99 uses restrict qualifiers on the pointer arguments; this is because it is undefined behavior for the memory regions given to memcpy() to overlap. By contrast, memmove(), which can copy overlapping regions, does not have any restrict qualifiers. The formal definition is complicated, but the intent is simple.

The restrict qualifier is used entirely to hint to compilers about possible optimizations. It is similar in a few ways to register. If a conforming program contains register or restrict keywords, removing them will not change its behavior.
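
As a sketch of typical use, assuming a C99 compiler: the function below promises that dst does not overlap either input, which leaves the compiler free to load several elements before storing any. The function name is hypothetical.

```c
#include <stddef.h>

/* The caller promises that dst does not overlap a or b. */
void add_arrays(size_t n, float *restrict dst,
                const float *restrict a, const float *restrict b)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

As with memcpy(), calling this with overlapping destination and source regions is the caller's bug; removing the restrict keywords would not change what a correct call does.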

Incomplete types

Normally, a type indicates the amount of storage used for an object, and the representation used to store values. However, not all declarations are complete. For example, it is possible to declare an array of unknown size. As long as the array is actually defined somewhere else, the declaration which doesn't specify the size of the array can still be used to access the array's members.
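
For example, a header might declare an array without a size, leaving the definition to some source file. The sketch below puts both in one place for brevity; lookup_table and table_entry are hypothetical names.

```c
#include <stddef.h>

/* Incomplete type: the element type is known, the size is not. */
extern int lookup_table[];

int table_entry(size_t i)
{
    /* Indexing works; sizeof lookup_table would not compile here. */
    return lookup_table[i];
}

/* The definition, normally in another file, completes the type. */
int lookup_table[] = { 10, 20, 30 };
```

Code that sees only the incomplete declaration cannot find out how many elements there are, so the size usually has to be communicated some other way.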

It is also possible to provide an incomplete definition of a structure or union type, providing the tag name but not the contents. Such an incomplete type cannot be directly used, but pointers to such a type can be used (but not dereferenced). You can use this to provide a level of encapsulation many people don't realize is possible in C.

Listing 3. Encapsulation in C
	/* foo.h */
	struct foo_hidden;                /* tag only; contents are private */
	typedef struct foo_hidden *foo;
	extern foo new_foo(void);
	extern void delete_foo(foo);

	/* foo.c */
	struct foo_hidden {
		/* contents go here */
	};

Notice that the header foo.h declares the struct tag foo_hidden without describing the contents of this structure, and that it also defines a typedef which is a pointer to instances of this "unknown" structure.

Code including the foo.h header above can manipulate objects of type foo only through pointers and only using the provided functions. Code using this header cannot get access to the actual declaration of the foo_hidden structure. In fact, if foo.c is compiled into a library and distributed that way, the programmer can't even cheat and read the code.

This offers some insurance against the crazy things people do to try to bypass the encapsulation offered by APIs. The standard C FILE type could be implemented this way; the <stdio.h> API never manipulates a FILE object directly, only pointers to FILE objects. In fact, such systems are entirely conforming. The humorous "Your stdio doesn't appear very std." message that Perl's Configure script emits when it can't figure out how to poke around inside a private structure is simply false.
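
To make the pattern concrete, here is one way the two halves might fit together, collapsed into a single file. Everything beyond the names in the listing is an assumption made for illustration: the value member, the use of calloc(), and the foo_set_value and foo_get_value accessors are all hypothetical.

```c
#include <stdlib.h>

/* What foo.h exposes: a tag and a pointer typedef, but no contents. */
struct foo_hidden;
typedef struct foo_hidden *foo;

/* What only foo.c sees. The value member is hypothetical. */
struct foo_hidden {
    int value;
};

foo new_foo(void)
{
    return calloc(1, sizeof(struct foo_hidden));   /* NULL on failure */
}

void delete_foo(foo f)
{
    free(f);
}

/* Hypothetical accessors: the only way client code reaches value. */
void foo_set_value(foo f, int value)
{
    f->value = value;
}

int foo_get_value(foo f)
{
    return f->value;
}
```

Client code that includes only the header can hold, pass around, and hand back a foo, but an attempt to write f->value there fails to compile because the structure's contents are unknown.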

The void type is in effect always an incomplete type; you can use pointers to void, but they cannot be dereferenced. In standard C, these pointers also cannot be subjected to pointer arithmetic.
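
In practice this means converting a void pointer to a pointer to a complete type before doing anything with it. A minimal sketch, with hypothetical function names:

```c
#include <stddef.h>

/* Pointer arithmetic is done on unsigned char *, that is, in bytes. */
void *byte_offset(void *base, size_t nbytes)
{
    return (unsigned char *)base + nbytes;
}

/* A void * must be converted to a complete type before dereferencing;
 * p must really point at an int. */
int read_int(const void *p)
{
    return *(const int *)p;
}
```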


Tune in next week

Now that you've been introduced to some of the basics of C types, the next article will go into more advanced topics, such as interactions between types, derived types, and padding bits.
