Level: Intermediate Jonathan Bartlett (johnnyb@eskimo.com), Director of Technology, New Medio
05 Jan 2005 Singly linked lists are a powerful abstraction that allow programmers to represent numerous types of data; extending those lists to handle arbitrary data types can offer effective tools for processing data. In this article, we look at these processes and examine the Lisp variation Scheme, an easy-to-use list-oriented language that delivers list-manipulation capabilities without the complexities of C.
List processing is one of the computer's biggest strengths; singly linked lists provide the basis for a number of interesting algorithms and techniques that are useful in every aspect of programming. I'll show you these techniques in this article, as well as a simple language that can deliver the techniques without having to brave the complexities of C.
First, let's re-examine linked lists.
A refresher in linked lists
A singly linked list is a data structure that has an ordered sequence of nodes in which each node contains a data element and a pointer to the next node in the sequence. Conceptually, a singly linked list of integers looks like the following diagram in a computer's memory:
Figure 1. An example of a linked list of numbers
The arrows in this diagram represent pointers to the location of the next node in memory. Notice that the order of the list is determined by the arrows, not by the memory locations of the nodes.
Finally, notice that the last node in the sequence simply has a mark instead of an arrow. This indicates that there are no more nodes in the list. This marker is usually called the null value, the null list, or the null pointer.
This article will discuss only singly linked lists; the terms "linked lists" and "lists" will both be used to mean singly linked lists.
The list node structure
In C++, the Standard Template Library includes an implementation of linked lists, but for the techniques I'm going to show here, the Standard Template Library implementation is not adequate.
The task of coding linked lists is fairly simple. For example, the linked list of integers in the diagram above could be represented by the following node structure in C:
Listing 1. Basic linked list node structure
#define NULL 0
struct ll_int_node {
int data;
struct ll_int_node *next;
};
|
Here you have a struct that includes the first element as a data item and the second element as a pointer to the next node. NULL is defined simply as a null pointer (0) because the null pointer is invalid as a pointer anyway. To have a variable hold this kind of list in a program, all you need is to have a pointer that points to type struct ll_int_node. This pointer will point to the first element (known as the "head") of the list.
Reviewing common list operations
There are two basic update operations performed in standard linked list implementations: inserting and removing a node from the list.
I won't spend much time analyzing these operations, because my approach to linked lists will be from a different perspective (which I'll explain later). These methods are easily implemented by means of allocating or de-allocating node structures, then modifying the pointers so that the given node is correctly sequenced in the list.
Conceptually, it's just a matter of moving the arrows. Care must be taken, however, to ensure that the head of the list is properly updated when insertions or removals change the list head. For links to more in-depth treatments of the basics of linked list manipulation, see the Resources section at the end of this article.
Storing data: A more generic list structure
It would be quite a pain if you had to come up with a different list node structure for every type of data you wanted to store in lists, especially if you wanted to store multiple types of data in a single list. Therefore, I will show you a more generic method of handling data, a method that will allow you to store any data type in any node.
In order to store arbitrary types of data within a node, I will define a structure that will be handled directly in the case of basic C data types, but will be handled via a pointer in cases of complex types.
In order to accomplish this double mode of handling date, I will use typetags. A typetag is an integer variable in which the value of the integer specifies the type of data being handled. For basic C data types, the typetag lets you know which data type is being used. For complex types, you simply use a pointer and the typetag will tell you to which type of struct is being pointed to. Listing 2 shows the structure definition.
Listing 2. Generic data type structure
typedef struct {
int typetag;
union {
long long_member;
double double_member;
char char_member;
void *pointer_member;
};
} data;
/* Example declaration */
data my_data;
|
In order to make this generic data structure useful, you have to come up with a typetag value for each type you want, including pointer types. Listing 3 illustrates the full program that uses typetags to display different types of data based on the typetag.
Listing 3. Example program using typetags
#include "stdio.h"
/* Typetags for builtin types */
#define TYPETAG_LONG 1
#define TYPETAG_DOUBLE 2
#define TYPETAG_CHAR 3
/* The structure definition from before */
typedef struct {
int typetag;
union {
long long_member;
double double_member;
char char_member;
void *pointer_member;
};
} data;
/* Example struct */
struct person {
char name[200];
char address[200];
};
/* Typetag for a pointer to our struct */
#define TYPETAG_PERSON_PTR 4
void display(data d)
{
if(d.typetag == TYPETAG_LONG)
{
printf("%d\n", d.long_member);
}
else if(d.typetag == TYPETAG_PERSON_PTR)
{
struct person *p = (struct person *)d.pointer_member;
printf("%s\n%s\n", p->name, p->address);
}
/* I could go on for each type, but this is good enough for an example */
}
int main()
{
data my_data;
data more_data;
struct person *p;
/* Use the "long" type */
my_data.typetag = TYPETAG_LONG;
my_data.long_member = 3;
/* Use the "person" type */
p = (struct person *)malloc(sizeof(struct person));
strcpy(p->name, "Test");
strcpy(p->address, "Test's Address");
more_data.typetag = TYPETAG_PERSON_PTR;
more_data.pointer_member = p;
display(my_data);
display(more_data);
return 0;
}
|
If you wanted to be more object-oriented, you could also use the typetag to be an index to an array of vtables. Although an in-depth treatment of such a typetag is beyond the scope of this article, it does indicate the potential power of typetags. My more primitive version still provides you with the concept of a single data structure that can encompass any number of data types.
Armed with this new generic data type, you can now make a very generic list node structure, as shown in Listing 4.
Listing 4. Linked list using generic nodes
struct generic_node {
data the_data;
struct generic_node *next;
};
|
With this generic list node structure, you can now define lists and list operations that work regardless of the type of data stored in them. The typetag mechanism does incur a bit of overhead, but it is usually worth it when compared with the frustration of recoding list after list.
In C++, this overhead can be alleviated through templates. In fact, if you create a generic base class (like Java's Object class), you can actually use just the typing mechanisms of the C++ language rather than custom-implemented typetags. That method falls a little short of the typetag mechanism (which does not have pointer overhead for basic types) but still works well if you are dealing with complex types or if you do not mind wrapping basic types in classes.
Managing memory with linked lists
Although manual memory management with linked lists is possible, it is not a very friendly task. For example, when removing a node from a list, you have to remember to free the memory for the node structure. That's not too bad, but if the data portion of the node is a pointer or contains pointers, then you have a problem, especially if there is no clear, universal memory-management policy in effect for the program.
For example, if there is a policy that demands putting only copies of data structures and not the original structures themselves into lists, then you can safely free the memory; however, you will often want to increase speed and use less memory by just passing a shared pointer. In that case, memory management becomes much more difficult.
As a teaser, another issue with memory management involves the idea of partial sharing of list nodes (which will be discussed below). If you are sharing nodes of a list, then removing a list element or even removing an entire list may not require you to free the memory if it is still in use elsewhere.
Because of these memory issues, I'll assume in the remainder of the article that there is a garbage collector with the programs. Adding a garbage collector to a program is simple and it allows you to share data structures willy-nilly without having to worry about memory-management problems (and if you've never shared data structures will-nilly, you're missing out).
Sharing data structures can lead to other problems if not done carefully, because whenever a data structure is shared through pointers, an update to that structure through any pointer will impact every pointer that points to that structure.
In the article "Inside memory management" (see Resources), various memory-management techniques, including adding a garbage collector, are discussed.
Now I'll examine the properties of the links you'll be using.
Using shared sublists for advanced list programming
Now that you have a handle on linked lists, you should be aware of a few interesting properties of lists that you will exploit:
- Because node structures go in only one direction, they know nothing about the sequence of nodes that precede them.
- Because nodes know nothing about the nodes that precede them, each node is itself the start of a list.
- Because nodes know nothing about the nodes that precede them, a given list may actually be a sublist for many other lists.
The following diagram illustrates some examples to help you better understand these properties of lists.
Figure 2. Examples of list sharing
In this diagram there are three lists. The first list is the sequence (1 57 26 9) (from now on, I am going to write my lists in parentheses to make them easier to read). Now, this list contains several sublists. One sublist in particular is marked out in the diagram as (57 26 9); however, (26 9), (9), and () are all also valid sublists of the first list.
The last example, (), is used to refer to the empty, or null, list. Because the arrows for the nodes point in only one direction, no modification to the data has to be made to use a sublist.
In addition to plain sublists, the diagram also gives an example of shared sublists. The list (18 32 26 9) is actually sharing the last two nodes with the original list. This means that any changes to the original list will affect this one as well. Therefore, this list does not require additional memory for the last two nodes. In addition, this list may be a sublist for other lists and other lists may also use (26 9) as a sublist.
As I mentioned earlier, to take advantage of shared lists, you will definitely need a memory management strategy -- probably garbage collection.
Using sublists to avoid locking
One advantage of linked lists is that you can use them to avoid locking situations. For example, suppose you are programming a game and for this game, you maintain a list of every drawable object and have the drawing thread running independently of the game thread. During the game, you may need to add or remove objects from this list from time to time. While removing objects from the list requires locking the list, adding items (like fired bullets) to the list does not require locking if you add the items to the front of the list. The drawing thread can simply iterate through the list of objects without worry because you know that even if game objects are added during the drawing period, they will not affect the nodes that are currently being drawn. What you are drawing will simply be a sublist of the new objects list.
This technique is effective any time there is an unordered collection that frequently has to be both iterated through and appended to. As long as you append to the front of the list, you won't even have to touch the existing list structure, allowing other threads to iterate through it freely. Note, though, that if you added elements to the end of the list rather than to the front, you would be directly modifying the list structure itself. This could result in race conditions if one thread is adding to the end of the list while another is iterating through it.
Using shared sublists for undo operations
Shared sublists are quite useful for a number of applications, one of which is undo operations. If you structure a document as an append-only linked list, it allows you to easily revert changes back to previous states.
For example, the document would look like this: (modification3 modification2 modification1 original-document). This arrangement allows you to undo any change simply by moving the head of the list to a sublist. Hitting undo twice would cause the document to look like this: (modification1 original-document).
You can also do multiple versions of documents by creating branches that share a document from a specified point. This is the idea behind branches in CVS, for example. You might have (branch-modification2 branch-modification1 modification1 original-document) as a different set of modifications to the original document, as well as the set of modifications I showed earlier.
Using lists to represent trees
Lists can also be used to represent multi-node trees. Binary trees are represented by nodes that have a data value and a left and right pointer; however, there are many data structures that require trees with more than just a left and right branch. Lists are excellent representations of such trees.
It may appear that the list structure can handle only linearly sequenced values, but they can also handle "inverted trees" by using shared sublists -- this is not what I'm discussing here. If you have multiple lists sharing the same sublist, it looks like a tree graphically, but because the lists only go in a straight sequence, you cannot access the other branches from any other branch. Therefore, to represent true trees, you will need a different method.
I talked above about how to use tagged types to enable you to use any type of data in your node. Using that concept, you can also store pointers to list nodes in your node, allowing you to have lists as the data for a list node. Here's how it might look:
Figure 3. An example of a tree using linked lists
Each node can be either a branch or a leaf. You can represent this in parenthetic notation by simply using lists as values within your list. The previous diagram would be represented this way: (((15 42) 22 18) 32 (6 2)). An examination reveals that it is really just a list of three elements -- (element1 element2 element3) -- in which element1 is a list, element2 is the number 32, and element3 is the list (6 2). The list for element1 is itself a tree in which the first element is the list (15 42) and the remaining two elements are 22 and 18. Using lists of lists, therefore, you can represent trees that have any number of branches and leaves.
Using trees to represent syntax
A great use of trees is to represent the syntax of computer languages. A computer language can be viewed as a hierarchical tree of language constructs. At the top is the whole program. This program is composed of an arbitrary number of statements which are composed of an arbitrary number of expressions. In that way, trees (specifically multi-node ones) are ideal for representing computer languages.
As an example, let's examine the parsing of an if statement whose syntax look like the following:
Listing 5. Example syntax for an if statement
if(x > y)
{
call_some_function(a, b, c);
}
|
In this representation, the first element of each list will represent the language operator, action, or function call, and the remaining elements represent the arguments. The parse tree for this might look something like this:
Figure 4. An example of a parse tree using linked lists
You would write this out as (if (> x y) (call_some_function a b c)). Note that you always list the operators/functions first when processing the parse tree, because you need to know what the operator or function you are dealing with is before you know how to handle the arguments.
What's really interesting is that a large part of the parse trees of different languages are actually very similar. Most languages parse if statements and function calls in basically the same way. It would be really interesting though, if there were a language that operated directly on parse trees. In fact, there is one and it is covered in the next section.
A language designed for list manipulation
So far, I have discussed several interesting features of lists and list manipulation:
- Using a list node data structure for lists.
- Using typetags to allow list nodes and variables to store data of any arbitrary type.
- Sharing lists and using sublists.
- Using garbage collection to simplify memory management for lists.
- Representing lists in parenthetical notation.
While all of these items are beneficial, they are somewhat onerous to actually pull off in the C programming language. Assigning typetags, assigning data to the appropriate member of the union, and manually processing list nodes can all be tedious procedures. Using garbage collection can be easy or difficult depending on the garbage collector in use. Parenthetical notation is nice, but it can't be used in the C language.
It would be great if a programming language could take all these techniques and make them an integral part of the language.
That language exists, and it is Lisp. Lisp is a huge language that can be cumbersome to use, but there is a derivative called Scheme that is both powerful and extremely simple. The standard Scheme document, including the index, is only 50 pages long.
Introducing the Scheme syntax
Scheme is almost entirely based on lists and data comes in two varieties, atoms and lists. (There are also vectors, but I won't discuss them here.) An atom is a basic piece of data, such as a number, string, or character. Lists are linked lists of atoms, lists, or both. Even the programs written in Scheme are merely lists that happen to also be parse trees, like the ones you played with previously. In fact, the example parse tree you looked at previously is almost a valid Scheme program.
Here's a concrete, runnable Scheme program: (if (> 7 5) (display "Hello There!\n")). This program will compare 7 and 5, and because 7 is greater than 5, it will print "Hello There!" to the screen.
To run a Scheme program, you need a Scheme interpreter. If you are running Linux, you probably have the GUILE Scheme interpretter installed. To run a Scheme program using GUILE, just type in guile -s filename.scm.
Defining and using variables with Scheme
As I mentioned above, a Scheme program is just a collection of lists, and each list is a parse tree that is run through Scheme's evaluator to produce a result. The evaluator has four basic cases that it handles:
- Is the data a variable name? If so, then look up the variable and return the value. In the case of function names, this returns the function.
- Is the data a list? If not, then it is just a constant value, so return it.
- Is the first element of the list the name of a special form (an operator that isn't a function)? If so, then run the evaluator's internal procedure for that special form and return the result.
- Otherwise, run the evaluator on each element of the list. The first element must evaluate to a function and that function is called with the remaining list elements as arguments. The result of the function is returned as the result.
First, I'll look at the special form used to define variable names, appropriately called define. define statements look like this: (define variablename value). Subsequent changes to the variable are done using the set! special form. It works just like define but is used after a variable is already defined.
To try out these special forms, enter and run the following program:
Listing 6. Defining variable names in Scheme
(define x 200) ;Sets x to be 200
(define y x) ;Sets y to be x, which is evaluated to 200
(define a 50) ;Sets a to be 50
(define b a) ;Sets b to be a, which is evaluted to 50
(set! b 300) ;Sets b to be 300. Note that this does not affect the value of a
(display "x is ") ;display displays its argument, a string in this case
(display x) ;here, x is evaluated, and its result is displayed
(newline) ;newline just prints a newline
(display "y is ") ;now we are doing the same for the other variables
(display y)
(newline)
(display "a is ")
(display a)
(newline)
(display "b is ")
(display b)
(newline)
|
As is obvious from their names, display is a function that prints out its argument and newline is a function that simply prints out an end-of-line character.
You can use any expression that you want as a value. For example, you can use the addition function (called + in Scheme) to calculate values. You can even include the if special form as part of an expression.
Here is another code example using these concepts:
Listing 7. Example program using more complex Scheme expressions
(define a 20)
(define b (+ 20 a)) ;b is given the result of (+ 20 a). Remember that the function name
; is always listed first in the list
(define c (if (> b a) b a)) ;read this as "if b is greater than a then b else a"
(define d (if (eq? c b) "B and C are equal" "B and C are not equal"))
(set! a (+ a a a)) ;a is tripled, and then a is set to the resulting value
(display "a is ") ;now we're just printing out all our variables
(display a)
(newline)
(display "b is ")
(display b)
(newline)
(display "c is ")
(display c)
(newline)
(display "d is ")
(display d)
(newline)
|
Notice how you can stick any type of data into any variable. In the previous program, you used both integer and string data and never had to tell the Scheme language what type of data was being used. Internally, Scheme uses typetags to differentiate between the different types of data.
Creating lists with Scheme
Of course, the whole reason for using Scheme in the first place is to handle list manipulations. In Scheme, list nodes are called pairs. They are slightly different from the node structures I showed you in C, above, because the bottom half of a pair doesn't have to be a pointer to a list node. But for the purposes here, you will use pairs exactly as you use list nodes.
Scheme defines a list as being either the null list or a pair whose bottom half is a list. This means that the bottom half of a pair used in a list will always point to a list. You know you're at the end of a list when you hit the null list, written as ().
To create a list node in Scheme, you use the cons function. cons (which stands for construct) takes two arguments, the node data and the list that the bottom half of the pair will point to. If you are just starting a list, then you will use the null list. For example:
Listing 8. Example program using cons and the null list
(define myfirstschemelist (cons 23 '()))
|
cons is like a specialized malloc for Scheme pairs. In addition, since Scheme is garbage collected, you never have to worry about freeing the pair. Once you stop using the pair, the Scheme runtime will take care of cleaning up the pair for you.
The single quotation mark in front of the null list is used to tell Scheme that you are using () as data and not as a part of your parse tree.
Following is a program in Scheme that demonstrates how to create a list of the numbers 3, 4, and 5:
Listing 9. Example program showing list construction
(define mylist '()) ;The null list is a list
(set! mylist (cons 5 mylist)) ;This constructs a new node with 5 as the data, and set!
; makes mylist now point to the new node
(set! mylist (cons 3 (cons 4 mylist))) ;This constructs two nodes in one statement. The
; result of the second cons is used as the second
; argument of the first cons.
(display mylist) ; let's see what we've created!
(newline)
|
In the following listing, I'll demonstrate how this can be used to create shared sublists.
Listing 10. Example program demonstrating sublists
(define a '(1 2 3)) ;remember, ' means to treat the list as data, so now the variable a
; contains (1 2 3)
(define b (cons "Hello" a)) ;remember, each list element can be of any type.
(define c (cons 55 a)) ;Lists b and c are now both sharing list a as a sublist
(define d (cons 23 (cons "My name is Fred" (cons 22.5 b)))) ;This shares list b, which
; shares list a
;Let's print them out
(display "a is ")
(display a)
(newline)
(display "b is ")
(display b)
(newline)
(display "c is ")
(display c)
(newline)
(display "d is ")
(display d)
(newline)
|
To avoid confusion, it might be worthwhile to draw out each list node. Remember, cons creates list nodes.
Now, in order for lists to be useful, you have to be able to retrieve the data. The function to retrieve the data from the first element of a list is called car (a name which is a holdover from an ancient assembly language instruction). To retrieve the next node of a list, you need to use the function cdr. Notice that this retrieves the entire node and not just the value -- to get the value alone you would have to do a car on the result.
Adding the following code to the previous program will show how car and cdr can be used:
Listing 11. Example program demonstrating the use of car and cdr
(define e (car d)) ;e now has the data of the first node of d
(define f (cdr d)) ;f now has the list starting AFTER the first node of d
(define g (car (cdr d))) ;g now has the data of the second node of d
(define h (car (cdr (cdr (cdr (cdr d)))))) ;h now has hte data of the 5th node of d
(define i (cdr (cdr (cdr (cdr d))))) ;i now has the 4th sublist of d
(display "e is ")
(display e)
(newline)
(display "f is ")
(display f)
(newline)
(display "g is ")
(display g)
(newline)
(display "h is ")
(display h)
(newline)
(display "i is ")
(display i)
(newline)]
|
Iterating through lists in Scheme
In addition to the normal control structures available in other languages, Scheme has some special control structures that deal specifically with lists. The most important of these is map, which applies a function to every element of a list. It is called like this: (map function list-to-process). Therefore, in order to use map, you have to know how to define functions.
In Scheme, functions are defined by the use of the lambda special form, which creates and returns a nameless procedure. The returned procedure can be given a name by the use of define or it can simply be used in place. The first argument to lambda is the list of arguments the function can take. The remaining arguments to lambda are the expressions to be evaluated when the function is called.
Here's a short function to add the number 2 to every element of a list and return the resulting list.
Listing 12. Example program to demonstrate list iteration in Scheme
(define my-list '(1 2 3)) ;our original list
(define new-list
(map
(lambda (x) ;(x) is the argument list -- this takes one argument, called x
(+ x 2) ;add 2 to x and return the value
)
my-list ;this is the list to run it on.
)
)
(display "the original list is ")
(display my-list)
(newline)
(display "the resulting list is ")
(display new-list)
(newline)
|
As you can see, this creates a brand new list that is the result of applying the function to every element of the original list. In addition, it leaves the original list unchanged.
As mentioned, you can also name your functions so that they can be used repeatedly, as in the following:
Listing 13. Example program demonstrating creating and naming functions in Scheme
(define addtwo (lambda (x) (+ x 2)))
(display (addtwo 5))
(newline)
(display (map addtwo '(3 4 5)))
(newline)
|
In most programming languages, defining and naming functions are the same operation. Scheme splits it up into two separate procedures -- lambda, which defines the function, and define, which gives it a name.
Scheme generally provides the programmer the most basic building blocks of programs and allows programmers to combine them in unique and unusual ways. The separation of creating and naming functions, for instance, allows the Scheme programmer to create and use anonymous functions anywhere within their code while leaving normal, named functions easily coded.
End of list
Singly linked lists are a powerful abstraction that allow you to represent numerous types of data. Extending lists to handle arbitrary data types, to enable garbage collection, and to manipulate using list-oriented syntax can provide a useful paradigm for processing data.
The rich and powerful Scheme language is wonderful for list manipulation and is easy to understand and grasp. This article has been just a primer to provide a taste of how easy list manipulation is in list-oriented languages like Scheme. You can achieve all the power of a C implementation without any of the difficulties of managing C structures or C memory or of dealing directly with typetags.
Resources
About the author  | |  | Jonathan Bartlett is the author of the book Programming from the Ground Up, an introduction to programming using Linux assembly language. He is the lead developer at New Media Worx, responsible for developing Web, video, kiosk, and desktop applications for clients.
|
Rate this page
|