 | Level: Intermediate George Lebl (jirka@5z.com), developerWorks columnist
01 Jun 2000 In his second installment on GLib, George Lebl gets into a little more detail. He gives us the rundown on hash tables, and walks us through creating a table, inserting and looking up data, and using the iterator through entries. He also provides code and setup examples of how to use tokens and the GScanner. Given the popularity of my previous GLib article, I've decided to write
a new episode in the saga of GLib. Last time we looked at some very basic GLib concepts and went through some basic containers. This time we'll look at the slightly more exotic features.
We will examine hash tables and the lexical scanner. Hash tables
For those that don't know how hash tables work, here's a quick rundown. A hash
table is for associating data with a key. You want to be able to find
the data again quickly when given the key. It's a very
time-consuming task to look through every possible option, so hash tables figure out a
number from the key called the hash or the hash number. The hash
table is just an array from 0 to the maximum hash number possible.
Each entry in the array has a list of entries which have this hash number.
So when the hash table wants to look something up, it takes the key, figures
out its hash number, and then looks for the entry in the given location in
the array. The reason this is so quick is that most of the time the array
is larger then the number of elements. And each location in the array is not likely to have more than a few entries. Thus the time
spent searching for an element is not dependent on the number of entries
you put into the hash table. When most people think of hash tables, they think of associating a piece of
data with a string key and then retrieving the data quickly using that string
key. GLib takes this concept further. There is no reason for keys to be only strings. The keys in GHashTable, the GLib type for a hash table, are
simple void pointers. The keys can be any sort of data. Each hash
table then needs to be able to give a hash number to a key. It also needs to
be able to compare two keys to see if they are equal. This means that each
hash table stores pointers for two functions: one that gives back an unsigned
integer with the hash of the key, and one that compares two keys for equality
and returns a boolean value. You do not have to provide the table size to GHashTable. It is resized to an optimum value when necessary. This means that the hash table is always
only as big as it really needs to be.
Creating and destroying
OK then, let's start with the code. First you need to create a hash table and
specify the two functions described above. GLib defines three types of
standard functions. The first type is a double set for strings. This is g_str_hash for the
hashing function and g_str_equal for comparing strings. The second type is
g_int_hash and g_int_equal for hashes of integers. In this case you will need
to use the GINT_TO_POINTER and GPOINER_TO_INT macros described in the previous
article, when inserting or looking up things in the hash table. The third
type, g_direct_hash and g_direct_equal, is for using actual pointers as keys.
In this case the hash number is just the pointer value itself, hence the name
of the functions. The function for creating a new hash table is
g_hash_table_new. This takes the two functions as its arguments and returns
a new and empty hash table. After you are done with the hash table, you call
g_hash_table_destroy. This will free and destroy just the hash table itself,
without the data you stored in it; you are responsible for that yourself.
GHashTable *table;
table = g_hash_table_new(g_str_hash, g_str_equal);
/* here we would use the hash table */
g_hash_table_destroy(table);
|
Inserting and looking up data
Now comes the somewhat tricky part. Imagine your keys have some dynamically
allocated data. Most string hashes will work this way. So not only do you
have to allocate and free the data that you store, you also have to allocate
and free the key itself. For strings this is a simple g_strdup when you
insert a piece of data, and g_free when you remove it or before you destroy
the hash table. This is one area where it's easy to go wrong with the
GHashTable. First, make sure that you don't g_free the string too early. That
is, only free the string if you are just about to remove the entry or destroy
the hash table. Second, when inserting, make sure that the entry does not
exist yet. If it does, it will be replaced and the old key and value will
be leaked. Now that we're through with the warnings, let's see the code. The
first call is a call to insert. It takes the hash table, the key, and the value
as arguments. Let's imagine we're reading from a tab-separated spreadsheet
file and we want to insert the second column as data with the first column as
key. We have created our hash table as described, and the variable's name is table.
FILE *fp;
char buf[1024];
/* just a standard way to open files */
fp = fopen("file.txt", "r");
if(!fp) exit(1); /* if file does not exist, exit */
/* read, file line by line */
while(fgets(buf, sizeof(buf), fp)) {
char *key, *value;
/* get the first and the second field */
key = strtok(buf, "\t");
if(!key) continue;
value = strtok(NULL, "\t");
if(!value) continue;
/* insert into our table */
g_hash_table_insert(table, g_strdup(key), g_strdup(value));
}
fclose(fp);
|
This code opens the file and then reads each line into a buffer. It then uses
the standard strtok function from the C library to fetch the first and second
field by separating the line with tabs. This code violates one of the principles
set above. Let's imagine the input file has two lists with exactly the same
first field. When the second line gets inserted,
the first two pointers will just get lost. There will be no way to find
these pointers again and their allocated memory will never be freed. So what
we want to do is look up the key first, and only insert the new value if we don't find any value with that key. The g_hash_table_lookup function takes the
hash table and a key as its arguments. It returns a pointer either to the value, if
it was found, or to NULL, if the key does not exist in the hash table.
char *old_value;
/* try looking up this key */
old_value = g_hash_table_lookup(table, key);
/* if we didn't find anything, insert the new key and value */
if(old_value == NULL) {
g_hash_table_insert(table, g_strdup(key), g_strdup(value));
}
|
This is nice, but what if we want the second line to override the
first line? If we do find a value by the same key as the line that we have
just read, we want to free those old values and insert the new ones. Here we
face a problem. The g_hash_table_lookup doesn't retrieve the key that was
originally used to insert the value. We need to use another function, namely
g_hash_table_lookup_extended. This function is a bit more complicated. It
still takes the hash table and the lookup key as the first two arguments. But
then it takes two more arguments that are pointers to gpointers. This is
because in fact we want to return two pointers: the first is the original key
that was used for insertion, and the second is the value. The g_hash_table_lookup_extended function also
returns a boolean. It will return TRUE if the lookup was successful or FALSE
if the lookup failed. Using this function we can just free the original key
and value and then insert the new key and new value.
char *old_key, *old_value;
/* try looking up this key */
if(g_hash_table_lookup_extended(table, key, &old_key, &old_value)) {
/* insert the new value */
g_hash_table_insert(table, g_strdup(key), g_strdup(value));
/* just free the key and value */
g_free(old_key);
g_free(old_value);
} else
/* insert the new value */
g_hash_table_insert(table, g_strdup(key), g_strdup(value));
|
Notice that we need to insert the new value before freeing the old key and
value. This is because of the first principle, which we set earlier, not to free
the key or the value too soon. The insertion will first try to look up the
entry. So if we freed the key and value before inserting, it would try to
examine a key which points to a non existent memory location. After the
insert, this entry will have been overwritten by the new one and no longer
exists, so we can free the old key and value. A similar problem arises when we want to remove a value from the hash table.
The function g_hash_table_remove takes the hash table and a key as arguments,
and will remove that entry from the hash table. However, you are still
responsible for freeing the data. So here is another use of
g_hash_table_lookup_extended.
/* try looking up this key */
if(g_hash_table_lookup_extended(table, key, &old_key, &old_value)) {
/* remove the entry in the hash table */
g_hash_table_remove(table, key);
/* just free the key and value */
g_free(old_key);
g_free(old_value);
}
|
The above issues are only a concern when using dynamically allocated data
as a key and/or value. For example, if we were using string constants, we
wouldn't ever have to worry about freeing them. Another example would be if you were
hashing by integers.
Iterating through every entry
While hash tables are primarily designed for very fast lookup of specific data
given one key, you sometimes may need to look at and do something with every entry. An example would be freeing every key and value in the
hash table before destroying the table itself. The function for this is,
similar to the case of linked lists, g_hash_table_foreach. It takes 3
arguments: the hash table, the iterating function to call on each entry, and a
user data pointer to pass to the iterating function. Unlike linked
lists, the iterating function here takes three pointers. First it takes the
key, then the value, and then the user data pointer.
/* this just frees the key and the value */
static void
free_a_hash_table_entry(gpointer key, gpointer value, gpointer user_data)
{
g_free(key);
g_free(value);
}
/* somewhere else in the code */
g_hash_table_foreach(table, free_a_hash_table_entry, NULL);
g_hash_table_destroy(table);
|
You may also want to remove specific entries. Imagine wanting to
remove every entry where the value starts with the letter 'A'. Doing it with
the above function would involve working a bit of magic, because of a
restriction on the foreach function. You shouldn't insert or remove any items
while inside the iterating function. This could in fact crash your
application or have unexpected results. But there is a similar foreach
function designed specifically for removing members. This is the
g_hash_table_foreach_remove. It behaves very much like the
g_hash_table_foreach function, except for two differences. The first difference is that the iterating
function should now return a boolean value. If it returns a TRUE, the entry
will be deleted. If it returns FALSE, the entry will be kept. The second
difference is that the function will now return the number of items deleted, in case you
want to keep track of them.
static gboolean
remove_keys_with_A(gpointer key, gpointer value, gpointer user_data)
{
char *char_value = (char *)value;
if(char_value[0] == 'A') {
g_free(key);
g_free(value);
return TRUE;
} else
return FALSE;
}
/* somewhere else in the code */
int deleted;
deleted = g_hash_table_foreach_remove(table, remove_keys_with_A, NULL);
printf("Deleted %d items!\n", deleted);
|
The rest of GHashTable
That's almost it for hash tables. There are 3 more functions GLib provides.
The first two are g_hash_table_freeze and g_hash_table_thaw. These two
functions will stop the hash table from resizing after each insertion or
removal. This can be useful if, for example, you remove a lot of entries,
add a similar number of entries, and then you don't want the hash table to
resize down and back up when it doesn't need to. However, if you do this for the initial insertion, it won't
help you and can, in fact, hurt performance. But hashes are very fast, so this is not a big
problem. The last function is g_hash_table_size, which will just return the
number of entries in the hash table.
The lexical scanner
We will now look at the lexical scanner that GLib provides. Since the scanner is
quite a large and complicated beast, we will only cover the important parts.
A lexical scanner takes textual input, scans
it for individual "tokens," and hands you only the tokens one by one. A token
in this case would be something like a number, a character, a
punctuation mark, or an identifier. The classical example of a lexical
scanner is lex. While lex is very flexible in some areas, its design makes
it difficult to use and unsuitable for many modern programs. GScanner, the
glib lexical scanner, provides an alternative. GScanner was not, however,
designed to scan absolutely anything. It has many limitations. But for
most types of input it will suffice.
Setting up GScanner
GScanner uses a large configuration structure that you fill with your desired
preferences. We will look at this structure later. For now let's just use
the default configuration of GScanner. To create a new scanner, we use
g_scanner_new and pass to it a pointer to our configuration structure. We
can also pass it NULL and it will use the default configuration. We then need to tell GScanner where it should read input from. We can
either give it a standard Unix file descriptor, or we can give it a string
to read from. This is done with the g_scanner_input_file or
g_scanner_input_text. The g_scanner_input_file function takes an extra
argument of the Unix file descriptor to read. The g_scanner_input_text
takes two extra arguments: one is the pointer to the text buffer, and
the other is the length of the text buffer. After we're done with scanning,
we need to use g_scanner_destroy to destroy the scanner and free all its
data.
GScanner *scanner;
int filefd;
filefd = open("file.txt", O_RDONLY);
if(filefd < 0) exit(1); /* we couldn't open the file */
/* create a new scanner */
scanner = g_scanner_new(NULL);
/* set up the scanner to read from the file */
g_scanner_input_file(scanner, filefd);
/* Here we would do our scanning */
...
/* destroy the scanner */
g_scanner_destroy(scanner);
|
Tokens
What GScanner will give you as tokens is actually an enumerated value of type
GTokenType, which is basically just an integer. The types of tokens that you
can get is given by the configuration of GScanner. By default, GScanner will
return all separate ASCII characters as themselves, so you can use the
returned token directly as a character if it is less than 256. There are some
other tokens, however. The obvious one is G_TOKEN_EOF. This is given when the
end of input is reached. On the default configuration you can also get:
G_TOKEN_INT, which means that an integer was found and read on the input,
G_TOKEN_FLOAT, which means a floating point value was read on the input,
G_TOKEN_STRING, which means a string constant was read on the input (a string
constant is any number of characters enclosed in quotes), G_TOKEN_SYMBOL,
which means a symbol which you defined was read on the input,
G_TOKEN_IDENTIFIER, which means that an identifier was read on the input, or, finally,
G_TOKEN_ERROR, which means that an error has occurred while
scanning the input. These last few tokens also have an associated value with them. When GScanner finds one of these values, you can
also get the GTokenValue type.
GTokenValue is a union of different types depending on the token. The name of the
element in this union is a lowercase name of the token with 'v_' prepended.
typedef union _GTokenValue GTokenValue;
union _GTokenValue
{
gpointer v_symbol;
gchar *v_identifier;
gulong v_binary;
gulong v_octal;
gulong v_int;
gdouble v_float;
gulong v_hex;
gchar *v_string;
gchar *v_comment;
guchar v_char;
guint v_error;
};
|
If you are wondering why all the integer values are unsigned long, the answer
is simple. GScanner does not look at the number sign when scanning numbers.
So if the input contains -32 you will get two tokens: one for the minus sign, and one for the integer 32.
Getting tokens and values
Now we need to figure out how to get the token from the scanner. The
basic function to use is g_scanner_get_next_token. This function returns
the next token from the input and advances the scanner. This means that
if you call the g_scanner_get_next_token in a loop, you will get all the
tokens in order one by one. Sometimes you may want to look at the next token
without advancing to it. The g_scanner_peek_next_token does just that. Depending on the context in your
parser, this
function may be useful to see what's next. Once you have your token, if you need
to get the GTokenValue, you can call g_scanner_cur_value.
/* while the next token is something else other than end of file */
while(g_scanner_peek_next_token(scanner) != G_TOKEN_EOF) {
/* get the next token */
GTokenType token = g_scanner_get_next_token(scanner);
if(token == G_TOKEN_IDENTIFIER) {
/* if we have an identifier, print it out */
GTokenValue value = g_scanner_cur_value(scanner);
printf("%s\n", value.v_identifier);
}
}
|
You can also get the current token again with g_scanner_cur_token. If you
also need the current line or the position on the line in the input, you can
use g_scanner_cur_line or g_scanner_cur_position functions. These two functions return
the desired integer. If you want to know if the scanner has already hit
the end of the line, you can check with g_scanner_eof, which returns
TRUE if that is the case. I think this is enough to get you started. One other thing I will mention is
that symbols are basically special identifiers that give you a void pointer instead of the
string. You define each of the symbols and the void
pointers that GScanner should give you. You can also have
GScanner convert this pointer to integer and return it as a token. This
way you can add your own tokens to GScanner. GScanner
also has a setup for handling parsing errors, which I won't cover. We do not have space to go into these issues here. Maybe next time!
The setup
The last thing about GScanner is its setup structure. This is quite a large
structure and I will go through it field by field.
typedef struct _GScannerConfig GScannerConfig;
struct _GScannerConfig
{
/* Character sets
*/
gchar *cset_skip_characters; /* default: " \t\n" */
gchar *cset_identifier_first;
gchar *cset_identifier_nth;
gchar *cpair_comment_single; /* default: "#\n" */
/* Should symbol lookup work case sensitive?
*/
guint case_sensitive : 1;
/* Boolean values to be adjusted "on the fly"
* to configure scanning behavior.
*/
guint skip_comment_multi : 1; /* C like comment */
guint skip_comment_single : 1; /* single line comment */
guint scan_comment_multi : 1; /* scan multi line comments? */
guint scan_identifier : 1;
guint scan_identifier_1char : 1;
guint scan_identifier_NULL : 1;
guint scan_symbols : 1;
guint scan_binary : 1;
guint scan_octal : 1;
guint scan_float : 1;
guint scan_hex : 1; /* `0x0ff0' */
guint scan_hex_dollar : 1; /* `$0ff0' */
guint scan_string_sq : 1; /* string: 'anything' */
guint scan_string_dq : 1; /* string: "\\-escapes!\n" */
guint numbers_2_int : 1; /* bin, octal, hex => int */
guint int_2_float : 1; /* int => G_TOKEN_FLOAT? */
guint identifier_2_string : 1;
guint char_2_token : 1; /* return G_TOKEN_CHAR? */
guint symbol_2_token : 1;
guint scope_0_fallback : 1; /* try scope 0 on lookups? */
};
|
A couple of the string constants have been predefined to save you some typing. First we have G_CSET_A_2_Z and
G_CSET_a_2_z, which are the standard ASCII characters, upper case and
lower case respectively. Then there are G_CSET_LATINC and G_CSET_LATINS,
which are the Latin 1 characters, upper and lower case respectively. Since
these are just character constants, you can put them one after another.
cset_skip_characters
This is a string with a set of characters which are skipped completely,
default is " \t\n".
cset_identifier_first
The first character of an identifier. The identifier must start with
one of these characters. By default this is:
G_CSET_a_2_z
"_"
G_CSET_A_2_Z
cset_identifier_nth
Any other character of the identifier. For a C identifier,
this would include the numbers as well. By default it is:
G_CSET_a_2_z
"_0123456789"
G_CSET_A_2_Z
G_CSET_LATINS
G_CSET_LATINC
cpair_comment_single
A two character string describing a comment. The comment begins with the
first character and ends with the second character. If either one is NULL, the
single style comments are ignored. By default this is the standard hash
mark until end of line comment style and thus it is "#\n".
case_sensitive
When looking up symbols, should we be case sensitive? By default this
is FALSE.
skip_comment_multi
If to skip C style comments. Unfortunately the style of the comments is
hardcoded in. The default is TRUE.
skip_comment_single
Skip the single style comments of the style set above. The default is
TRUE.
scan_comment_multi
Should we scan for C style comments? If you are not skipping C
style comments then you will get a token of G_TOKEN_COMMENT_MULTI. The
default is TRUE.
scan_identifier
Scan for identifiers. If this is FALSE, then you will get the separate
characters only. If it is true, you will get the G_TOKEN_IDENTIFIER.
The default is TRUE.
scan_identifier_1char
Should we allow single character identifiers? If this is false, you will get
the character itself. Default is FALSE.
scan_identifier_NULL
Scan for 'NULL' as an identifier. If this is TRUE, you will get
G_TOKEN_IDENTIFIER_NULL. Default is FALSE.
scan_symbols
Try to convert identifiers into one of the defined symbols. The default
is TRUE.
scan_binary
Scan for binary style integers. If it is TRUE, you will get the
G_TOKEN_BINARY token. The default is FALSE.
scan_octal
Scan for octal style integers. If it is TRUE, you will get the
G_TOKEN_OCTAL token. The default is TRUE.
scan_float
Scan for floating point numbers. If it is TRUE, you will get the
G_TOKEN_FLOAT token. The default is TRUE.
scan_hex
Scan for hex style integers which begin with '0x'. If it is TRUE,
you will get the G_TOKEN_HEX token. The default is TRUE.
scan_hex_dollar
Scan for hex style integers which begin with '$'. If it is TRUE,
you will get the G_TOKEN_HEX token. The default is FALSE.
scan_string_sq
Scan for strings in single quotes. Every character inside the quotes
will be added to the string except for another single quote. This
will get you the G_TOKEN_STRING token. Default is TRUE.
scan_string_dq
Scan for string in double quotes. This will interpret and replace
the standard C backslash requests. This will get you the G_TOKEN_STRING
token. Default is TRUE.
numbers_2_int
Instead of returning all the integer types as different tokens, always
return just G_TOKEN_INT. Default is TRUE.
int_2_float
Instead of returning G_TOKEN_INT, always convert to float and return
G_TOKEN_FLOAT. Default is FALSE.
identifier_2_string
Return all identifiers as G_TOKEN_STRING. Default is FALSE.
char_2_token
If TRUE, this will return characters in the token itself. Otherwise it will return
G_TOKEN_CHAR, and you will have to get the character from GTokenValue.
Default is TRUE.
symbol_2_token
Convert the pointer associated with a symbol to an integer and return
it as a token. Otherwise you will have to get the pointer from
GTokenValue. Default is FALSE.
scope_0_fallback
If TRUE, the scope number 0 will be tried unless a symbol can be found
in the current scope. Default is FALSE.
|
 |
Conclusion
I think this article should give you more insight into how some of the
GLib features work. For the next article I will pick a different topic,
so that people who aren't so fond of GLib will not get bored. We can return to GLib later, if people display further interest.
Resources
About the author  | 
|  |
George (Jiri or Jirka in Czech) Lebl was born in Prague, Czechoslovakia, and now lives in San Diego, California, where he is trying to finish his degree. After several years using non-UNIX operating systems, he started using UNIX and became a UNIX bigot about four years ago, and a Free Software bigot about two years ago. He joined the GNOME project in the fall of 1997, and finally became a C bigot as well. And, most importantly, he's a VI user. He can be reached at jirka@5z.com. |
Rate this page
|  |