Command-line processing and the need for gperf
Command-line processing is historically one of the most ignored areas in software
development. Just about any relatively complicated software has dozens of
available command-line options. In fact, it's not uncommon to find hundreds of
lines of if-else statements coded to process user
input, and maintenance of such legacy code becomes a time-consuming affair even
for seasoned programmers. In such circumstances, most C developers commonly go for
a rather long (and often nested) if-else statement with
ANSI C library functions such as strcmp,
strcasecmp, and strtok
thrown in for good measure, as Listing 1 shows.
Listing 1. C-style command-line processing
if (strtok(cmdstring, "+dumpdirectory"))
{
// code for printing help messages goes here
}
else if (strtok(cmdstring, "+dumpfile"))
{
// code for printing version info goes here
}
|
Instead of the ANSI C-based application program interface (API), a C++ developer
would probably use strings from the Standard Template Library (STL). Still,
there's no escaping the nested sequence of if-else
statements. Clearly, this approach isn't scalable as the number of options
increases. For a typical program invocation with N options, the code would end up
doing O(N2) comparisons. To produce code that runs faster and is easy
to maintain, it makes far more sense to use a hash table that stores the
command-line options, then uses the hash to validate the user-specified input.
This is where gperf comes in. It generates a hash table from the predetermined list of valid command-line options and a lookup function whose time complexity is O(1). Thus, for a typical program invocation with N options, the code needs only O(N) [N*O(1)] comparisons—an order of magnitude improvement over the legacy code.
Gperf reads in a set of keywords from a user-provided file (which typically has a .gperf extension, although this is not mandatory)—for example, commandoptions.gperf—and generates C/C++ sources for the hash table, hashing, and lookup methods. All code is directed at standard output, which in turn must be redirected to a file, like so:
gperf -L C++ command_line_options.gperf > perfecthash.hpp |
Note: The -L option tells gperf that the
generated code should be in C++.
Listing 2 shows the typical format of a gperf input file.
Listing 2. Input file format for gperf
%{
/* C code that goes verbatim in output */
%}
declarations
%%
keywords
%%
functions
|
The format consists of several elements: a C code inclusion, declarations, keywords, and functions.
The C code inclusion is an optional section enclosed between
%{ and %}. The C code and
comments written in this area are copied verbatim into the output file that gperf
generates. (Note the similarities to the GNU flex and bison utilities.)
The declarations section is also optional; you can omit it entirely if you don't
invoke gperf with the -t option. However, if you do
enable this option, the last component in the declaration section must be a
structure whose first field must be a char* or
const char* identifier called name.
It is possible, however, to override the name of the first field by using the
-K option in gperf. For example, if you want to name
this field command_option, make the gperf invocation:
gperf -t -K command_option |
Listing 3 shows the C code inclusion and declarations sections.
Listing 3. C code inclusion and declarations sections
%{
struct CommandOptionCode {
enum {
HELPVERBOSE = 1,
..., // more option codes here
_64BIT = 5
};
};
typedef struct CommandOptionCode CommandOptionCode;
%}
struct CommandOption
{
const char* command_option;
int OptionCode;
};
%%
|
The keywords section contains the keywords
—in this case,
predefined command-line arguments. Each line in this section that begins with the
number sign (#) in the first column is a comment. The keywords should be the first
field of each non-comment line; the string quotes that you typically associate
with a char* are optional here. In addition, fields can
follow the leading keyword, and you must separate them with commas and terminate
at the end of the line. These fields have a direct correspondence with the last
structure provided in the declarations section, as Listing 4 shows.
Listing 4. The keywords section
%%
+helpverbose, CommandOptionCode::HELPVERBOSE
+append_log, CommandOptionCode::APPEND_LOG
+compile, CommandOptionCode::COMPILE
|
The first entry refers to the
const char* command_option field of the structure
CommandOption, as shown in
Listing 3; the second entry refers to the
int OptionCode field in the same structure. So, what
exactly is happening here? This is, in fact, the gperf way of initializing the
hash table that will store the command-line options and their associated
attributes.
Functions is another optional section. All the text in this section starting with
%% and, extending to the end of the file, is copied
verbatim into the generated file. Just like the declarations section, it is the
user's responsibility to have valid C/C++ code in this section.
Gperf hashes a predefined set of keywords, then quickly performs lookups for
these same keywords. In line with this philosophy, gperf outputs two functions:
hash() and in_word_set().
The former is the hashing routine, while the latter is used for lookup. Gperf
output can be in either the C or C++ language—whichever you specify. If
you specify C for the output, two C functions with the above names are generated.
If the output language is C++, gperf generates a class named
Perfect_Hash, which contains these two methods.
Note: Use the -Z option to change the generated
class name.
The prototype for the hashing function is:
unsigned int hash (const char *str, unsigned int len); |
where str represents the command-line option, and
len represents its length. For example, if the
command-line argument were +helpverbose,
str would be +helpverbose,
and len would be 12.
Within the gperf-generated hash, in_word_set() is the
lookup function. The prototype for the routine depends on the user-specified
-t option. If you haven't specified this option, you're
simply dealing with the user-specific command strings stored as data in the
gperf-generated hash rather than the structure associated with the command string.
For example, in Listing 3, you have the structure
CommandOption associated with a user command argument,
which is what the in_word_set() routine would return.
You can change the name of this routine using the -N
option. The arguments to the routine are similar to the previously explained
hash() function:
const struct CommandOption* in_word_set (const char *str, unsigned int len); |
Gperf is a highly customizable tool that accepts several options. The gperf online manual (see the link in the Resources section below) describes all the options available in gperf, including:
-
-L language-name: Instructs gperf to generate output in the specified programming language. The following options are currently supported:-
KR-C: This old-style K&R C is supported by both old and new C compilers, but the new ANSI C-compliant compilers may generate warnings or, in certain cases, even flag errors. - C: This option generates C code but might not be compiled using some old C compilers without tweaking existing sources.
-
ANSI-C: This option generates ANSI C-compliant code, which can only be compiled by ANSI C-compliant or C++ compilers. - C++: This option generates C++ code.
-
-
-N: This option allows you to change the name of the lookup function. The default isin_word_set(). -
-H: This option allows you to change the name of the hashing routine. The default name ishash(). -
-Z: This option is useful when the-LC++ option is provided. It allows you to specify the name of the generated C++ class that houses thein_word_set()andhash()functions. The default name isPerfect_Hash. -
-G: This option generates the lookup table as a static global variable rather than hiding it by generating it inside the lookup function (the default behavior). -
-C: Gperf generates lookup tables as discussed earlier. The-Coption creates the lookup tables declared with theconstkeyword; the contents of all generated lookup tables then are constant—that is, read-only. Many compilers can generate more efficient code for this by putting the tables in read-only memory. -
-D: This option handles keywords that hash to duplicate values. -
-t: This option allows inclusion of a keyword structure. -
-K: This function allows the user to select the name of the keyword component in the keyword structure. -
-p: This option is supported for compatibility with previous releases of gperf. In earlier versions, it changes the return value of the generated functionin_word_set()from its default Boolean value (that is, 0 or 1) to thepointer to wordlist arraytype. This was most useful when the-toption, which allowed user-definedstructs, was used. In the latest releases of gperf, this option is not required and can be dropped.
Static search set is an abstract data type with the operations
initialize, insert, and
retrieve. Perfect hash functions are time- and
space-efficient implementations of static search sets. Gperf is a perfect
hash-function generator that constructs perfect hash functions from a
user-supplied list of keywords. Gperf translates n element list of
user-supplied keywords into source code containing k element lookup table
and two functions:
-
hash: This routine uniquely maps keywords to the range 0 .. k - 1, where k = n. If k = n,hash()is considered a minimal perfecthash()function. Such ahash()function has two properties:-
perfect property: Locating a table entry requires O(1) time—that is, at most, one string comparison is required to perform keyword recognition within the static search set. -
minimal property: The memory allocated to store the keywords is precisely large enough for the keyword set and no larger.
-
-
in_word_set: This routine useshash()to determine whether a particular string belongs to the user-supplied list, using one string comparison in the most common case.
Gperf's internal implementation centers on two internal data structures: the
keyword signatures list (Key_List) and the
associated values array (asso_values). Every
user-specified keyword and its attributes are read from a user-specified file and
stored as a node on a linked list, called Key_List.
Gperf considers only a subset of each keyword's characters as the key when it
searches for a perfect hash() function. The subset is
called the keyword signature, or keysig.
The associated values array is generated inside the
hash() function and is indexed using
keysig characters. Gperf repeatedly searches for an
associated values configuration that maps all n
keysigs onto non-duplicated hash values. A perfect
hash() function is produced when gperf finds a
configuration that assigns each keysig to a unique
location within the generated lookup table. The resulting perfect
hash() function returns an unsigned
int value in the range 0..(k-1), where k
is the maximum keyword hash value +1.
When k = n, a minimal perfect hash() function is
produced. A keyword's hash value is typically computed by combining the associated
values of its keysig with its length. By default, the
hash() function adds the associated value of a
keyword's first index position plus the associated value of its last index
position to its length; for example:
hash_value = length + asso_values[(unsigned char)keyword[1]]; |
Here's a small project to illustrate the concepts discussed so far. Consider the gperf file shown in Listing 5.
Listing 5. command_options.gperf
%{
#include "command_options.h"
typedef struct CommandOptionCode CommandOptionCode;
%}
struct CommandOption
{
const char *Option;
int OptionCode;
};
%%
+helpverbose, CommandOptionCode::HELPVERBOSE
+password, CommandOptionCode::PASSWORD
+nocopyright, CommandOptionCode::NOCOPYRIGHT
+nolog, CommandOptionCode::NOLOG
+_64bit, CommandOptionCode::_64BIT
|
Listing 6 shows the command_options.h header included
in the gperf file.
Listing 6. The command_options.h header
#ifndef __COMMANDOPTIONS_H
#define __COMMANDOPTIONS_H
struct CommandOptionCode
{
enum
{
HELPVERBOSE = 1,
PASSWORD = 2,
NOCOPYRIGHT = 3,
NOLOG = 4,
_64BIT = 5
};
};
#endif
|
The gperf command line looks like this:
gperf -CGD -N IsValidCommandLineOption -K Option -L C++ -t
command_line_options.gperf > perfecthash.hpp
|
Also generated as part of the perfecthash.hpp file is the hash table. Because the
-G option is specified from the command line, the hash
table is generated in global scope. Because the -C
option was used for gperf invocation, the hash table is defined with the
const attribute. Listing 7 shows the details of the
generated sources.
Listing 7. Generated perfecthash.hpp
/* C++ code produced by gperf version 3.0.3 */
/* Command-line: 'C:\\gperf\\gperf.exe' -CGD -N IsValidCommandLineOption -K Option
-L C++ -t command_line_options.gperf */
/* Computed positions: -k'2' */
#if !((' ' == 32) && ('!' == 33) && ('"' == 34) && ('#' == 35) \
&& ('%' == 37) && ('&' == 38) && ('\'' == 39) && ('(' == 40) \
&& (')' == 41) && ('*' == 42) && ('+' == 43) && (',' == 44) \
&& ('-' == 45) && ('.' == 46) && ('/' == 47) && ('0' == 48) \
&& ('1' == 49) && ('2' == 50) && ('3' == 51) && ('4' == 52) \
&& ('5' == 53) && ('6' == 54) && ('7' == 55) && ('8' == 56) \
&& ('9' == 57) && (':' == 58) && (';' == 59) && ('<' == 60) \
&& ('=' == 61) && ('>' == 62) && ('?' == 63) && ('A' == 65) \
&& ('B' == 66) && ('C' == 67) && ('D' == 68) && ('E' == 69) \
&& ('F' == 70) && ('G' == 71) && ('H' == 72) && ('I' == 73) \
&& ('J' == 74) && ('K' == 75) && ('L' == 76) && ('M' == 77) \
&& ('N' == 78) && ('O' == 79) && ('P' == 80) && ('Q' == 81) \
&& ('R' == 82) && ('S' == 83) && ('T' == 84) && ('U' == 85) \
&& ('V' == 86) && ('W' == 87) && ('X' == 88) && ('Y' == 89) \
&& ('Z' == 90) && ('[' == 91) && ('\\' == 92) && (']' == 93) \
&& ('^' == 94) && ('_' == 95) && ('a' == 97) && ('b' == 98) \
&& ('c' == 99) && ('d' == 100) && ('e' == 101) && ('f' == 102) \
&& ('g' == 103) && ('h' == 104) && ('i' == 105) && ('j' == 106) \
&& ('k' == 107) && ('l' == 108) && ('m' == 109) && ('n' == 110) \
&& ('o' == 111) && ('p' == 112) && ('q' == 113) && ('r' == 114) \
&& ('s' == 115) && ('t' == 116) && ('u' == 117) && ('v' == 118) \
&& ('w' == 119) && ('x' == 120) && ('y' == 121) && ('z' == 122) \
&& ('{' == 123) && ('|' == 124) && ('}' == 125) && ('~' == 126))
/* The character set is not based on ISO-646. */
#error "gperf generated tables don't work with this execution character set. \
Please report a bug to <bug-gnu-gperf@gnu.org>."
#endif
#line 1 "command_line_options.gperf"
#include "command_options.h"
typedef struct CommandOptionCode CommandOptionCode;
#line 6 "command_line_options.gperf"
struct CommandOption
{
const char *Option;
int OptionCode;
};
#define TOTAL_KEYWORDS 5
#define MIN_WORD_LENGTH 6
#define MAX_WORD_LENGTH 12
#define MIN_HASH_VALUE 6
#define MAX_HASH_VALUE 17
/* maximum key range = 12, duplicates = 0 */
class Perfect_Hash
{
private:
static inline unsigned int hash (const char *str, unsigned int len);
public:
static const struct CommandOption *IsValidCommandLineOption (const char *str,
unsigned int len);
};
inline unsigned int
Perfect_Hash::hash (register const char *str, register unsigned int len)
{
static const unsigned char asso_values[] =
{
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 0, 18, 18, 18, 18,
18, 18, 18, 18, 5, 18, 18, 18, 18, 18,
0, 18, 0, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18
};
return len + asso_values[(unsigned char)str[1]];
}
static const struct CommandOption wordlist[] =
{
#line 15 "command_line_options.gperf"
{"+nolog", CommandOptionCode::NOLOG},
#line 16 "command_line_options.gperf"
{"+_64bit", CommandOptionCode::_64BIT},
#line 13 "command_line_options.gperf"
{"+password", CommandOptionCode::PASSWORD},
#line 14 "command_line_options.gperf"
{"+nocopyright", CommandOptionCode::NOCOPYRIGHT},
#line 12 "command_line_options.gperf"
{"+helpverbose", CommandOptionCode::HELPVERBOSE}
};
static const signed char lookup[] =
{
-1, -1, -1, -1, -1, -1, 0, 1, -1, 2, -1, -1, 3, -1,
-1, -1, -1, 4
};
const struct CommandOption *
Perfect_Hash::IsValidCommandLineOption (register const char *str,
register unsigned int len)
{
if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH)
{
register int key = hash (str, len);
if (key <= MAX_HASH_VALUE && key >= 0)
{
register int index = lookup[key];
if (index >= 0)
{
register const char *s = wordlist[index].Option;
if (*str == *s && !strcmp (str + 1, s + 1))
return &wordlist[index];
}
}
}
return 0;
}
|
Finally, Listing 8 shows the main source listings.
Note: Listing 8 demonstrates that you could look up a command-line option
in the given list of allowed command-line option keywords in constant time and
subsequently take relevant measures to process that option.
IsValidCommandLineOption has a lookup time complexity
of O(1).
Listing 8. gperf.cpp, which defines the entry point for the application
#include "command_options.h"
#include "perfecthash.hpp"
#include <iostream>
#include <string>
using namespace std;
int main(int argc, char* argv[])
{
string cmdLineOption = argv[1]; // First command line argument
const CommandOption* option =
Perfect_Hash::IsValidCommandLineOption(cmdLineOption.c_str(),
cmdLineOption.length());
switch (option->OptionCode)
{
case CommandOptionCode::HELPVERBOSE :
cout << "Application specific detailed help goes here"; break;
default: break;
}
return 0;
}
|
Note: All examples in this article have been tested using gperf version
3.0.3. If you're using an earlier version, you might need to use the
-p option as part of the command-line invocation.
The gperf utility is tuned to quickly generate a perfect hash for small to medium datasets. But gperf has other applications, as well. In fact, it's the tool of choice of maintaining perfect hashes for language keywords in GNU compilers, and recent advances allow you to work with larger datasets. So, consider making gperf part of your next development project.
Learn
- The
gperf
online manual
has more detail on how to use gperf.
- In the
developerWorks Linux zone,
find more resources for Linux developers, including
Linux tutorials,
as well as
our readers' favorite Linux articles and tutorials
over the last month.
- Stay current with
developerWorks technical events and Webcasts.
Get products and technologies
-
Download gperf from the
GNU Web site.
-
Cygwin users can
download the gperf package from SourceForge.
-
Order the SEK for Linux,
a two-DVD set containing the latest IBM trial software for Linux from DB2®,
Lotus®, Rational®, Tivoli®, and WebSphere®.
- With
IBM trial software,
available for download directly from developerWorks, build your next development
project on Linux.
Discuss
- Get involved in the
developerWorks community
through our developer blogs, forums, podcasts, and community topics in our new
developerWorks spaces.
Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX, as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems.
Rahul Kardam is a senior software developer specializing in complex C++-based electronic design automation tools such as simulators for hardware designs. He has programming experience in both Windows and UNIX platforms. Rahul enjoys tinkering with open source software, which he uses as a framework for designing robust, scalable code for the design automation tools he works on.





