Use gperf for efficient C/C++ command line processing

GNU perfect (gperf) hash function generator makes short work of complex input strings

The GNU tool gperf is a "perfect" hash function that, for a given set of user-provided strings, generates C/C++ code for a hash table, a hash function, and a lookup function. Learn how to use gperf for effective command-line processing in your C/C++ code.

Arpan Sen (arpansen@gmail.com), Independent author

Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX, as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems.



Rahul Kumar Kardam (rahul@syncad.com), Senior Software Developer, Synapti Computer Aided Design Pvt Ltd

Rahul Kardam is a senior software developer specializing in complex C++-based electronic design automation tools such as simulators for hardware designs. He has programming experience in both Windows and UNIX platforms. Rahul enjoys tinkering with open source software, which he uses as a framework for designing robust, scalable code for the design automation tools he works on.



25 July 2007

Also available in Russian Japanese

Command-line processing and the need for gperf

Command-line processing is historically one of the most ignored areas in software development. Just about any relatively complicated software has dozens of available command-line options. In fact, it's not uncommon to find hundreds of lines of if-else statements coded to process user input, and maintenance of such legacy code becomes a time-consuming affair even for seasoned programmers. In such circumstances, most C developers commonly go for a rather long (and often nested) if-else statement with ANSI C library functions such as strcmp, strcasecmp, and strtok thrown in for good measure, as Listing 1 shows.

Listing 1. C-style command-line processing
if (strtok(cmdstring, "+dumpdirectory"))
  {
  // code for printing help messages goes here
  }
else if (strtok(cmdstring, "+dumpfile"))
  {
  // code for printing version info goes here
  }

Instead of the ANSI C-based application program interface (API), a C++ developer would probably use strings from the Standard Template Library (STL). Still, there's no escaping the nested sequence of if-else statements. Clearly, this approach isn't scalable as the number of options increases. For a typical program invocation with N options, the code would end up doing O(N2) comparisons. To produce code that runs faster and is easy to maintain, it makes far more sense to use a hash table that stores the command-line options, then uses the hash to validate the user-specified input.

This is where gperf comes in. It generates a hash table from the predetermined list of valid command-line options and a lookup function whose time complexity is O(1). Thus, for a typical program invocation with N options, the code needs only O(N) [N*O(1)] comparisons—an order of magnitude improvement over the legacy code.


Gperf usage pattern

Gperf reads in a set of keywords from a user-provided file (which typically has a .gperf extension, although this is not mandatory)—for example, commandoptions.gperf—and generates C/C++ sources for the hash table, hashing, and lookup methods. All code is directed at standard output, which in turn must be redirected to a file, like so:

gperf  -L C++ command_line_options.gperf > perfecthash.hpp

Note: The -L option tells gperf that the generated code should be in C++.


Gperf input file format

Listing 2 shows the typical format of a gperf input file.

Listing 2. Input file format for gperf
%{
/* C code that goes verbatim in output */
%}
declarations
%%
keywords
%%
functions

The format consists of several elements: a C code inclusion, declarations, keywords, and functions.

C code inclusion

The C code inclusion is an optional section enclosed between %{ and %}. The C code and comments written in this area are copied verbatim into the output file that gperf generates. (Note the similarities to the GNU flex and bison utilities.)

Declarations

The declarations section is also optional; you can omit it entirely if you don't invoke gperf with the -t option. However, if you do enable this option, the last component in the declaration section must be a structure whose first field must be a char* or const char* identifier called name.

It is possible, however, to override the name of the first field by using the -K option in gperf. For example, if you want to name this field command_option, make the gperf invocation:

gperf -t -K command_option

Listing 3 shows the C code inclusion and declarations sections.

Listing 3. C code inclusion and declarations sections
%{
struct CommandOptionCode  {
  enum { 
      HELPVERBOSE = 1,
      ..., // more option codes here
      _64BIT = 5
   };
};
typedef struct CommandOptionCode CommandOptionCode;
%}
struct CommandOption
  {
  const char* command_option;
  int OptionCode;
  };
%%

Keywords

The keywords section contains the keywords—in this case, predefined command-line arguments. Each line in this section that begins with the number sign (#) in the first column is a comment. The keywords should be the first field of each non-comment line; the string quotes that you typically associate with a char* are optional here. In addition, fields can follow the leading keyword, and you must separate them with commas and terminate at the end of the line. These fields have a direct correspondence with the last structure provided in the declarations section, as Listing 4 shows.

Listing 4. The keywords section
%%
+helpverbose, CommandOptionCode::HELPVERBOSE
+append_log, CommandOptionCode::APPEND_LOG
+compile, CommandOptionCode::COMPILE

C++/STL-style initialization

A C++/STL style of initialization would be equivalent to creating an stl::map and using the insert() method to repeatedly insert it into the map. In contrast, whoever is responsible for maintaining the code would actually have to debug the code to figure out where exactly the initialization is occurring for every command-line option, which could and typically is all over the place in poorly written code. Gperf provides a much cleaner interface from that perspective.

The first entry refers to the const char* command_option field of the structure CommandOption, as shown in Listing 3; the second entry refers to the int OptionCode field in the same structure. So, what exactly is happening here? This is, in fact, the gperf way of initializing the hash table that will store the command-line options and their associated attributes.

Functions

Functions is another optional section. All the text in this section starting with %% and, extending to the end of the file, is copied verbatim into the generated file. Just like the declarations section, it is the user's responsibility to have valid C/C++ code in this section.


Gperf output

Gperf hashes a predefined set of keywords, then quickly performs lookups for these same keywords. In line with this philosophy, gperf outputs two functions: hash() and in_word_set(). The former is the hashing routine, while the latter is used for lookup. Gperf output can be in either the C or C++ language—whichever you specify. If you specify C for the output, two C functions with the above names are generated. If the output language is C++, gperf generates a class named Perfect_Hash, which contains these two methods.

Note: Use the -Z option to change the generated class name.

The prototype for the hashing function is:

unsigned int hash (const char *str, unsigned int len);

where str represents the command-line option, and len represents its length. For example, if the command-line argument were +helpverbose, str would be +helpverbose, and len would be 12.

Within the gperf-generated hash, in_word_set() is the lookup function. The prototype for the routine depends on the user-specified -t option. If you haven't specified this option, you're simply dealing with the user-specific command strings stored as data in the gperf-generated hash rather than the structure associated with the command string.

For example, in Listing 3, you have the structure CommandOption associated with a user command argument, which is what the in_word_set() routine would return. You can change the name of this routine using the -N option. The arguments to the routine are similar to the previously explained hash() function:

const struct CommandOption* in_word_set (const char *str, unsigned int len);

Common gperf options

Gperf is a highly customizable tool that accepts several options. The gperf online manual (see the link in the Resources section below) describes all the options available in gperf, including:

  • -L language-name: Instructs gperf to generate output in the specified programming language. The following options are currently supported:
    • KR-C: This old-style K&R C is supported by both old and new C compilers, but the new ANSI C-compliant compilers may generate warnings or, in certain cases, even flag errors.
    • C: This option generates C code but might not be compiled using some old C compilers without tweaking existing sources.
    • ANSI-C: This option generates ANSI C-compliant code, which can only be compiled by ANSI C-compliant or C++ compilers.
    • C++: This option generates C++ code.
  • -N: This option allows you to change the name of the lookup function. The default is in_word_set().
  • -H: This option allows you to change the name of the hashing routine. The default name is hash().
  • -Z: This option is useful when the -L C++ option is provided. It allows you to specify the name of the generated C++ class that houses the in_word_set() and hash() functions. The default name is Perfect_Hash.
  • -G: This option generates the lookup table as a static global variable rather than hiding it by generating it inside the lookup function (the default behavior).
  • -C: Gperf generates lookup tables as discussed earlier. The -C option creates the lookup tables declared with the const keyword; the contents of all generated lookup tables then are constant—that is, read-only. Many compilers can generate more efficient code for this by putting the tables in read-only memory.
  • -D: This option handles keywords that hash to duplicate values.
  • -t: This option allows inclusion of a keyword structure.
  • -K: This function allows the user to select the name of the keyword component in the keyword structure.
  • -p: This option is supported for compatibility with previous releases of gperf. In earlier versions, it changes the return value of the generated function in_word_set() from its default Boolean value (that is, 0 or 1) to the pointer to wordlist array type. This was most useful when the -t option, which allowed user-defined structs, was used. In the latest releases of gperf, this option is not required and can be dropped.

Gperf internals overview

Static search set is an abstract data type with the operations initialize, insert, and retrieve. Perfect hash functions are time- and space-efficient implementations of static search sets. Gperf is a perfect hash-function generator that constructs perfect hash functions from a user-supplied list of keywords. Gperf translates n element list of user-supplied keywords into source code containing k element lookup table and two functions:

  • hash: This routine uniquely maps keywords to the range 0 .. k - 1, where k = n. If k = n, hash() is considered a minimal perfect hash() function. Such a hash() function has two properties:
    • perfect property: Locating a table entry requires O(1) time—that is, at most, one string comparison is required to perform keyword recognition within the static search set.
    • minimal property: The memory allocated to store the keywords is precisely large enough for the keyword set and no larger.
  • in_word_set: This routine uses hash() to determine whether a particular string belongs to the user-supplied list, using one string comparison in the most common case.

Gperf's internal implementation centers on two internal data structures: the keyword signatures list (Key_List) and the associated values array (asso_values). Every user-specified keyword and its attributes are read from a user-specified file and stored as a node on a linked list, called Key_List. Gperf considers only a subset of each keyword's characters as the key when it searches for a perfect hash() function. The subset is called the keyword signature, or keysig.

The associated values array is generated inside the hash() function and is indexed using keysig characters. Gperf repeatedly searches for an associated values configuration that maps all nkeysigs onto non-duplicated hash values. A perfect hash() function is produced when gperf finds a configuration that assigns each keysig to a unique location within the generated lookup table. The resulting perfect hash() function returns an unsigned int value in the range 0..(k-1), where k is the maximum keyword hash value +1.

When k = n, a minimal perfect hash() function is produced. A keyword's hash value is typically computed by combining the associated values of its keysig with its length. By default, the hash() function adds the associated value of a keyword's first index position plus the associated value of its last index position to its length; for example:

hash_value = length + asso_values[(unsigned char)keyword[1]];

Sample project

Here's a small project to illustrate the concepts discussed so far. Consider the gperf file shown in Listing 5.

Listing 5. command_options.gperf
%{
#include "command_options.h"
typedef struct CommandOptionCode CommandOptionCode;
%}
struct CommandOption
  {
  const char *Option;
  int OptionCode;
  };
%%
+helpverbose, CommandOptionCode::HELPVERBOSE
+password, CommandOptionCode::PASSWORD
+nocopyright, CommandOptionCode::NOCOPYRIGHT
+nolog, CommandOptionCode::NOLOG
+_64bit, CommandOptionCode::_64BIT

Listing 6 shows the command_options.h header included in the gperf file.

Listing 6. The command_options.h header
#ifndef __COMMANDOPTIONS_H
#define __COMMANDOPTIONS_H
struct CommandOptionCode 
  {
  enum 
    {
    HELPVERBOSE = 1,
    PASSWORD = 2,
    NOCOPYRIGHT = 3,
    NOLOG = 4,
    _64BIT = 5
    };
  };
#endif

The gperf command line looks like this:

gperf -CGD -N IsValidCommandLineOption -K Option -L C++ -t 
    command_line_options.gperf > perfecthash.hpp

Also generated as part of the perfecthash.hpp file is the hash table. Because the -G option is specified from the command line, the hash table is generated in global scope. Because the -C option was used for gperf invocation, the hash table is defined with the const attribute. Listing 7 shows the details of the generated sources.

Listing 7. Generated perfecthash.hpp
/* C++ code produced by gperf version 3.0.3 */
/* Command-line: 'C:\\gperf\\gperf.exe' -CGD -N IsValidCommandLineOption -K Option 
-L C++ -t command_line_options.gperf  */
/* Computed positions: -k'2' */

#if !((' ' == 32) && ('!' == 33) && ('"' == 34) && ('#' == 35) \
      && ('%' == 37) && ('&' == 38) && ('\'' == 39) && ('(' == 40) \
      && (')' == 41) && ('*' == 42) && ('+' == 43) && (',' == 44) \
      && ('-' == 45) && ('.' == 46) && ('/' == 47) && ('0' == 48) \
      && ('1' == 49) && ('2' == 50) && ('3' == 51) && ('4' == 52) \
      && ('5' == 53) && ('6' == 54) && ('7' == 55) && ('8' == 56) \
      && ('9' == 57) && (':' == 58) && (';' == 59) && ('<' == 60) \
      && ('=' == 61) && ('>' == 62) && ('?' == 63) && ('A' == 65) \
      && ('B' == 66) && ('C' == 67) && ('D' == 68) && ('E' == 69) \
      && ('F' == 70) && ('G' == 71) && ('H' == 72) && ('I' == 73) \
      && ('J' == 74) && ('K' == 75) && ('L' == 76) && ('M' == 77) \
      && ('N' == 78) && ('O' == 79) && ('P' == 80) && ('Q' == 81) \
      && ('R' == 82) && ('S' == 83) && ('T' == 84) && ('U' == 85) \
      && ('V' == 86) && ('W' == 87) && ('X' == 88) && ('Y' == 89) \
      && ('Z' == 90) && ('[' == 91) && ('\\' == 92) && (']' == 93) \
      && ('^' == 94) && ('_' == 95) && ('a' == 97) && ('b' == 98) \
      && ('c' == 99) && ('d' == 100) && ('e' == 101) && ('f' == 102) \
      && ('g' == 103) && ('h' == 104) && ('i' == 105) && ('j' == 106) \
      && ('k' == 107) && ('l' == 108) && ('m' == 109) && ('n' == 110) \
      && ('o' == 111) && ('p' == 112) && ('q' == 113) && ('r' == 114) \
      && ('s' == 115) && ('t' == 116) && ('u' == 117) && ('v' == 118) \
      && ('w' == 119) && ('x' == 120) && ('y' == 121) && ('z' == 122) \
      && ('{' == 123) && ('|' == 124) && ('}' == 125) && ('~' == 126))
/* The character set is not based on ISO-646.  */
#error "gperf generated tables don't work with this execution character set. \
Please report a bug to <bug-gnu-gperf@gnu.org>."
#endif

#line 1 "command_line_options.gperf"

#include "command_options.h"

typedef struct CommandOptionCode CommandOptionCode;
#line 6 "command_line_options.gperf"
struct CommandOption
  {
  const char *Option;
  int OptionCode;
  };

#define TOTAL_KEYWORDS 5
#define MIN_WORD_LENGTH 6
#define MAX_WORD_LENGTH 12
#define MIN_HASH_VALUE 6
#define MAX_HASH_VALUE 17
/* maximum key range = 12, duplicates = 0 */

class Perfect_Hash
{
private:
  static inline unsigned int hash (const char *str, unsigned int len);
public:
  static const struct CommandOption *IsValidCommandLineOption (const char *str, 
                                                               unsigned int len);
};

inline unsigned int
Perfect_Hash::hash (register const char *str, register unsigned int len)
{
  static const unsigned char asso_values[] =
    {
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18,  0, 18, 18, 18, 18,
      18, 18, 18, 18,  5, 18, 18, 18, 18, 18,
       0, 18,  0, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
      18, 18, 18, 18, 18, 18
    };
  return len + asso_values[(unsigned char)str[1]];
}

static const struct CommandOption wordlist[] =
  {
#line 15 "command_line_options.gperf"
    {"+nolog", CommandOptionCode::NOLOG},
#line 16 "command_line_options.gperf"
    {"+_64bit", CommandOptionCode::_64BIT},
#line 13 "command_line_options.gperf"
    {"+password", CommandOptionCode::PASSWORD},
#line 14 "command_line_options.gperf"
    {"+nocopyright", CommandOptionCode::NOCOPYRIGHT},
#line 12 "command_line_options.gperf"
    {"+helpverbose", CommandOptionCode::HELPVERBOSE}
  };

static const signed char lookup[] =
  {
    -1, -1, -1, -1, -1, -1,  0,  1, -1,  2, -1, -1,  3, -1,
    -1, -1, -1,  4
  };

const struct CommandOption *
Perfect_Hash::IsValidCommandLineOption (register const char *str, 
                                        register unsigned int len)
{
  if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH)
    {
      register int key = hash (str, len);

      if (key <= MAX_HASH_VALUE && key >= 0)
        {
          register int index = lookup[key];

          if (index >= 0)
            {
              register const char *s = wordlist[index].Option;

              if (*str == *s && !strcmp (str + 1, s + 1))
                return &wordlist[index];
            }
        }
    }
  return 0;
}

Finally, Listing 8 shows the main source listings.

Note: Listing 8 demonstrates that you could look up a command-line option in the given list of allowed command-line option keywords in constant time and subsequently take relevant measures to process that option. IsValidCommandLineOption has a lookup time complexity of O(1).

Listing 8. gperf.cpp, which defines the entry point for the application
#include "command_options.h"
#include "perfecthash.hpp"
#include <iostream>
#include <string>
using namespace std;

int main(int argc, char* argv[])
  {
  string cmdLineOption = argv[1]; // First command line argument
  const CommandOption* option = 
    Perfect_Hash::IsValidCommandLineOption(cmdLineOption.c_str(), 
       cmdLineOption.length());
  switch (option->OptionCode)
    {
    case CommandOptionCode::HELPVERBOSE :
      cout << "Application specific detailed help goes here"; break;
    default: break;
    }
  return 0;
  }

Note: All examples in this article have been tested using gperf version 3.0.3. If you're using an earlier version, you might need to use the -p option as part of the command-line invocation.


Conclusion

The gperf utility is tuned to quickly generate a perfect hash for small to medium datasets. But gperf has other applications, as well. In fact, it's the tool of choice of maintaining perfect hashes for language keywords in GNU compilers, and recent advances allow you to work with larger datasets. So, consider making gperf part of your next development project.

Resources

Learn

Get products and technologies

  • Download gperf from the GNU Web site.
  • Cygwin users can download the gperf package from SourceForge.
  • With IBM trial software, available for download directly from developerWorks, build your next development project on Linux.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Linux on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Linux
ArticleID=242695
ArticleTitle=Use gperf for efficient C/C++ command line processing
publish-date=07252007