Generating a lexical analyzer with the lex command

The lex command helps write a C language program that can receive and translate character-stream input into program actions.

To use the lex command, you must supply a specification file that contains:

Extended regular expressions
    Character patterns that the generated lexical analyzer recognizes.
Action statements
    C language program fragments that define how the generated lexical analyzer reacts to extended regular expressions it recognizes.

For information about the format and logic allowed in this file, see the lex command in Commands Reference, Volume 3.

The lex command generates a C language program that can analyze an input stream using information in the specification file. The lex command then stores the output program in a lex.yy.c file. If the output program recognizes a simple, one-word input structure, you can compile the lex.yy.c output file with the following command to produce an executable lexical analyzer:
cc lex.yy.c -ll

However, if the lexical analyzer must recognize more complex syntax, you can create a parser program to use with the output file to ensure proper handling of any input.
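As a rough illustration of how a parser program consumes the analyzer's output, the following sketch calls yylex() repeatedly until it returns 0, which is how a lex-generated yylex() signals the end of input. Everything else here is an assumption made for the example: the token codes, the parse_sum function, and the stand-in yylex() that replays a fixed token list so the fragment compiles without a lex.yy.c file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes; 0 means end of input, as with lex's yylex(). */
enum { END = 0, NUMBER = 258, PLUS = 259 };

/* Stand-in for the yylex() function that lex generates in lex.yy.c.
 * It replays a fixed token sequence instead of scanning real input. */
static const int *token_stream;

static int yylex(void)
{
    return *token_stream != END ? *token_stream++ : END;
}

/* Accepts the token sequence NUMBER (PLUS NUMBER)*.
 * Returns 1 on success and 0 on a syntax error. */
int parse_sum(const int *tokens)
{
    assert(tokens != NULL);
    token_stream = tokens;

    if (yylex() != NUMBER)
        return 0;
    for (;;) {
        int tok = yylex();
        if (tok == END)
            return 1;               /* input ended after a complete sum */
        if (tok != PLUS || yylex() != NUMBER)
            return 0;               /* anything else is a syntax error */
    }
}
```

In a real program, the stand-in yylex() would be removed and the parser would be linked with the lex.yy.c file instead.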

You can move a lex.yy.c output file to another system, provided that system has a C compiler that supports the lex library functions.

The compiled lexical analyzer performs the following functions:
  • Reads an input stream of characters.
  • Copies any input that no rule matches to the output stream (the default action).
  • Breaks the input stream into smaller strings that match the extended regular expressions in the lex specification file.
  • Executes an action for each extended regular expression that it recognizes. These actions are C language program fragments in the lex specification file. Each action fragment can call actions or subroutines outside of itself.
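The functions listed above can be sketched by hand in C. The fragment below is not the code that lex generates; it is a simplified illustration that assumes a single rule equivalent to [0-9]+ and a hypothetical action, on_number, that adds each matched number to a running sum. Characters that the rule does not match are copied to an output buffer, which mirrors the default behavior of a lex-generated analyzer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical action: what a rule such as  [0-9]+ { ... }  might do. */
static void on_number(const char *text, size_t len, long *sum)
{
    long value = 0;
    for (size_t i = 0; i < len; i++)
        value = value * 10 + (text[i] - '0');
    *sum += value;
}

/* Scans input, runs on_number() for every maximal run of digits, and
 * copies every unmatched character to out. Returns the number of
 * digit runs that were matched. */
int scan(const char *input, char *out, size_t out_size, long *sum)
{
    assert(input != NULL && out != NULL && sum != NULL && out_size > 0);

    size_t i = 0, o = 0;
    int matches = 0;
    *sum = 0;

    while (input[i] != '\0') {
        if (isdigit((unsigned char)input[i])) {
            size_t start = i;
            while (isdigit((unsigned char)input[i]))
                i++;                        /* extend the match */
            on_number(input + start, i - start, sum);
            matches++;
        } else {
            if (o + 1 < out_size)
                out[o++] = input[i];        /* default action: copy */
            i++;
        }
    }
    out[o] = '\0';
    return matches;
}
```

For the input a12b3, this sketch matches two numbers, accumulates a sum of 15, and copies ab to the output buffer.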

The lexical analyzer generated by the lex command uses an analysis method called a deterministic finite-state automaton. This method defines a finite number of states in which the lexical analyzer can exist, along with the rules that determine how each input character moves the lexical analyzer from one state to the next.
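To make the idea of states and transition rules concrete, here is a minimal table-driven automaton written by hand in C. It is only a sketch: the tables that lex actually generates are laid out differently, and the pattern chosen here (a lowercase letter followed by lowercase letters or digits), the state names, and the dfa_accepts function are all assumptions made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum { START, IN_IDENT, DEAD, NUM_STATES };      /* automaton states  */
enum { LETTER, DIGIT, OTHER, NUM_CLASSES };      /* character classes */

/* next_state[current state][character class] gives the new state. */
static const int next_state[NUM_STATES][NUM_CLASSES] = {
    /* START    */ { IN_IDENT, DEAD,     DEAD },
    /* IN_IDENT */ { IN_IDENT, IN_IDENT, DEAD },
    /* DEAD     */ { DEAD,     DEAD,     DEAD },
};

static int char_class(unsigned char c)
{
    if (islower(c)) return LETTER;
    if (isdigit(c)) return DIGIT;
    return OTHER;
}

/* Returns 1 if the whole string s matches the pattern, 0 otherwise. */
int dfa_accepts(const char *s)
{
    assert(s != NULL);

    int state = START;
    for (; *s != '\0'; s++)
        state = next_state[state][char_class((unsigned char)*s)];
    return state == IN_IDENT;       /* IN_IDENT is the only accepting state */
}
```

Each input character causes exactly one table lookup, so the time needed to scan a string grows with the length of the string and not with the number of rules.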

The automaton allows the generated lexical analyzer to look ahead more than one or two characters in an input stream. For example, suppose you define two rules in the lex specification file: one looks for the string ab and the other looks for the string abcdefg. If the lexical analyzer receives an input string of abcdefh, it reads characters to the end of the input string before determining that it does not match the string abcdefg. The lexical analyzer then returns to the rule that looks for the string ab, decides that it matches part of the input, and begins trying to find another match using the remaining input cdefh.
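The backing-up behavior in this example can be imitated with a short C function. The sketch below is not a true automaton and is not what lex emits; it compares the input against the two literal rules directly. It does show the key idea: the scanner remembers the last position at which some rule matched completely, keeps reading while a longer rule might still match, and falls back to the remembered position when the longer match fails. The rules array and the longest_match function are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char *const rules[] = { "ab", "abcdefg" };
enum { NUM_RULES = sizeof rules / sizeof rules[0] };

/* Returns the length of the longest rule that matches at the start of
 * input (0 if none) and stores that rule's index, or -1, in *rule_out. */
size_t longest_match(const char *input, int *rule_out)
{
    assert(input != NULL && rule_out != NULL);

    size_t last_len = 0;    /* last accepting position seen so far */
    int last_rule = -1;

    for (size_t pos = 0; ; pos++) {
        int alive = 0;      /* can any rule still match with more input? */
        for (int r = 0; r < NUM_RULES; r++) {
            size_t len = strlen(rules[r]);
            if (pos > len || strncmp(rules[r], input, pos) != 0)
                continue;               /* rule r has already failed */
            if (pos == len) {
                last_len = pos;         /* rule r matched completely */
                last_rule = r;
            } else {
                alive = 1;              /* rule r needs more characters */
            }
        }
        if (!alive || input[pos] == '\0')
            break;                      /* nothing left to try: back up */
    }
    *rule_out = last_rule;
    return last_len;
}
```

For the input abcdefh, the function reads as far as the h, finds that abcdefg has failed, and returns 2 with rule index 0 (the rule for ab). Scanning would then resume at input + 2, which is the remaining input cdefh.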

Compiling the lexical analyzer

To compile a lex program, do the following:
  1. Use the lex command to change the specification file into a C language program. The resulting program is in the lex.yy.c file.
  2. Use the cc command with the -ll flag to compile and link the program with a library of lex subroutines. The resulting executable program is in the a.out file.
For example, if the lex specification file is called lextest, enter the following commands:
lex lextest
cc lex.yy.c -ll