Get started with GAWK: AWK language fundamentals

Begin learning AWK with the open source GAWK implementation

Michael Stutz, Author, Freelance Developer
Michael Stutz is the author of The Linux Cookbook, which he also designed and typeset using only open source software. His research interests include digital publishing and the future of the book. He has used various UNIX operating systems for 20 years. You can reach him at stutz@dsl.org.

Summary:  Discover the basic concepts of the AWK text-processing and pattern-scanning language. This tutorial gets you started programming in AWK: You'll learn how AWK reads and sorts its input data, run AWK programs, manipulate data, and perform complex pattern matching. When you're finished, you'll also understand GNU AWK (GAWK).

Date:  19 Sep 2006
Level:  Intermediate

Manipulating records and fields

When GAWK reads in a record, it stores all the fields of that record in variables. You can access each field by a $ followed by the field number -- so $1 references the first field, $2 references the second, and so on, all the way to the last field of the record.

Figure 4 shows the sample text delineated in the default records and fields.


Figure 4. Sample file broken down into AWK records and fields

As described in Elements of an input file, you can reference an entire record with $0, which includes all fields and field separators. This is the default value for many commands. For example, print with no arguments, which you've used before, is the same as print $0 -- both commands print the entire current record.

Print specific fields

To output a certain field, give the name of that field as an argument to print. Try printing the first field in every record in the sample file:

$ awk ' { print $1 } ' sample
Heigh-ho!
Most
Then,
$

You can give multiple fields in a print statement, and they can be in any order:

$ awk ' { print $7, $3, $1 } ' sample
holly: heigh-ho! Heigh-ho!
mere is Most
 the Then,
$

Notice that some lines don't have a seventh field. A reference to a field that doesn't exist yields an empty string, so nothing is printed in its place.
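
If you'd rather see explicitly which records lack a field, one option is to compare the field number against NF, the built-in variable that holds the number of fields in the current record. The following is a minimal sketch rather than part of the original tutorial; the inline text stands in for the sample file, and its contents are assumed from the output shown above:

```shell
# Inline stand-in for the sample file (contents assumed from the outputs above).
sample_text='Heigh-ho! sing, heigh-ho! unto the green holly:
Most friendship is feigning, most loving mere folly:
Then, heigh-ho, the holly!'

# Print field 7 when the record has one; otherwise print a visible placeholder.
seventh=$(printf '%s\n' "$sample_text" |
  awk '{ if (NF >= 7) print $7; else print "(no field 7)" }')
printf '%s\n' "$seventh"
```

The placeholder text is arbitrary; the point is only that a missing field can be detected before it's printed.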

When separated by a comma, the fields are output with spaces between them. You can concatenate them by omitting the comma. To print the seventh and eighth fields concatenated together, use:

$ awk ' { print $7 $8 } ' sample
holly:
merefolly:

$

You can combine quoted text with fields. Try this:

$ awk ' { print "Field 2: " $2 } ' sample
Field 2: sing,
Field 2: friendship
Field 2: heigh-ho,
$

You should already be getting an idea of the power of working on fields and records -- with AWK, tabular data is easy to parse, manipulate, and reformat using just a few simple commands. You can use shell redirection to direct the reformatted output to a new file, or pass it down a pipeline.
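
As a rough sketch of both ideas, the following swaps two columns, passes the result down a pipeline to sort, and redirects the sorted output to a file. The input data and the variable names (scores, outfile) are made up for illustration:

```shell
# Hypothetical tabular input: a name and a score on each line.
scores='alice 72
bob 95
carol 88'

# Temporary file to hold the reformatted output (remove it with rm when done).
outfile=$(mktemp)

# Swap the columns, sort numerically with the highest score first,
# and redirect the result to the file.
printf '%s\n' "$scores" | awk '{ print $2, $1 }' | sort -rn > "$outfile"
cat "$outfile"
```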

Working as a filter, this functionality becomes useful in conjunction with other commands. For example, this command reformats the default output of the date command so that it prints in a "day month, year" format:

$ date|awk '{print $3 " " $2 ", " $6}'
29 Nov, 2006
$


Change the field separator

The fields in the examples so far have been separated by space characters. That's the default behavior -- any run of space or tab characters separates one field from the next -- and you can change it. The value of the field separator is contained in the FS variable. Like any variable in AWK, it can be redefined at any time in a program. To use a different field separator for the entire file, redefine it in a BEGIN rule.

Print the first field of the sample data with a field separator of an exclamation point (!):

$ awk ' BEGIN { FS = "!" } { print $1 } ' sample
Heigh-ho
Most friendship is feigning, most loving mere folly:
Then, heigh-ho, the holly
$

Note the differences when printing the second and third fields:

$ awk ' BEGIN { FS = "!" } { print $2 } ' sample
 sing, heigh-ho


$ awk ' BEGIN { FS = "!" } { print $3 } ' sample
 unto the green holly:


$

Try comparing the fields in the output you get with the fields listed in Figure 4.

But the field separator doesn't have to be a single character. Try using a phrase:

$ awk ' BEGIN { FS = "Heigh-ho" } { print $2 } ' sample
! sing, heigh-ho! unto the green holly:


$

In GAWK, the field separator can be any regular expression. To make each input character its own field, set FS to the null (empty) string, as in FS = "".

Capitalization counts. The example above matches only one separator in the entire file, because the capitalized phrase appears only once. The following example uses a bracket expression to match the phrase whether or not its first letter is capitalized:

$ awk ' BEGIN { FS = "[Hh]eigh-ho" } { print $2 } ' sample
! sing, 

, the holly!
$
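
A regular-expression separator is also handy when the delimiters in the input are inconsistent. Here's a minimal sketch, using made-up input, that treats a comma or a semicolon followed by any number of spaces as a single separator:

```shell
# Hypothetical input with inconsistent delimiters: commas or semicolons,
# each optionally followed by spaces.
mixed='red, green;blue
one;two,   three'

# FS is a regular expression: one comma or semicolon, then any run of spaces.
fields=$(printf '%s\n' "$mixed" |
  awk 'BEGIN { FS = "[,;] *" } { print $2 "|" $3 }')
printf '%s\n' "$fields"
```

The vertical bar in the output is only there to make the field boundaries visible.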

You can also change the field separator from the command line by specifying it as a quoted argument to the -F option:

$ awk -F "," ' { print $2 } ' sample
 heigh-ho! unto the green holly:
 most loving mere folly:
 heigh-ho
$

This functionality makes it easy to create one-liners that can parse files, such as /etc/passwd, where fields are delimited by a colon (:) character. You can easily pull out a list of full user names, for example:

$ awk -F ":" ' { print $5 } ' /etc/passwd
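
Because the contents of /etc/passwd differ from system to system, here's a sketch that runs the same kind of one-liner against two made-up entries in the usual seven-field layout. The user names and paths are invented for illustration:

```shell
# Made-up entries in /etc/passwd layout (seven colon-delimited fields).
passwd_lines='root:x:0:0:System Administrator:/root:/bin/sh
jdoe:x:1000:1000:Jane Doe:/home/jdoe:/bin/bash'

# Print the login name, full name, and shell, separated by tabs.
users=$(printf '%s\n' "$passwd_lines" |
  awk -F ":" '{ print $1 "\t" $5 "\t" $7 }')
printf '%s\n' "$users"
```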
                


Change the record separator

As with the field separator, you can change the record separator from its default -- a newline character -- to anything you'd like. Its current value is kept in the RS variable.

Change the record separator to a comma, and try it on the sample file:

$ awk ' BEGIN { RS = "," } //' sample
Heigh-ho! sing
 heigh-ho! unto the green holly:
Most friendship is feigning
 most loving mere folly:
Then
 heigh-ho
 the holly!
$
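
The same technique works on data that isn't organized by line at all. As a small sketch with made-up input, this splits a semicolon-delimited string into records and brackets each one so the record boundaries are easy to see:

```shell
# Hypothetical input: settings separated by semicolons, with no trailing newline.
settings='color=red;size=10;shape=round'

# Each semicolon now ends a record; the brackets mark where each record starts and stops.
records=$(printf '%s' "$settings" |
  awk 'BEGIN { RS = ";" } { print "[" $0 "]" }')
printf '%s\n' "$records"
```

Had the input ended with a newline, that newline would have been part of the last record, because a newline is no longer a record separator once RS is changed.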


Change the output

AWK output, like AWK input, is divided into fields and records, and the output stream has its own separators. The output field separator, which print inserts wherever its arguments are separated by commas, is initially a single space; you can change it by redefining the OFS variable. The output record separator, which print appends to each record it writes, is initially a newline character; you can change it by redefining the ORS variable.
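
Here's a minimal sketch of OFS at work, using made-up input. The first command lists the fields with commas in print; the second assigns a field to itself, which makes AWK rebuild $0 with the current OFS so that a plain print picks up the new separator:

```shell
# Hypothetical whitespace-separated input.
rows='alpha beta gamma
one two three'

# The commas in print are replaced by OFS on output.
csv=$(printf '%s\n' "$rows" |
  awk 'BEGIN { OFS = "," } { print $1, $2, $3 }')
printf '%s\n' "$csv"

# Assigning to any field rebuilds $0 using OFS, so a bare print shows the new separator.
rebuilt=$(printf '%s\n' "$rows" |
  awk 'BEGIN { OFS = "," } { $1 = $1; print }')
printf '%s\n' "$rebuilt"
```

Without the assignment, a bare print would output the record exactly as it was read, spaces and all.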

To strip all newlines from a file and place all the file's text on a single line -- which is useful for certain kinds of textual analysis and filtering -- just change the output record separator to the null (empty) string.

Try it on the sample file:

$ awk 'BEGIN {ORS=""} //' sample
Heigh-ho! sing, heigh-ho! unto the green holly:Most friendship\
is feigning, most loving mere folly:Then, heigh-ho, the holly!$

Every newline is stripped out, including the last. The returning shell prompt is on the same line as the output data. To add a final newline, put it in an END rule:

$ awk 'BEGIN {ORS=""} // { print } END {print "\n"}' sample
Heigh-ho! sing, heigh-ho! unto the green holly:Most friendship\
is feigning, most loving mere folly:Then, heigh-ho, the holly!
$


More GAWK variables

The NF variable contains the number of fields in the current record. Using NF references its numeric value, while using $NF references the contents of the actual field itself. So if a record has 100 fields, print NF outputs the integer 100, while print $100 outputs the same thing as print $NF -- the contents of the last field in the record.
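
Because NF is an ordinary number, it can also be used in arithmetic inside a field reference. As a sketch, $(NF-1) references the next-to-last field; the parentheses matter, because $NF-1 would instead subtract 1 from the value of the last field. The inline text stands in for the sample file, with contents assumed from the outputs shown earlier:

```shell
# Inline stand-in for the sample file (contents assumed from the outputs above).
sample_text='Heigh-ho! sing, heigh-ho! unto the green holly:
Most friendship is feigning, most loving mere folly:
Then, heigh-ho, the holly!'

# Print the field count, the next-to-last field, and the last field of each record.
summary=$(printf '%s\n' "$sample_text" |
  awk '{ print NF, $(NF-1), $NF }')
printf '%s\n' "$summary"
```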

The NR variable, in turn, contains the number of the current record. When the first record is being read, its value is 1; when the second record is being read, it increments to 2, and so on. Use it in an END pattern to output the number of lines in the input:

$ awk 'END { print "Input contains " NR " lines." }' sample
Input contains 3 lines.
$

Note: If the print statement above had been placed in a BEGIN pattern, the program would have reported that its input contained 0 lines, because the value for NR at the time of execution would be 0, as no records of input would have been read yet.
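
To see the difference for yourself, here's a small sketch that prints NR from both a BEGIN rule and an END rule. The three-line input is made up for illustration:

```shell
# Report NR before any input is read and again after all of it has been read.
counts=$(printf 'a\nb\nc\n' |
  awk 'BEGIN { print "BEGIN: NR=" NR } END { print "END: NR=" NR }')
printf '%s\n' "$counts"
```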

Use $NR to print the field whose number matches the current record number:

$ awk '{ print NR, $NR }' sample
1 Heigh-ho!
2 friendship
3 the
$

Take a look again at Figure 4, and see the values for Field 1 in the first record, Field 2 in the second record, and Field 3 in the last record. Compare this with your program output.

Try listing the number of fields for each record and the value of the last field:

$ awk ' { print "Record " NR " has " NF " fields and ends with " $NF}' sample
Record 1 has 7 fields and ends with holly:
Record 2 has 8 fields and ends with folly:
Record 3 has 4 fields and ends with holly!

There are a handful of special GAWK variables that are frequently used. Table 2 lists them and describes their meaning.


Table 2. Common GAWK variables
Variable     Description
NF           The number of fields in the current record.
NR           The number of the current record.
FS           The field separator.
RS           The record separator.
OFS          The output field separator.
ORS          The output record separator.
FILENAME     The name of the input file being read.
IGNORECASE   When set to a nonzero or non-null value, GAWK ignores case in pattern matching.
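
FILENAME is the one variable in the table that hasn't appeared in an example yet. As a minimal sketch, the following creates two throwaway files with mktemp (so their names will differ on every run) and labels each record with the file it came from:

```shell
# Two throwaway input files; mktemp generates the names.
first=$(mktemp)
second=$(mktemp)
printf 'one\ntwo\n' > "$first"
printf 'three\n' > "$second"

# Prefix every record with the name of the file it was read from.
labeled=$(awk '{ print FILENAME ": " $0 }' "$first" "$second")
printf '%s\n' "$labeled"

# Clean up the temporary files.
rm -f "$first" "$second"
```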
