awk [-F ere] [-v var=value …] [program] [var=value …] [file …]
awk [-F ere] [-f prog] [-v var=value …] [var=value …] [file …]
awk is a file-processing language that is well suited to data manipulation and retrieval of information from text files. If you are unfamiliar with the language, you might find it helpful to read the awk information in z/OS UNIX System Services User's Guide first.
pattern {action}
There are two ways to specify the awk program: directly on the command line as the program argument, or in a file named with the -f prog option.
You can specify program directly on the command line only if you do not use any -f prog arguments.
For a summary of the UNIX03 changes to this command, see Shell commands changed for UNIX03.
Files that you specify on the command line with the file argument provide the input data for awk to manipulate. If you specify no files or you specify a dash (-) as a file, awk reads data from standard input (stdin).
var=value
awk -f progfile a=1 f1 f2 a=2 f3
sets a to 1 before reading input from f1 and sets a to 2 before reading input from f3.
Variable initializations that appear before the first file on the command line are performed immediately after the BEGIN action. Initializations appearing after the last file are performed immediately before the END action. For more information about BEGIN and END, see Patterns.
awk -v v1=10 -f prog datafile
awk assigns the variable v1 its value before the BEGIN action of the program (but after default assignments made to such built-in variables as FS and OFMT; these built-in variables have special meaning to awk, as described later).
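The difference in timing between -v assignments and var=value operands can be seen in a minimal sketch. The variable names v and w are hypothetical, and the input is a single line piped through standard input:

```shell
# v is set with -v, so it is visible in BEGIN.
# w is a var=value operand, so it is set only when awk reaches it in the
# argument list, just before reading "-" (standard input).
out=$(printf 'x\n' | awk -v v=1 '
BEGIN { print "BEGIN: v=" v " w=" w }
      { print "main: v=" v " w=" w }
' w=2 -)
printf '%s\n' "$out"
```

In the BEGIN line w should still be empty, while the main rule sees both values.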
awk divides input into records. By default, newline characters separate records; however, you can specify a different record separator if you want.
One at a time, and in order, awk compares each input record with the pattern of every rule in the program. When a pattern matches, awk performs the action part of the rule on that input record. Patterns and actions often refer to separate fields within a record. By default, white space (usually blanks, newlines, or horizontal tab characters) separates fields; however, you can specify a different field separator string using the -F ere option.
You can omit the pattern or action part of an awk rule (but not both). If you omit pattern, awk performs the action on every input record (that is, every record matches). If you omit action, awk's default action is equivalent to: {print}.
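As an illustration of the two shorthand forms, the following sketch runs one rule with only a pattern and one with only an action. The three-line sample input is made up for the example:

```shell
data=$(printf 'a 1\nb 2\nc 3\n')

# Pattern only: the default action {print} prints each matching record.
only_pattern=$(printf '%s\n' "$data" | awk '$2 > 1')

# Action only: the action runs for every input record.
only_action=$(printf '%s\n' "$data" | awk '{ print $1 }')

printf '%s\n' "$only_pattern" "$only_action"
```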
# This is a comment
To continue program lines on the next line, add a backslash (\) to the end of the line. Statement lines ending with a comma (,), double or-bars (||), or double ampersands (&&) continue automatically on the next line.
There are three types of variables in awk: identifiers, fields, and array elements.
An identifier is a sequence of letters, digits, and underscores beginning with a letter or an underscore. These characters must be from the POSIX portable character set. (Data can come from other character sets.)
For a description of fields, see Input.
identifier[subscript]
where subscript has the form expr or expr,expr,…, refer to array elements. Each such expr can have any string value. For multiple expr subscripts, awk concatenates the string values of all expr arguments with the separator character SUBSEP between each. The initial value of SUBSEP is set to \042 (code page 01047 field separator).
We sometimes refer to fields and identifiers as scalar variables to distinguish them from arrays.
You do not declare awk variables, and you do not need to initialize them. The value of an uninitialized variable is the empty string in a string context and the number 0 in a numeric context.
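A short sketch of this behavior, using a hypothetical variable u that is never assigned:

```shell
# u is empty in a string context, 0 in a numeric context, and has length 0.
out=$(awk 'BEGIN { print "[" u "]", u + 0, length(u) }')
printf '%s\n' "$out"
```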
Expressions consist of constants, variables, functions, regular expressions, and subscript-in-array conditions combined with operators. (Subscript-in-array conditions are described in Subscript in array.) Each variable and expression has a string value and a corresponding numeric value; awk uses the value appropriate to the context.
When converting a numeric value to its corresponding string value, awk performs the equivalent of a call to the sprintf() function where the one and only expr argument is the numeric value and the fmt argument is either %d (if the numeric value is an integer) or the value of the variable CONVFMT (if the numeric value is not an integer). The default value of CONVFMT is %.6g. If you use a string in a numeric context, and awk cannot interpret the contents of the string as a number, it treats the value of the string as zero.
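The conversion rules above can be tried with a small sketch. The variable names are illustrative only, and concatenating with an empty string is used here simply as one way to force a string context:

```shell
out=$(awk 'BEGIN {
    x = 3.14159265
    a = x ""            # non-integer: converted with CONVFMT (%.6g by default)
    b = 42 ""           # integer: converted as if with %d
    CONVFMT = "%.2f"
    c = x ""            # converted with the new CONVFMT
    d = "abc" + 1       # a non-numeric string is treated as zero
    print a, b, c, d
}')
printf '%s\n' "$out"
```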
Numeric constants are sequences of decimal digits.
Escape sequence | Character |
---|---|
\a | Audible bell |
\b | Backspace |
\f | Form feed |
\n | Newline |
\r | Carriage return |
\t | Horizontal tab |
\v | Vertical tab |
\ooo | Octal value ooo |
\xdd | Hexadecimal value dd |
\/ | Slash |
\" | Quote |
\c | Any other character c |
/e\.g\./
should be written as: "e\\.g\\."
awk defines the subscript-in-array condition as:
index in array
where index looks like expr or (expr,...,expr). This condition evaluates to 1 if the string value of index is a subscript of array, and to 0 otherwise. This is a way to determine if an array element exists. When the element does not exist, the subscript-in-array condition does not create it.
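A minimal sketch of the subscript-in-array condition follows. The arrays a and m and their subscripts are hypothetical. Counting the elements afterward is one way to check that the test did not create the missing element:

```shell
out=$(awk 'BEGIN {
    a["x"] = 1
    if ("x" in a)    print "x is a subscript"
    if (!("y" in a)) print "y is not a subscript"
    n = 0
    for (k in a) n++            # still only one element: testing "y" did not add it
    print "elements:", n
    m[1,2] = "v"
    if ((1,2) in m)  print "(1,2) is a subscript"
}')
printf '%s\n' "$out"
```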
You can access the symbol table through the built-in array SYMTAB. SYMTAB[expr] is equivalent to the variable named by the evaluation of expr.
For example, SYMTAB["var"] is a synonym for the variable var.
BEGIN { for (i in ENVIRON)
printf("%s=%s\n", i, ENVIRON[i])
exit
}
awk follows the usual precedence order of arithmetic operations, unless overridden with parentheses; a table giving the order of operations appears later in this topic.
The unary operators are +, -, ++, and --, where you can use the ++ and -- operators as either postfix or prefix operators, as in C. The binary arithmetic operators are +, -, *, /, %, and ^.
expr ? expr1 : expr2
evaluates to expr1 if the value of expr is nonzero, and to expr2 otherwise.
If two expressions are not separated by an operator, awk concatenates their string values.
$2 ~ /[0-9]/
selects any line where the second field contains at least one digit. awk interprets any string or variable on the right side of ~ or !~ as a dynamic regular expression.
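As a sketch of a dynamic regular expression, the pattern below is held in a variable rather than written between slashes. The variable name re and the sample input are assumptions made for the example:

```shell
# Print the first field of records whose second field consists only of digits.
out=$(printf 'a 12\nb xy\nc 7z\n' | awk -v re='^[0-9]+$' '$2 ~ re { print $1 }')
printf '%s\n' "$out"
```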
The Boolean operators are || (or), && (and), and ! (not). awk uses short-circuit evaluation when evaluating expressions. With an && expression, if the first operand is false, the entire expression is false and it is not necessary to evaluate the second operand. With an || expression, a similar situation exists if the first operand is true.
var = expr
If op is a binary arithmetic operator, var op= expr is equivalent to var = var op expr, except that var is evaluated only once.
See Table 1 for the precedence rules of the operators.
Operators | Order of operations |
---|---|
(A) | Grouping |
$i V[a] | Field, array element |
V++ V-- ++V --V | Increment, decrement |
A^B | Exponentiation |
+A -A !A | Unary plus, unary minus, logical NOT |
A*B A/B A%B | Multiplication, division, remainder |
A+B A-B | Addition, subtraction |
A B | String concatenation |
A<B A>B A<=B A>=B A!=B A==B | Comparisons |
A~B A!~B | Regular expression matching |
A in V | Array membership |
A && B | Logical AND |
A || B | Logical OR |
A ? B : C | Conditional expression |
V=B V+=B V-=B V*=B V/=B V%=B V^=B | Assignment |
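A few rows of the table can be checked with a short sketch. The expressions are arbitrary; the first one also relies on exponentiation grouping from right to left, which the table does not state but which is standard awk behavior:

```shell
out=$(awk 'BEGIN {
    # 2 ^ 3 ^ 2   groups as 2 ^ (3 ^ 2)
    # 1 + 2 * 3   multiplication binds tighter than addition
    # 1 " " 2 + 3 addition binds tighter than concatenation
    print 2 ^ 3 ^ 2, 1 + 2 * 3, 1 " " 2 + 3
}')
printf '%s\n' "$out"
```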
awk sets the built-in variable ARGC to the number of command-line arguments. The built-in array ARGV has elements subscripted with digits from zero to ARGC-1, giving command-line arguments in the order they appeared on the command line.
The ARGC count and the ARGV vector do not include command-line options (beginning with -) or the program file (following -f). They do include the name of the command itself, initialization statements of the form var=value, and the names of input data files.
awk actually creates ARGC and ARGV before doing anything else. It then “walks through” ARGV, processing the arguments. If an element of ARGV is an empty string, awk skips it. If it contains an equals sign (=), awk interprets it as a variable assignment. If it is a minus sign (-), awk immediately reads input from stdin until it encounters the end of the file. Otherwise, awk treats the argument as a file name and reads input from that file until it reaches the end of the file. awk runs the program by “walking through” ARGV in this way; thus, if the program changes ARGV, awk can read different files and make different assignments.
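The following sketch lists ARGV from inside a BEGIN action. The operands a=1, f1, and f2 are placeholders, and because the program has only a BEGIN action, awk never tries to open them. ARGV[0] is skipped because the command name it holds varies between systems:

```shell
# The -v option is not counted; the var=value operand and file names are.
out=$(awk -v x=1 'BEGIN {
    for (i = 1; i < ARGC; i++)
        print i, ARGV[i]
    print "ARGC=" ARGC
}' a=1 f1 f2)
printf '%s\n' "$out"
```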
awk divides input into records. A record separator character separates each record from the next. The value of the built-in variable RS gives the current record separator character; by default, it begins as the newline (\n). If you assign a different character to RS, awk uses that as the record separator character from that point on.
FS = "[,:$]"
says that commas, colons, or dollar signs can separate fields. As a special case, assigning FS a string that contains only a blank character sets the field separator to white space. In this case, awk considers any sequence of contiguous space or tab characters a single field separator. This is the default for FS. However, if you assign FS a string containing any other character, that character designates the start of a new field. For example, if we set FS="\t" (the tab character), the record
texta \t textb \t \t \t textc
contains five fields, two of which contain only blanks. With the default setting, this record only contains three fields, since awk considers the sequence of multiple blanks and tabs a single separator.
Field specifiers have the form $n, where n runs from 1 through NF. Such a field specifier refers to the nth field of the current input record. $0 (zero) refers to the entire current input record.
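The field counts in the tab example can be reproduced with a brief sketch. The shell variable names are hypothetical, and printf is used only to put real tab characters into the record:

```shell
rec=$(printf 'texta \t textb \t \t \t textc')

# With FS set to a tab, every tab starts a new field.
tab_fields=$(printf '%s\n' "$rec" | awk 'BEGIN { FS = "\t" } { print NF }')

# With the default FS, runs of blanks and tabs act as one separator.
default_fields=$(printf '%s\n' "$rec" | awk '{ print NF }')

printf '%s %s\n' "$tab_fields" "$default_fields"
```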
You can have only a limited number of files and pipes open at one time. You can close files and pipes during execution using the close(expr) function. The expr argument must be one that came before | or after < in getline, or after > or >> in print or printf.
If the function successfully closes the pipe, it returns zero. By closing files and pipes that you no longer need, you can use any number of files and pipes in the course of running an awk program.
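As a sketch of closing a pipe, the program below writes two lines to a sort command and then closes it. The variable names cmd and status are assumptions; the string passed to close must be the same one that was used after the | symbol:

```shell
out=$(awk 'BEGIN {
    cmd = "sort"
    print "b" | cmd
    print "a" | cmd
    status = close(cmd)     # sort sees end of input and writes its sorted output
    print "close returned", status
}')
printf '%s\n' "$out"
```

After the close, the same command string could be opened again as a new pipe.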
regexp divides the record in the same way that the FS field separator string does. If you omit regexp in the call to split, it uses the current value of FS.
%[-][0][x][.y]c
In a string, the precision is the maximum number of characters to be printed from the string; in a number, the precision is the number of digits to be printed to the right of the decimal point in a floating-point value. If x or y is * (asterisk), the minimum field width or precision is the value of the next expr in the call to sprintf.
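A few combinations of flag, width, and precision are shown in the sketch below. The values are arbitrary, and the brackets are there only to make the padding visible:

```shell
out=$(awk 'BEGIN {
    # %-6s   left-justified string in a field of at least 6 characters
    # %06.2f zero-padded number, width 6, 2 digits after the decimal point
    # %.3s   at most 3 characters of the string
    # %5d    right-justified integer in a field of at least 5 characters
    print sprintf("[%-6s][%06.2f][%.3s][%5d]", "ab", 3.14159, "abcdef", 42)
}')
printf '%s\n' "$out"
```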
function name(parameter-list) {
statements
}
A function definition can appear in the place of a pattern {action} rule. The parameter-list argument contains any number of normal (scalar) and array variables separated by commas. When you call a function, awk passes scalar arguments by value, and array arguments by reference. The names specified in parameter-list are local to the function; all other names used in the function are global. You can define local variables by adding them to the end of the parameter list as long as no call to the function uses these extra parameters.
A function returns to its caller either when it runs the final statement in the function, or when it reaches an explicit return statement. The return value, if any, is specified in the return statement (see Actions).
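The passing rules can be illustrated with a small sketch. The function fill and the names sq, count, and i are hypothetical; the extra spaces in the parameter list are a common convention for marking the local variables and have no meaning to awk:

```shell
out=$(awk '
function fill(arr, n,    i) {       # i is an extra parameter used as a local
    for (i = 1; i <= n; i++)
        arr[i] = i * i              # arrays are passed by reference
    n = 0                           # scalars are passed by value
    return i - 1
}
BEGIN {
    count = 3
    i = 99                          # global i is untouched by the local i
    filled = fill(sq, count)
    print sq[3], count, i, filled
}')
printf '%s\n' "$out"
```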
A pattern is a regular expression, a special pattern, a pattern range, or any arithmetic expression.
BEGIN is a special pattern used to label actions that awk performs before reading any input records. END is a special pattern used to label actions that awk performs after reading all input records.
pattern1,pattern2
This matches all lines from one that matches pattern1 to one that matches pattern2, inclusive.
If you omit a pattern, or if the numeric value of the pattern is nonzero (true), awk runs the resulting action for the line.
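The pattern types described above can be combined in one program, as in this sketch. The input is the numbers 1 through 5, chosen only for illustration:

```shell
out=$(printf '1\n2\n3\n4\n5\n' | awk '
BEGIN            { print "start" }              # before any input
$1 == 2, $1 == 4 { print "in range:", $1 }      # pattern range, inclusive
$1 % 2           { print "odd:", $1 }           # arithmetic expression, true if nonzero
END              { print "records:", NR }       # after all input
')
printf '%s\n' "$out"
```

Record 3 matches both the range and the arithmetic pattern, so both actions run for it, in program order.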
# expression statement, e.g. assignment
expression
# if statement
if (condition)
statement
[else
statement]
# while loop
while (condition)
statement
# do-while loop
do
statement
while (condition)
# for loop
for (expression1; condition; expression2)
statement
The for statement is equivalent to:
expression1
while (condition) {
statement
expression2
}
for (i in array)
statement
In a return statement, if you specify an expr, the function returns the value of the expression as its result; otherwise, the function result is undefined.
delete array[i] deletes element i from the given array.
print expr, expr, … is described in Output.
printf fmt, expr, expr, … is also described in Output.
The print statement prints its arguments with only simple formatting. If it has no arguments, it prints the entire current input record. awk adds the output record separator ORS to the end of the output that each print statement produces; when commas separate arguments in the print statement, the output field separator OFS separates the corresponding output values. ORS and OFS are built-in variables, whose values you can change by assigning them strings. The default output record separator is a newline, and the default output field separator is a space.
The variable OFMT gives the format of floating-point numbers output by print. By default, the value is %.6g; you can change this by assigning OFMT a different string value. OFMT applies only to floating-point numbers (ones with fractional parts).
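The effect of OFS, ORS, and OFMT on print can be seen in a short sketch. The separator and format values are arbitrary choices for the example:

```shell
out=$(awk 'BEGIN {
    OFS = "-"; ORS = "!\n"; OFMT = "%.2f"
    print "a", "b"          # comma: values joined with OFS
    print "a" "b"           # no comma: plain concatenation, no OFS
    print 3.14159, 10       # OFMT applies only to the non-integer value
}')
printf '%s\n' "$out"
```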
The printf statement formats its arguments using the fmt argument. Formatting is the same as for the built-in function sprintf. Unlike print, printf does not add output separators automatically. This gives the program more precise control of the output.
The print and printf statements write to stdout. You can redirect output to a file or pipe.
If you add >expr to a print or printf statement, awk treats the string value of expr as a file name, and writes output to that file. Similarly, if you add >>expr, awk appends output to the current contents of the file. The distinction between > and >> is important only for the first print to the file expr. Subsequent outputs to an already open file append to what is there already.
print a > b c
Use parentheses to resolve the ambiguity.
If you add |expr to a print or printf statement, awk treats the string value of expr as an executable command and runs it with the output from the statement piped as input into the command.
As mentioned earlier, you can have only a limited number of files and pipes open at any time. To avoid going over the limit, use the close function to close files and pipes when you no longer need them.
print and printf are also available as functions with the same calling sequence, but no redirection.
awk '{print NR ":" $0}' input1
outputs the contents of the file input1 with line numbers prepended to each line.
awk '{print NR SEP $0}' SEP=":" input1
awk -f addline.awk input1
which produces the same output when the file addline.awk contains:
{print NR ":" $0}
/^January/ {print >> "jan"}
/^February|^March/ {print >> "febmar"}
{s += $NF}
END {print "sum is", s, "average is", s/NR}
{ tmp = $1
$1 = $2
$2 = tmp
print
}
{printf "%-6d: %s\n", NR, $0}
{
a[NR] = $0 # index using record number
}
END {
for (i = NR; i>0; --i)
print a[i]
}
{
++a[$1] # array indexed using the first field
}
END { # note output will be in undefined order
for (i in a)
print a[i], "lines start with", i
}
{
++a[FILENAME]
}
END {
for (file in a)
if (a[file] == 1)
print file, "has 1 line"
else
print file, "has", a[file], "lines"
}
BEGIN {NUMPROD = 5}
{
array[$1,$2] += $3
}
END {
print "\t Jan\t Feb\tMarch\tApril\t May\t" \
"June\tJuly\t Aug\tSept\t Oct\t Nov\t Dec"
for (prod = 1; prod <= NUMPROD; prod++) {
printf "%-7s", "prod#" prod
for (month = 1; month <= 12; month++){
printf "\t%5d", array[prod,month]
}
printf "\n"
}
}
function randint() {
return (int((rand()+1)*10))
}
BEGIN {
prize[randint(),randint()] = "$100";
prize[randint(),randint()] = "$10";
prize[1,1] = "the booby prize"
}
{
if (($1,$2) in prize)
printf "You have won %s!\n", prize[$1,$2]
}
$1==$NF {
for (i = NF; i > 0; --i)
printf "%s", $i (i>1 ? OFS : ORS)
}
function infiles(f,i) {
for (i in f)
delete f[i]
for (i = 1; i < ARGC; i++)
if (index(ARGV[i],"=") == 0)
f[i] = ARGV[i]
}
BEGIN {
infiles(a)
for (i in a)
print a[i]
exit
}
function fact(num) {
if (num <= 1)
return 1
else
return num * fact(num - 1)
}
{ print $0 " factorial is " fact($0) }
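function words(file,   string) {   # string is a local variable
    string = "wc " file            # use the parameter, not the global fn
    string | getline               # reads the wc output into $0
    close(string)
    return ($2)                    # second field of wc output is the word count
}
BEGIN {
    for (i=1; i<ARGC; i++) {
        fn = ARGV[i]
        printf "There are %d words in %s.\n",
            words(fn), fn
    }
}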
Any other environment variable can be accessed by the awk program itself.
See Localization for more information.
When an awk program ends because of a call to exit(), the exit status is the value passed to exit().
Most constructions in this implementation of awk are dynamic, limited only by memory restrictions of the system.
The maximum record size is guaranteed to be at least LINE_MAX as returned by getconf. The maximum field size is guaranteed to be LINE_MAX, also.
The parser stack depth is limited to 150 levels. Attempting to process extremely complicated programs may result in an overflow of this stack, causing an error.
Input must be text files.
POSIX.2, X/Open Portability Guide, UNIX systems.
The ord function is an extension to traditional implementations of awk. The toupper and tolower functions and the ENVIRON array are in POSIX and the UNIX System V Release 4 version of awk. This version is a superset of New awk, as described in The AWK Programming Language by Aho, Weinberger, and Kernighan.
The standard command interpreter that the system function uses and that awk uses to run pipelines for getline, print, and printf is system-dependent. On z/OS UNIX, this interpreter is always /bin/sh.
ed, egrep, sed, vi
For more information about regexp, see Regular expressions (regexp).