The Linux operating system is loaded with files: configuration files, text files, documentation files, log files, user files, and the list goes on and on. Quite often, those files contain information you need to access in order to find important data. Although you can easily dump the contents of most files to the screen with standard utilities such as cat, more, and others, there are utilities better suited for filtering and parsing out only those values that are relevant to you.
As you read this article, you can open your shell and try the examples of each utility.
Before you start, you should first understand what regular expressions are and how to use them.
In their simplest form, regular expressions are the search criteria used for locating text in a file. For example, to find all lines containing the word "admin", you can search for "admin". Thus, "admin" constitutes a regular expression. If you want not only to find "admin" but also to replace it with "root", you can give the appropriate commands in a utility to substitute "root" for "admin". In that case, "admin" is the regular expression to match and "root" is its replacement text.
These basic rules govern regular expressions:
- Any single character or series of characters can be used to match itself or themselves, as in the "admin" example above.
- The caret sign (^) signifies the beginning of a line; the dollar sign ($) signifies the end.
- To literally search for special characters such as the dollar sign, precede them with a backslash (\). For example, \$ searches for $ and not the end of a line.
- The period (.) represents any single character. For example, ad..n stands for five-character entries, the first two being "ad" and the last being "n". The middle two characters can be anything, but there can be only two of them.
- In editors and pagers such as vi and less, any time the regular expression is contained within slashes (for example, /re/), the search is forward through the file. When it is enclosed in question marks (for example, ?re?), the search is backward through the file.
- Square brackets ([]) signify multiple values, and a minus sign (-) indicates a range of values. For example, [0-9] is the same as [0123456789], and [a-z] is the equivalent of a search for any lowercase letter. If the first character of a list is a caret, it matches any character not in the list.
Table 1 illustrates how these matches work in practice.
| Example | Description |
|---|---|
| [abc] | Matches one of "a", "b", or "c" |
| [a-z] | Matches any one lowercase letter from "a" to "z" |
| [A-Z] | Matches any one uppercase letter from "A" to "Z" |
| [0-9] | Matches any one digit from 0 to 9 |
| [^0-9] | Matches any character other than the digits from 0 to 9 |
| [-0-9] | Matches any digit from 0 to 9, or a dash ("-") |
| [0-9-] | Matches any digit from 0 to 9, or a dash ("-") |
| [^-0-9] | Matches any character that is neither a digit from 0 to 9 nor a dash ("-") |
| [a-zA-Z0-9] | Matches any alphabetic or numeric character |
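To make the rules and the table concrete, here is a minimal sketch that runs a few of these expressions through grep. The file name sample.txt and its contents are made up for illustration; they are not part of the examples that follow.

```shell
#!/bin/sh
# Work in a throwaway directory; sample.txt is a hypothetical test file.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' 'admin' 'ad--n' 'sysadmin' 'cost: $5' 'room 42' > sample.txt

grep '^admin' sample.txt      # lines that begin with "admin"
grep 'ad..n' sample.txt       # "ad", any two characters, then "n"
grep '\$' sample.txt          # a literal dollar sign, not end of line
grep '[0-9]' sample.txt       # lines that contain at least one digit
grep '^[^0-9]*$' sample.txt   # lines that contain no digit at all
```

The single quotes keep the shell from touching characters such as $ and [ before grep sees them.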
With this information under your belt, let's look at the utilities.
The grep utility works by searching through each
line of a file (or files) for the first occurrence of a given string. If
that string is found, the line is printed; otherwise, the line is not
printed. The following file, which I'll name "memo," illustrates grep's
usage and results.
To: All Employees
From: Human Resources
In order to better serve the needs of our mass market customers, ABC Publishing is integrating the groups selling to this channel for ABC General Reference and ABC Computer Publishing. This change will allow us to better coordinate our selling and marketing efforts, as well as simplify ABC's relationships with these customers in the areas of customer service, co-op management, and credit and collection. Two national account managers, Ricky Ponting and Greeme Smith, have joined the sales team as a result of these changes.
To achieve this goal, we have also organized the new mass sales group into three distinct teams reporting to our current sales directors, Stephen Fleming and Boris Baker. I have outlined below the national account managers and their respective accounts in each of the teams. We have also hired two new national account managers and a new sales administrator to complete our account coverage. They include:
Sachin Tendulkar, who joins us from XYZ Consumer Electronics as a national account manager covering traditional mass merchants.
Brian Lara, who comes to us via PQR Company and will be responsible for managing our West Coast territory.
Shane Warne, who will become an account administrator for our warehouse clubs business and joins us from DEF division.
Effectively, we have seven new faces on board:
1. RICKY PONTING
2. GREEME SMITH
3. STEPHEN FLEMING
4. BORIS BAKER
5. SACHIN TENDULKAR
6. BRIAN LARA
7. SHANE WARNE
Please join me in welcoming each of our new team members.
As a simple example, to find the lines that have the word "welcoming", the best approach would be to use the following command line:
# grep welcoming memo
Please join me in welcoming each of our new team members.
If you look for the word "market", the results are slightly different, as shown below.
# grep market memo
In order to better serve the needs of our mass market customers, ABC Publishing is integrating the groups selling to this channel for ABC General Reference and ABC Computer Publishing. This change will allow us to better coordinate our selling and marketing efforts, as well as simplify ABC's relationships with these customers in the areas of customer service, co-op management, and credit and collection. Two national account managers, Ricky Ponting and Greeme Smith, have joined the sales team as a result of these changes.
Note that two matches are found: the requested "market", and "marketing". If the words "marketable" or "marketed" had occurred in the file, the utility would have displayed the lines containing those words as well.
Wildcards and metacharacters can be used with grep, and I strongly recommend that you place them inside quotation marks so that the shell passes them to grep unchanged instead of interpreting them itself.
To find all lines that contain a number, use the following:
# grep "[0-9]" memo
1. RICKY PONTING
2. GREEME SMITH
3. STEPHEN FLEMING
4. BORIS BAKER
5. SACHIN TENDULKAR
6. BRIAN LARA
7. SHANE WARNE
To find all lines that contain "the", use this:
# grep the memo
In order to better serve the needs of our mass market customers, ABC Publishing is integrating the groups selling to this channel for ABC General Reference and ABC Computer Publishing. This change will allow us to better coordinate our selling and marketing efforts, as well as simplify ABC's relationships with these customers in the areas of customer service, co-op management, and credit and collection. Two national account managers, Ricky Ponting and Greeme Smith, have joined the sales team as a result of these changes.
To achieve this goal, we have also organized the new mass sales group into three distinct teams reporting to our current sales directors, Stephen Fleming and Boris Baker. I have outlined below the national account managers and their respective accounts in each of the teams. We have also hired two new national account managers and a new sales administrator to complete our account coverage. They include:
As you might have noticed, the output contains the word "these", along with exact matches of the word "the".
The grep utility, like almost every other UNIX/Linux utility, is case-sensitive, which means that a completely different result comes from looking for "The" instead of "the".
# grep The memo
To achieve this goal, we have also organized the new mass sales group into three distinct teams reporting to our current sales directors, Stephen Fleming and Boris Baker. I have outlined below the national account managers and their respective accounts in each of the teams. We have also hired two new national account managers and a new sales administrator to complete our account coverage. They include:
If you are seeking a particular word or phrase and don't care about the case, there are two ways to proceed. The first is to look for both "The" and "the" by using square brackets, as shown below:
# grep "[Tt]he" memo
In order to better serve the needs of our mass market customers, ABC Publishing is integrating the groups selling to this channel for ABC General Reference and ABC Computer Publishing. This change will allow us to better coordinate our selling and marketing efforts, as well as simplify ABC's relationships with these customers in the areas of customer service, co-op management, and credit and collection. Two national account managers, Ricky Ponting and Greeme Smith, have joined the sales team as a result of these changes.
To achieve this goal, we have also organized the new mass sales group into three distinct teams reporting to our current sales directors, Stephen Fleming and Boris Baker. I have outlined below the national account managers and their respective accounts in each of the teams. We have also hired two new national account managers and a new sales administrator to complete our account coverage. They include:
The second method is to use the -i option, which
tells grep to ignore case.
# grep -i the memo
In order to better serve the needs of our mass market customers, ABC Publishing is integrating the groups selling to this channel for ABC General Reference and ABC Computer Publishing. This change will allow us to better coordinate our selling and marketing efforts, as well as simplify ABC's relationships with these customers in the areas of customer service, co-op management, and credit and collection. Two national account managers, Ricky Ponting and Greeme Smith, have joined the sales team as a result of these changes.
To achieve this goal, we have also organized the new mass sales group into three distinct teams reporting to our current sales directors, Stephen Fleming and Boris Baker. I have outlined below the national account managers and their respective accounts in each of the teams. We have also hired two new national account managers and a new sales administrator to complete our account coverage. They include:
In addition to -i, there are several other
command-line options to change grep's output. The most relevant are the
following:
- -c -- Suppress normal output; instead, print a count of matching lines for each input file.
- -l -- Suppress normal output; instead, print the name of each input file from which output would have normally been printed.
- -n -- Prefix each line of output with the line number within its input file.
- -v -- Invert the sense of matching -- that is, select lines that don't match the search criteria.
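The four options above can be tried on a small stand-in file. This is only a sketch: the file names notes and other, and their contents, are assumptions made for the demonstration and are not the memo used elsewhere in this article.

```shell
#!/bin/sh
# Hypothetical demo files in a throwaway directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' 'The first line' 'the second line' 'a third line' > notes
printf '%s\n' 'unrelated text' > other

grep -c -i 'the' notes          # count of lines matching "the" in any case
grep -n 'third' notes           # matching line, prefixed with its line number
grep -v -i 'the' notes          # lines that do NOT contain "the"
grep -l -i 'the' notes other    # names of the files that contain a match
```

Options can usually be combined, as with -c and -i here, so a case-insensitive count needs only one command.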
fgrep searches files for a string and prints all
lines that contain that string. Unlike grep, fgrep searches for a fixed string
instead of searching for a pattern that matches an expression. The fgrep
utility can be thought of as a variant of grep with these differences:
- You can search for more than one object at a time.
- Because it does no pattern matching, fgrep can be faster than grep.
- You can't use fgrep to search for regular expressions with patterns.
Suppose you want to pull uppercase names from your earlier memo file. In order to find "STEPHEN" and "BRIAN", you would have to issue two separate grep commands, as shown below:
# grep STEPHEN memo
3. STEPHEN FLEMING
# grep BRIAN memo
6. BRIAN LARA
You can accomplish the same task with just one fgrep command:
# fgrep "STEPHEN
> BRIAN" memo
3. STEPHEN FLEMING
6. BRIAN LARA
Note that a carriage return (pressing Enter inside the quotation marks) is required between entries; the > shown in the example is the shell's continuation prompt, not something you type. Without the carriage return, the search would look for "STEPHEN BRIAN" on each line. With the return, it looks for a match to "STEPHEN" and a match to "BRIAN".
Note also that quotation marks must be used around the targeted text. They keep the multiline text together as a single argument and differentiate it from the filename (or filenames).
Instead of specifying search items on the command line, you can place them
in a file and use the contents of that file to search other files. The
-f option allows you to specify a master file
containing search items for which you search frequently.
For example, imagine a file named "search_items" that contains two search items for which you intend to search:
# cat search_items
STEPHEN
BRIAN
The following command searches for "STEPHEN" and "BRIAN" in our earlier memo file:
# fgrep -f search_items memo
3. STEPHEN FLEMING
6. BRIAN LARA
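On current systems the same fixed-string search is usually spelled grep -F, which POSIX defines as the standard form of fgrep. The sketch below uses a hypothetical file called names (a cut-down stand-in for the memo, with one altered line) to show -e, -f, and the difference between a fixed string and a pattern.

```shell
#!/bin/sh
# Hypothetical stand-in data in a throwaway directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' '3. STEPHEN FLEMING' '6. BRIAN LARA' '13 SHANE WARNE' > names
printf '%s\n' 'STEPHEN' 'BRIAN' > search_items

# Several fixed strings on the command line, one -e per string.
grep -F -e 'STEPHEN' -e 'BRIAN' names

# The same strings read from a file, one per line.
grep -F -f search_items names

# With -F the period is literal; without it, "." matches any character.
grep -F '3.' names    # only the line containing "3."
grep '3.' names       # also matches "13 SHANE WARNE" ("3" followed by a space)
```

Using -e avoids having to type a newline inside the quotation marks.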
egrep is a more powerful version of grep that
allows you to search for more than one object at a time. Objects being
searched for are separated by carriage returns (as with fgrep) or by the
pipe symbol (|).
# egrep "STEPHEN
> BRIAN" memo
3. STEPHEN FLEMING
6. BRIAN LARA
# egrep "STEPHEN|BRIAN" memo
3. STEPHEN FLEMING
6. BRIAN LARA
The two commands above do the same job.
Besides the capacity to search for multiple objects, egrep offers the ability to search for repetitions and groups:
- ? looks for zero repetitions or one repetition of the character that precedes the question mark.
- + looks for one or more repetitions of the character that precedes the plus sign.
- ( ) signifies a group.
For example, imagine that you can't remember whether Brian's surname is "Lara" or "Laras".
# egrep "LARAS?" memo
6. BRIAN LARA
This search produces matches to both "LARA" and "LARAS". The following search is a bit different:
# egrep "STEPHEN+" memo
3. STEPHEN FLEMING
It matches "STEPHEN", "STEPHENN", "STEPHENNN", and so on.
If you are looking for a word plus one of its possible derivatives, include the distinguishing characters of the derivative in parentheses.
# egrep -i "electron(ic)?s" memo
Sachin Tendulkar, who joins us from XYZ Consumer Electronics as a national account manager covering traditional mass merchants.
This finds a match for both "electrons" and "electronics".
To summarize:
- A regular expression followed by + matches one or more occurrences of the regular expression.
- A regular expression followed by ? matches zero or one occurrence of the regular expression.
- Regular expressions separated by | or by a carriage return match strings that are matched by any of the expressions.
- A regular expression can be enclosed in parentheses ( ) for grouping.
- The command-line parameters you can use include -c, -f, -i, -l, -n, and -v.
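The summary can be checked with a short script. It uses grep -E, the POSIX spelling of egrep, and a hypothetical file called names whose lines are invented stand-ins for the memo.

```shell
#!/bin/sh
# Hypothetical stand-in data in a throwaway directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' '3. STEPHEN FLEMING' '6. BRIAN LARA' \
    'XYZ Consumer Electronics' 'free electrons' > names

grep -E 'STEPHEN|BRIAN' names          # alternation: either name
grep -E 'LARAS?' names                 # "LARA" with an optional "S"
grep -E 'STEPHEN+' names               # "STEPHE" followed by one or more "N"
grep -E -i 'electron(ic)?s' names      # group made optional as a whole
```

Note that no spaces surround the pipe symbol: inside the quotation marks a space would become part of the pattern.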
The grep utilities: A real-world example
The grep family of utilities can be used with any system file in text format to find a match in a line. For example, to find the entries in the /etc/passwd file for a user named "root", use the following:
# grep root /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
Because it looks for a match anywhere in each line, grep finds entries for both "root" and "operator" (whose home directory is /root). If you want to find only the entry with the username "root", you can anchor the search to the beginning of the line as follows:
# grep "^root" /etc/passwd
root:x:0:0:root:/root:/bin/bash
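One caveat is that "^root" also matches any username that merely begins with "root". Adding the field delimiter to the pattern tightens the match. The sketch below uses a hypothetical file passwd.sample, including an invented "rootkit" account, so that it does not depend on the contents of the real /etc/passwd.

```shell
#!/bin/sh
# passwd.sample is a made-up file in /etc/passwd format.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'rootkit:x:500:500::/home/rootkit:/bin/bash' \
    'operator:x:11:0:operator:/root:/sbin/nologin' > passwd.sample

grep -c 'root' passwd.sample     # every line mentions "root" somewhere
grep '^root' passwd.sample       # usernames that start with "root"
grep '^root:' passwd.sample      # the username that is exactly "root"
```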
With the cut utility, you can separate columns
that could constitute data fields in a file. The default delimiter is the
tab, and the -f option is used to specify the
desired field.
For example, imagine a text file named "sample" with three columns that look like this:
one     two     three
four    five    six
seven   eight   nine
ten     eleven  twelve
Now, apply the following command:
# cut -f2 sample
This will return:
two
five
eight
eleven
If you change your command like so:
# cut -f1,3 sample
It will return the first and third columns instead:
one     three
four    six
seven   nine
ten     twelve
Several command-line options are available with this command. Besides
-f, you should be familiar with these two:
- -c -- Allows you to specify characters instead of fields.
- -d -- Allows you to specify a delimiter other than the tab.
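Here is a brief sketch of -f, -d, and -c side by side. The tab-separated file is a shortened, made-up version of "sample", and the piped strings are arbitrary test input.

```shell
#!/bin/sh
# Shortened, hypothetical version of the tab-separated "sample" file.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
trap 'rm -rf "$workdir"' EXIT

printf 'one\ttwo\tthree\nfour\tfive\tsix\n' > sample

cut -f2 sample                       # second tab-separated field
cut -f1,3 sample                     # first and third fields (no space after the comma)
printf 'a:b:c\n' | cut -d: -f2       # second colon-separated field
printf 'abcdef\n' | cut -c2-4        # characters 2 through 4
printf 'abcdef\n' | cut -c4-         # character 4 through the end of the line
```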
The ls -l command shows the permissions, number
of links, owner, group, size, date, and filenames of all the files in a
directory -- all separated by white space. If you're not interested in
most of the fields and want to see only the file owner, you can use the
following command:
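# ls -l | cut -d" " -f5
root
562
root
root
root
root
root
root
In this listing, the command displays the file owner, which lands in the fifth space-delimited field, and ignores every other field. Keep in mind that cut treats each single space as a delimiter, so runs of padding spaces produce empty fields and the field number depends on how ls aligns its columns; that is likely why a stray value such as "562" appears in the output.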
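Because cut counts every single space as a delimiter, field numbers shift with the amount of padding. One common workaround, which is not part of the original example, is to squeeze repeated spaces with tr -s before cutting. The sketch below uses a fixed, invented ls -l style line so that the result does not depend on a real directory.

```shell
#!/bin/sh
# A made-up "ls -l" style line with padded columns.
line='-rw-r--r--    1 root     root         1024 Jan  1 12:00 memo'

# Without squeezing, field 3 falls inside the run of padding and is empty.
printf '%s\n' "$line" | cut -d' ' -f3

# After tr -s collapses each run of spaces to one, the owner is field 3.
printf '%s\n' "$line" | tr -s ' ' | cut -d' ' -f3
```

With the spaces squeezed, the owner is the third field regardless of column width.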
If you know the exact position at which the first character of the file
owner begins, you can use the -c option to display
the first character of the file owner. Assuming that it begins with the
16th character, the following command returns the 16th character, the
first letter of the owner's name.
# ls -l | cut -c16
r
r
r
r
r
r
r
If you further assume that most users will use eight characters or fewer for their name, you can use the following command, which takes characters 16 through 24:
# ls -l | cut -c16-24
It will return those entries in the name field.
Now, assume that the name of the file begins with the 55th character, but that it is impossible to determine how many characters it takes up after that because some filenames are considerably longer than others. A solution is to begin with the 55th character and not specify an ending character (meaning that the entire rest of the line is taken), as shown below:
# ls -l | cut -c55-
a.out
cscope-15.5
cscope-15.5.tar
cscope.out
memo
search_items
test.c
test.s
Now, consider another scenario. To obtain a list of all the users on the system, you can pull only the first field from the /etc/passwd file used in an earlier example:
# cut -d":" -f1 /etc/passwd
root
bin
daemon
adm
lp
sync
shutdown
halt
mail
news
uucp
operator
To collect the usernames and their corresponding home directories, you can pull the first and sixth fields:
# cut -d":" -f1,6 /etc/passwd
root:/root
bin:/bin
daemon:/sbin
adm:/var/adm
lp:/var/spool/lpd
sync:/sbin
shutdown:/sbin
halt:/sbin
mail:/var/spool/mail
news:/etc/news
uucp:/var/spool/uucp
operator:/root
The paste utility combines fields from files. It
takes one line from one source and combines it with another line from
another source.
For example, imagine that the content of a file named "fileone" is:
IBM
Global
Services
In addition, you have "filetwo" with this content:
United States
United Kingdom
India
The following command combines the contents of these files, as shown below:
# paste fileone filetwo
IBM       United States
Global    United Kingdom
Services  India
If there were more lines in fileone than filetwo, then the pasting would continue, with blank entries following the tab.
The tab character is the default delimiter, but you can change it to
anything else with the -d option.
# paste -d"," fileone filetwo
IBM,United States
Global,United Kingdom
Services,India
You can also use the -s option to output all of
fileone on a line, followed by a carriage return and then filetwo.
# paste -s fileone filetwo
IBM     Global  Services
United States   United Kingdom  India
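The argument to -d is a list of single-character delimiters that paste uses in turn, not one multi-character string. The sketch below recreates fileone and filetwo in a scratch directory and, purely for illustration, pastes fileone a second time as a third column to show the list being cycled.

```shell
#!/bin/sh
# Recreate the two example files in a throwaway directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' 'IBM' 'Global' 'Services' > fileone
printf '%s\n' 'United States' 'United Kingdom' 'India' > filetwo

paste -d, fileone filetwo              # comma instead of tab
paste -d',;' fileone filetwo fileone   # "," then ";" between the three columns
paste -s fileone filetwo               # one output line per input file
```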
join is a greatly enhanced version of
paste. join works
only if the files being joined share a common field.
For example, consider the two files you were using with the paste command previously. Here's what happens when you try to combine them with join:
# join fileone filetwo
Note that there is nothing to display. The join utility must find a common field between the files in question, and by default it expects that common field to be the first.
To see how this works, try adding some new content. Assume that fileone now contains these entries:
aaaa Jurassic Park
bbbb AI
cccc The Ring
dddd The Mummy
eeee Titanic
And filetwo now contains the following:
aaaa Neil 1111
bbbb Steven 2222
cccc Naomi 3333
dddd Brendan 4444
eeee Kate 5555
Now, try that command again:
# join fileone filetwo
aaaa Jurassic Park Neil 1111
bbbb AI Steven 2222
cccc The Ring Naomi 3333
dddd The Mummy Brendan 4444
eeee Titanic Kate 5555
The commonality of the first field was identified, and the matching entries were combined. Whereas paste blindly takes a line from each file to create the output, join combines only lines that match, and the match must be exact. For example, imagine you added a line to filetwo:
aaaa Neil 1111
bbbb Steven 2222
ffff Elisha 6666
cccc Naomi 3333
dddd Brendan 4444
eeee Kate 5555
Now, your command will produce this output:
# join fileone filetwo
aaaa Jurassic Park Neil 1111
bbbb AI Steven 2222
The matching stops early because join expects both files to be sorted on the join field and reads through them in order. The "ffff" line breaks that order in filetwo, so the "cccc", "dddd", and "eeee" lines of fileone are never paired with their counterparts. If matches are found, they are incorporated into the output; otherwise they are not.
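Since join relies on sorted input, sorting a file on its join field first restores the missing matches. The sketch below uses shortened copies of fileone and filetwo; the name filetwo.sorted is an assumption chosen for the example.

```shell
#!/bin/sh
# Shortened copies of the example files in a throwaway directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
trap 'rm -rf "$workdir"' EXIT
LC_ALL=C; export LC_ALL    # same collation for sort and join

printf '%s\n' 'aaaa Jurassic Park' 'bbbb AI' 'cccc The Ring' > fileone
printf '%s\n' 'aaaa Neil 1111' 'bbbb Steven 2222' \
    'ffff Elisha 6666' 'cccc Naomi 3333' > filetwo

# Sort the out-of-order file on its first field, then join.
sort filetwo > filetwo.sorted
join fileone filetwo.sorted
```

The "ffff" line has no partner in fileone, so it is simply left out of the result.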
By default, join looks only at the first fields for matches and outputs all
columns, but you can change this behavior. The
-1 option lets you specify which field to use
as the matching field in fileone, and the -2
option lets you specify which field to use as the matching field in
filetwo.
For example, to match the second field of fileone to the third field of filetwo, use the following syntax:
# join -1 2 -2 3 fileone filetwo
The -o option specifies output in the format
{file.field}. Thus, to print the second field
of fileone and the third field of filetwo on matching lines, the syntax
is:
# join -o 1.2 -o 2.3 fileone filetwo
The most obvious way you could use join in the real world would be to pull the username and the corresponding home directory from the /etc/passwd file and the group name from the /etc/group file. Groups appear in the fourth field in numerical format in the /etc/passwd file. Similarly, they appear in the third field in the /etc/group file.
# join -1 4 -2 3 -o 1.1 -o 2.1 -o 1.6 -t":" /etc/passwd /etc/group
root:root:/root
bin:bin:/bin
daemon:daemon:/sbin
adm:adm:/var/adm
lp:lp:/var/spool/lpd
nobody:nobody:/
vcsa:vcsa:/dev
rpm:rpm:/var/lib/rpm
nscd:nscd:/
ident:ident:/home/ident
netdump:netdump:/var/crash
sshd:sshd:/var/empty/sshd
rpc:rpc:/
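To experiment with these options without touching the real system files, the same command can be run against small copies. The files passwd.sample and group.sample below are hypothetical two-line extracts, already ordered on their join fields.

```shell
#!/bin/sh
# Made-up extracts in /etc/passwd and /etc/group format.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' \
    'bin:x:1:1:bin:/bin:/sbin/nologin' > passwd.sample
printf '%s\n' 'root:x:0:' 'bin:x:1:' > group.sample

# Join field 4 of the first file (GID) with field 3 of the second (GID),
# and print username, group name, and home directory.
join -t: -1 4 -2 3 -o 1.1,2.1,1.6 passwd.sample group.sample
```

Here the three output fields are given as one comma-separated list after a single -o, an alternative to repeating the option.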
awk is one of the most powerful utilities in
Linux. It is actually a programming language in and of itself and can be
used with complex logic statements, as well as to simply pull out snippets
of text. We'll skip the details, but let's quickly review the syntax and
then walk through some real-world examples.
An awk command consists of a pattern and an action composed of one or more statements, as shown in the syntax below:
awk '/pattern/ {action}' file
Notice that:
- awk tests every record in the specified file (or files) for a pattern match. If a match is found, the specified action is performed.
- awk can act as a filter in a pipeline or take input from the keyboard (standard input) if no file or files are specified.
One useful action is to print the data! Here is how to reference fields in a record.
- $0 -- The entire record
- $1 -- The first field in the record
- $2 -- The second field in the record
You can also pull multiple fields in a record, separating each field by a comma.
For example, to pull the sixth field from the /etc/passwd file, the command is:
# awk -F: '{print $6}' /etc/passwd
/root
/bin
/sbin
/var/adm
/var/spool/lpd
/sbin
/sbin
/sbin
/var/spool/mail
/etc/news
/var/spool/uucp
Note that the -F option sets the input field separator,
which awk stores in the predefined variable FS. By default, FS is white
space; in this case, it is set to a colon.
To pull the first and sixth fields from the /etc/passwd file, the command is:
# awk -F: '{print $1,$6}' /etc/passwd
root /root
bin /bin
daemon /sbin
adm /var/adm
lp /var/spool/lpd
sync /sbin
shutdown /sbin
halt /sbin
mail /var/spool/mail
news /etc/news
uucp /var/spool/uucp
operator /root
To print the first and sixth fields using a dash in place of the colon delimiter between fields, the command is:
# awk -F: '{OFS="-"}{print $1,$6}' /etc/passwd
root-/root
bin-/bin
daemon-/sbin
adm-/var/adm
lp-/var/spool/lpd
sync-/sbin
shutdown-/sbin
halt-/sbin
mail-/var/spool/mail
news-/etc/news
uucp-/var/spool/uucp
operator-/root
To print the file using a dash between fields, and print only the first and sixth fields in reverse order, the command is:
# awk -F: '{OFS="-"}{print $6,$1}' /etc/passwd
/root-root
/bin-bin
/sbin-daemon
/var/adm-adm
/var/spool/lpd-lp
/sbin-sync
/sbin-shutdown
/sbin-halt
/var/spool/mail-mail
/etc/news-news
/var/spool/uucp-uucp
/root-operator
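The pattern part of the awk syntax is optional in the examples above, but it combines naturally with field references. The following sketch sets OFS once in a BEGIN block, an alternative to assigning it for every record, and then selects records by pattern. The file passwd.sample is a made-up two-line extract.

```shell
#!/bin/sh
# Made-up extract in /etc/passwd format.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' \
    'operator:x:11:0:operator:/root:/sbin/nologin' > passwd.sample

# Set the output field separator once, before any record is read.
awk -F: 'BEGIN { OFS = "-" } { print $1, $6 }' passwd.sample

# Run the action only for records matching a regular expression.
awk -F: '/^root:/ { print $1, $6 }' passwd.sample

# A pattern can also be a comparison on a field: user ID equal to 0.
awk -F: '$3 == 0 { print $1 }' passwd.sample
```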
The head utility prints the first part of each
file (10 lines by default). It reads from standard input if no files are
given, or if given a filename of -.
For example, if you want to extract the first two lines from your memo file, the command is:
# head -2 memo
In order to better serve the needs of our mass market customers, ABC Publishing is integrating the groups selling to this channel for ABC General Reference and ABC Computer Publishing. This change will allow us to better coordinate our selling and marketing efforts, as well as simplify ABC's relationships with these customers in the areas of customer service, co-op management, and credit and collection. Two national account managers, Ricky Ponting and Greeme Smith, have joined the sales team as a result of these changes.
You can specify the number of bytes to display using the
-c option. For example, if you want to read the
first two bytes from the memo file, the command is:
# head -c 2 memo
In
The tail utility prints the last part of each
file (10 lines by default). It reads from standard input if no files are
given, or if given a filename of -.
For example, if you want to extract the last two lines from your earlier memo, the command is:
# tail -2 memo
Please join me in welcoming each of our new team members.
You can specify the number of bytes to display using the
-c option. For example, if you want to read the
last five bytes from the memo file, the command is:
# tail -c 5 memo
ers.
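The line counts can also be given with the -n option, and the two utilities can be chained to pull a single line out of the middle of a file. The file name lines and its contents below are invented for this sketch.

```shell
#!/bin/sh
# Hypothetical four-line file in a throwaway directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' 'line 1' 'line 2' 'line 3' 'line 4' > lines

head -n 2 lines                 # first two lines
tail -n 2 lines                 # last two lines
head -c 4 lines                 # first four bytes
head -n 3 lines | tail -n 1     # the third line only
```

Keep in mind that the newline at the end of a file counts as a byte when you use -c with tail.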
Now you know how to use various utilities to extract data from standard Linux files. Once extracted, that data can be manipulated for viewing and printing or directed into other files or databases. Knowing how to use just this handful of tools can help you spend less time on mundane tasks and become a more efficient administrator.
Harsha Adiga works in the IBM Software Group in Bangalore, India, and is heavily involved in various Linux and open source communities and working groups. His primary focus areas include Linux and UNIX internals, porting, compilers, and code optimization. He has been involved in software development and testing on Linux and UNIX platforms for more than six years.