Although it is possible to write advanced sorting applications in Perl or
Awk, doing so may not always be necessary -- and is often a pain. Most
things you'll ever need are equally possible -- and lots easier -- with
the sort command, which can sort lines in more
than one file, merge files, and even check to see if sorting them is
necessary. You can specify sort keys (portions of lines used for
comparisons) or not, in which case sort just
compares whole lines.
So, if you want to sort your password file, you could just use the following. (Note that you cannot redirect the output straight back to the input file: the shell empties that file before sort gets a chance to read it. That's why you need to send the output to a temporary file and then rename that file to /etc/passwd, as shown below.)
Listing 1. Simple sort
$ su -
# sort /etc/passwd > /etc/passwd-new
# mv /etc/passwd-new /etc/passwd
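The same pattern is easier to try on a scratch file than on /etc/passwd itself. The sketch below assumes a throwaway directory from mktemp and a made-up file called names.txt. It repeats the temporary-file approach from Listing 1 and then shows the -o option, which POSIX allows to name one of the input files as the output. LC_ALL=C is set only to pin the comparison order for the example.

```shell
# Scratch directory and file name are illustrative assumptions.
workdir=$(mktemp -d)
printf 'carol\nalice\nbob\n' > "$workdir/names.txt"

# Approach from Listing 1: write to a new file, then rename it.
LC_ALL=C sort "$workdir/names.txt" > "$workdir/names.new"
mv "$workdir/names.new" "$workdir/names.txt"
via_rename=$(cat "$workdir/names.txt")

# Alternative: -o names the output file, which may be the input file.
# GNU sort documents that it reads all input before opening the output.
printf 'carol\nalice\nbob\n' > "$workdir/names.txt"
LC_ALL=C sort -o "$workdir/names.txt" "$workdir/names.txt"
via_o=$(cat "$workdir/names.txt")

printf '%s\n' "$via_o"
rm -rf "$workdir"
```

Both approaches should leave the file holding alice, bob, and carol, one per line.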
Should you want to reverse the order of sorting, use the -r option. You can also suppress printing of
identical lines with the -u option.
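A quick way to see both options at work is to feed sort a few made-up words, one of them repeated. This is only a sketch: the word list is invented, and LC_ALL=C is an assumption that pins plain byte-order comparison so the result does not depend on your locale.

```shell
# Invented sample data with one duplicate line.
words='pear
apple
pear
fig'

# -r reverses the result of the comparison.
reversed=$(printf '%s\n' "$words" | LC_ALL=C sort -r)

# -u prints only one line from each run of identical lines.
unique=$(printf '%s\n' "$words" | LC_ALL=C sort -u)

printf 'reverse order:\n%s\n' "$reversed"
printf 'duplicates suppressed:\n%s\n' "$unique"
```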
A very practical feature of sort is its ability
to sort using field keys. A field is a string of text separated from other
fields with a certain single character. For example, the fields in /etc/passwd are separated with a colon (:). So, if you wanted, you could sort /etc/passwd by the user ID, group ID, comments field,
home directory, or shell. To do this, use the -t
option followed by the character used as the separator, and the -k
option followed by the number of the field where the sort key starts, a
comma, and the number of the last field where the key ends; for example, sort -t : -k
5,5 /etc/passwd sorts the password file by the comment field, which
is the place where full user names like "John Smith" are stored. But
sort -t : -k 3,4 /etc/passwd sorts the same
file using both the user ID and the group ID. If you omit the second
number, sort will assume that the key starts at
the given field and continues to the end of each line. Try this yourself,
and observe the differences. (Keys are compared as text by default, so if
numeric fields such as user IDs come out in the wrong order, add the -n
option, or -g in GNU sort.)
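To experiment with field keys without touching the real password file, you can pipe a few made-up records in /etc/passwd format through sort. In the sketch below, the three accounts and the logins helper are hypothetical. The helper only shortens the output to login names. LC_ALL=C is assumed so that text comparison is plain byte order.

```shell
# Made-up records in /etc/passwd format; no real accounts are involved.
sample='root:x:0:0:Super User:/root:/bin/sh
jsmith:x:1001:100:John Smith:/home/jsmith:/bin/sh
adoe:x:200:100:Alice Doe:/home/adoe:/bin/sh'

# Hypothetical helper: keep only field 1 and join the names on one line.
logins() { cut -d: -f1 | paste -s -d ' ' -; }

# Field 5 (comment) compared as text.
by_comment=$(printf '%s\n' "$sample" | LC_ALL=C sort -t: -k5,5 | logins)

# Field 3 (user ID) compared as text: "1001" sorts before "200".
by_uid_text=$(printf '%s\n' "$sample" | LC_ALL=C sort -t: -k3,3 | logins)

# Field 3 compared numerically.
by_uid_num=$(printf '%s\n' "$sample" | LC_ALL=C sort -t: -k3,3n | logins)

echo "by comment:   $by_comment"
echo "by uid, text: $by_uid_text"
echo "by uid, -n:   $by_uid_num"
```

The second and third results differ because, compared as text, the string 1001 comes before 200.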
Also, note that a whitespace transition is the default separator, so if
fields are already separated by blank characters, you may omit the
-t option altogether. (Notice also that
numbering of fields starts with 1.)
For even finer control, you can use keys and offsets. Offsets are
separated from keys with a dot, as in -k
1.3,5.7, which means that the sort key should start on the third
character of the first field, and end at the seventh character of the
fifth field (offsets too are numbered from 1). When would you need this?
Well, I use it from time to time for sorting Apache logs; the key and
offset notation lets me skip the date fields.
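As a small illustration of the key and offset notation, the sketch below sorts three invented log-like lines by month only, using characters 6 and 7 of the first field. The lines are not real Apache log entries, just stand-ins with a date in front, and LC_ALL=C is assumed to keep the comparison locale-independent.

```shell
# Invented log-like lines: date, method, path.
log='2024-03-15 GET /b
2023-12-01 GET /a
2024-01-20 GET /c'

# Key starts at character 6 of field 1 and ends at character 7 of field 1,
# which is the two-digit month in a YYYY-MM-DD date.
by_month=$(printf '%s\n' "$log" | LC_ALL=C sort -k1.6,1.7)

printf '%s\n' "$by_month"
```

A plain sort of the same lines would put the 2023 entry first; with the offset key, the January entry leads and the December entry comes last.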
Another option to watch out for is -b, which
tells sort to ignore leading blank characters (spaces,
tabs, etc.) and treat the first non-blank character in a field as the
start of the sort key. Also, if you use that option, offsets will be
counted from the first non-blank character (useful when the field
separator is not a blank character and when the fields may contain
strings starting with blank characters).
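The effect of -b is easiest to see with a non-blank separator and a field that begins with a space. The two records below are made up for the purpose, and LC_ALL=C is assumed so that a space compares lower than any letter.

```shell
# Two invented records; the second field of the first one starts with a space.
data='x: b
x:a'

# Without -b the leading space is part of the key, and a space sorts
# before the letter "a" in the C locale.
without_b=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -k2)

# With -b the leading blank is skipped, so "b" is compared with "a".
with_b=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -b -k2)

printf 'without -b:\n%s\n' "$without_b"
printf 'with -b:\n%s\n' "$with_b"
```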
Further modifications of the sorting algorithm are possible with these
options: -d (uses only letters, digits, and
blanks for sort keys), -f (turns off case
recognition and treats lowercase and uppercase characters as identical),
-i (ignores non-printing ASCII characters),
-M (sorts lines using three-letter
abbreviations of month names: JAN, FEB, MAR, ...), -n (compares keys numerically, recognizing an optional leading minus sign, digits, a decimal point, and
thousands separators). These options, as well as -b and -r, can be used as
part of a key definition, in which case they apply to that key only and not
globally, like they do when they are used outside key definitions.
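Three of these options can be compared side by side on tiny made-up inputs. The sketch assumes LC_ALL=C (so month names are the English abbreviations) and a sort that supports -M, which GNU sort does but POSIX does not require.

```shell
# -n: compare as numbers rather than as text.
numeric=$(printf '10\n9\n100\n' | LC_ALL=C sort -n)

# -f: fold lowercase to uppercase before comparing; lines that then
# compare equal fall back to a byte-by-byte comparison of the whole line.
folded=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort -f)

# -M: compare by month-name abbreviation (not in POSIX).
months=$(printf 'Mar\nJan\nFeb\n' | LC_ALL=C sort -M)

printf 'numeric:\n%s\n' "$numeric"
printf 'case folded:\n%s\n' "$folded"
printf 'by month:\n%s\n' "$months"
```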
As an example of options attached to a key, consider:
sort -t: -k 4g,4 -k 3gr,3 /etc/passwd
This sorts the passwd file by group ID and, within each group, by user ID in reverse order.
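The same idea can be tried on invented records rather than the live password file. Note the assumptions in this sketch: the four accounts are made up, the logins helper is hypothetical, and the key options use n where the command above uses g, because -n is specified by POSIX while -g is an extension.

```shell
# Made-up records in /etc/passwd format.
sample='root:x:0:0:root:/root:/bin/sh
alice:x:1001:100:Alice:/home/alice:/bin/sh
bob:x:1002:100:Bob:/home/bob:/bin/sh
carol:x:1003:50:Carol:/home/carol:/bin/sh'

# Hypothetical helper: keep only field 1 and join the names on one line.
logins() { cut -d: -f1 | paste -s -d ' ' -; }

# First key: field 4 (group ID), numeric, ascending.
# Second key: field 3 (user ID), numeric, reversed; r applies to this key only.
result=$(printf '%s\n' "$sample" | LC_ALL=C sort -t: -k 4n,4 -k 3nr,3 | logins)

echo "$result"
```

Groups appear in ascending order (0, 50, 100), and inside group 100 the higher user ID comes first.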
But that's not all that sort is capable of. It can also resolve ties that
happen when the keys you used cannot be used to decide which line is
first. To add hints for resolving ties, add another -k option and follow it with the field and (optional)
offset, using the same notation as the one you used for defining keys; for example,
sort -k 3.4,4.5 -k 7.3,9.4 /etc/passwd sorts
lines using a key that begins at the fourth character of the third field and
ends at the fifth character of the fourth field, and uses a second key, from the third character
of the seventh field to the fourth character of the ninth field, to
resolve ties.
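A simpler, made-up data set shows the tie-breaking more clearly. In this sketch the names and numbers are invented, fields are separated by blanks (so no -t is needed), and LC_ALL=C is assumed.

```shell
# Invented records: surname and a number; two lines tie on the surname.
people='smith 30
jones 25
smith 4'

# One key only: the tie between the two "smith" lines is broken by
# comparing the whole lines as text, so "smith 30" precedes "smith 4".
one_key=$(printf '%s\n' "$people" | LC_ALL=C sort -k1,1)

# Second key added: ties on field 1 are broken by field 2, numerically.
two_keys=$(printf '%s\n' "$people" | LC_ALL=C sort -k1,1 -k2,2n)

printf 'one key:\n%s\n' "$one_key"
printf 'two keys:\n%s\n' "$two_keys"
```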
The last group of options deals with input, output, and temporary files.
For example, the -c option, when used in sort -c < file, checks whether the input file is
already sorted (you can use other options as well), and if it is not, reports
an error. This is handy for making checks before processing large files
that may take a long time to sort. When you use the -u option together with the -c option, it will be interpreted as a request to
also check that there are no two identical lines in the input file.
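In a script, the check is usually made through the exit status: sort -c exits with 0 when the input is in order and with a non-zero status otherwise. The sketch below assumes three made-up files in a scratch directory and a hypothetical helper named check; the diagnostic that sort prints on failure is discarded.

```shell
# Scratch files are illustrative assumptions.
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/sorted.txt"
printf 'b\na\nc\n' > "$workdir/unsorted.txt"
printf 'a\na\nb\n' > "$workdir/dupes.txt"

# Hypothetical helper: run sort with the given options and report the verdict.
check() {
    if LC_ALL=C sort "$@" 2>/dev/null; then
        echo "in order"
    else
        echo "check failed"
    fi
}

r_sorted=$(check -c "$workdir/sorted.txt")
r_unsorted=$(check -c "$workdir/unsorted.txt")
r_dupes=$(check -c "$workdir/dupes.txt")        # duplicates are still in order
r_dupes_u=$(check -c -u "$workdir/dupes.txt")   # -u also rejects repeated lines

echo "sorted.txt:        $r_sorted"
echo "unsorted.txt:      $r_unsorted"
echo "dupes.txt (-c):    $r_dupes"
echo "dupes.txt (-c -u): $r_dupes_u"
rm -rf "$workdir"
```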
Also important when you are processing large files is the -T option used to specify an alternative directory
for temporary files (they are removed after sort finishes work) instead of the default /tmp.
You can use sort to process more than one file
at a time, and there are basically two ways to do it: you can use cat to concatenate them first, as in:
cat file1 file2 file3 | sort > outfile
Or, you could use this command:
sort -m file1 file2 file3 > outfile
There is one condition in the second case: each input file must be
sorted before they are all sent to sort -m
together. That may look like an unnecessary burden, but in fact it
speeds up work and saves precious system resources. Oh, and don't
forget the -m option. You can use the
-u option here to suppress printing of
identical lines.
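Both approaches, and the effect of -u on a merge, can be compared on two small files that are already sorted. The file names and contents below are made up, and LC_ALL=C is assumed so that the files count as sorted in the same order that sort -m expects.

```shell
# Two invented, already-sorted input files in a scratch directory.
workdir=$(mktemp -d)
printf 'a\nc\n' > "$workdir/file1"
printf 'b\nc\n' > "$workdir/file2"

# Concatenate, then sort everything from scratch.
via_cat=$(cat "$workdir/file1" "$workdir/file2" | LC_ALL=C sort)

# Merge only: relies on each input being sorted already.
merged=$(LC_ALL=C sort -m "$workdir/file1" "$workdir/file2")

# Merge and suppress identical lines.
merged_unique=$(LC_ALL=C sort -m -u "$workdir/file1" "$workdir/file2")

printf 'cat | sort:\n%s\n' "$via_cat"
printf 'sort -m:\n%s\n' "$merged"
printf 'sort -m -u:\n%s\n' "$merged_unique"
rm -rf "$workdir"
```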
If you need a more esoteric kind of sort routine, you might want to check
out the tsort command, which performs
a topological sort on a file. The difference between a topological and
standard sort is shown in Listing 2 (you can download happybirthday.txt from Resources).
$ cat happybirthday.txt
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday Dear Tux!
Happy Birthday to You!

$ sort happybirthday.txt
Happy Birthday Dear Tux!
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday to You!

$ tsort happybirthday.txt
Dear
Happy
to
Tux!
Birthday
You!
Of course, that isn't a very useful demonstration of what you'd use
tsort for -- just an illustration of how different the output of the
two commands is.
tsort is generally used for solving a logic problem in which it's
necessary to predict a total order from observed partial orders;
for example (from the tsort info page):
tsort <<EOF
a b c
d
e f
b c d e
EOF
will produce the output
a
b
c
d
e
f
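A more everyday use of the same idea is ordering tasks that depend on one another. In the sketch below, each input pair means "the first item must come before the second"; the step names are hypothetical and were chosen so that only one total order is possible (with looser constraints, tsort may print any order that satisfies them).

```shell
# Hypothetical build steps, written as "before after" pairs.
order=$(printf '%s\n' \
    'configure compile' \
    'compile link' \
    'link install' \
    'configure install' | tsort)

printf '%s\n' "$order"
```

Because the pairs form a single chain, the output lists configure, compile, link, and install in that order.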
Questions or comments? I'd love to hear from you -- send mail to jacek@artymiak.com.
Next time, we'll delve into tr.
Resources
- Download the example file for Listing 2, happybirthday.txt.
- Find even more info on these useful tools in the GNU text utilities manual. (An expanded view of the same TOC lives at MIT, where you can also find this great list of even more useful GNU tools.)
- Windows users can find these tools in the Cygwin package.
- Mac OS X users may want to try Fink, which installs a rich UNIX environment under the sleek new Mac OS X.
- Something just not working for you? Try checking the Frequently asked questions for GNU textutils.
- Need more introductory info before delving into the tools we've covered here? Try starting with UNIXhelp for users.
- Of course, the classic work in this field is Unix Power Tools, from O'Reilly and Associates (Jerry Peek, Tim O'Reilly, and Mike Loukides: 1997; ISBN 1-56592-260-3).
- Lest we forget, the Jargon File has an amusing entry on the topic of sorting.
- Read Jacek's other tips in this developerWorks series:
  - Get to know your textutils
  - Concatenating files with cat
  - Reading text streams in chunks with head and tail
- Find the Linux resource you're looking for in the developerWorks Linux zone.
Jacek Artymiak works as a freelance consultant, developer, and writer. Since 1991 he's been developing software for many commercial and free variants of UNIX and BSD operating systems (AIX, HP-UX, IRIX, Solaris, Linux, FreeBSD, NetBSD, OpenBSD, and others), as well as MS-DOS, Microsoft Windows, Mac OS, and Mac OS X. Jacek specializes in business and financial application development, Web design, network security, computer graphics, animation, and multimedia. He's a prolific writer on technology subjects and the coauthor of "Install, Configure, and Customize Slackware Linux" (Prima Tech, 2000) and "StarOffice for Linux Bible" (IDG Books, 2000). Many of Jacek's software projects can be found at SourceForge. You can learn more about him at his personal Web site and contact him at jacek@artymiak.com.