
Tip: Sorting files with sort and tsort

Get to know your textutils

Jacek Artymiak (jacek@artymiak.com), Freelance author and consultant
Jacek Artymiak works as a freelance consultant, developer, and writer. Since 1991 he's been developing software for many commercial and free variants of UNIX and BSD operating systems (AIX, HP-UX, IRIX, Solaris, Linux, FreeBSD, NetBSD, OpenBSD, and others), as well as MS-DOS, Microsoft Windows, Mac OS, and Mac OS X. Jacek specializes in business and financial application development, Web design, network security, computer graphics, animation, and multimedia. He's a prolific writer on technology subjects and the coauthor of "Install, Configure, and Customize Slackware Linux" (Prima Tech, 2000) and "StarOffice for Linux Bible" (IDG Books, 2000). Many of Jacek's software projects can be found at SourceForge. You can learn more about him at his personal Web site and contact him at jacek@artymiak.com.

Summary:  Save time and headaches by using sort and tsort -- instead of resorting to more complex solutions utilizing Perl or Awk. Jacek Artymiak explains how.

Date:  06 Mar 2003
Level:  Introductory

Although it is possible to write advanced sorting applications in Perl or Awk, doing so may not always be necessary -- and is often a pain. Most things you'll ever need are equally possible -- and lots easier -- with the sort command, which can sort lines in more than one file, merge files, and even check to see if sorting them is necessary. You can specify sort keys (portions of lines used for comparisons) or not, in which case sort just compares whole lines.

So, if you want to sort your password file, you could just use the following. (Note that you cannot redirect the output straight to the input file: the shell truncates that file before sort gets a chance to read it, so its contents are lost. That's why you need to send the output to a temporary file and then rename that file to /etc/passwd -- as shown below.)


Listing 1. Simple sort
				
$ su - 
# sort /etc/passwd > /etc/passwd-new
# mv /etc/passwd-new /etc/passwd
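
The temporary file can also be avoided with the -o option, which POSIX allows to name one of the input files as the output. The following is only a sketch on a scratch file created with mktemp (the file and its contents are made up for illustration), not something to try first on the real /etc/passwd:

```shell
# Work on a scratch file; never experiment on the real /etc/passwd.
tmp=$(mktemp)
printf 'pear\napple\nmango\n' > "$tmp"

# -o names the output file. It may be the same as the input file,
# unlike "sort file > file", where the shell empties the file first.
# LC_ALL=C is an assumption made here to get a predictable byte order.
LC_ALL=C sort -o "$tmp" "$tmp"

result=$(cat "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```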
			

More on sort and tsort

Follow along in the man page by opening the GNU manual's pages on sort operations, or view these options in your man or info pages in a new terminal window by typing man sort or man tsort at the command line.

Should you want to reverse the order of sorting, use the -r option. You can also suppress printing of identical lines with the -u option.
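
To see what these two options do, here is a small sketch on made-up input (the four words are purely illustrative, and LC_ALL=C is assumed so that the order does not depend on your locale):

```shell
# Made-up sample input with one duplicated line.
data=$(printf 'beta\nalpha\nbeta\ngamma\n')

# -r reverses the order; duplicates are kept.
reversed=$(printf '%s\n' "$data" | LC_ALL=C sort -r)

# -u prints only one copy of each set of identical lines.
unique=$(printf '%s\n' "$data" | LC_ALL=C sort -u)

printf 'reversed:\n%s\n' "$reversed"
printf 'unique:\n%s\n' "$unique"
```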

A very practical feature of sort is its ability to sort using field keys. A field is a string of text separated from other fields with a certain single character. For example, the fields in /etc/passwd are separated with a colon (:). So, if you wanted, you could sort /etc/passwd by the user ID, group ID, comments field, home directory, or shell. To do this, use the -t option followed by the character used as the separator, and the -k option followed by the number of the field where the sort key starts, a comma, and the number of the field where the key ends; for example, sort -t : -k 5,5 /etc/passwd sorts the password file by the comment field, which is the place where full user names like "John Smith" are stored. But sort -t : -k 3,4 /etc/passwd sorts the same file using a single key that spans both the user ID and the group ID. If you omit the second number, sort will assume that the key starts at the given field and continues to the end of each line. Try this yourself, and observe the differences. (Keys are compared as text by default, so 100 sorts before 99. For numeric fields such as the user ID, add the -n option, or the -g option of GNU sort for general floating-point values.)
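
If you would rather not experiment on the real password file, the same commands can be tried on a few made-up lines in /etc/passwd format. This is only a sketch: the three accounts are hypothetical, and LC_ALL=C is assumed so that text comparison is done byte by byte.

```shell
# Hypothetical sample data in /etc/passwd format (not a real password file).
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:Super User:/root:/bin/sh
john:x:1000:100:John Smith:/home/john:/bin/sh
alice:x:99:100:Alice Jones:/home/alice:/bin/sh
EOF

# Sort by the comment field (field 5) and keep only the login names.
by_name=$(LC_ALL=C sort -t : -k 5,5 "$sample" | cut -d : -f 1)

# Sort by user ID (field 3), first as text, then as numbers with n.
by_uid_text=$(LC_ALL=C sort -t : -k 3,3 "$sample" | cut -d : -f 1)
by_uid_num=$(LC_ALL=C sort -t : -k 3,3n "$sample" | cut -d : -f 1)

echo "by name:        " $by_name
echo "by uid (text):  " $by_uid_text
echo "by uid (number):" $by_uid_num
rm -f "$sample"
```

As text, the user ID 1000 sorts before 99; with the n modifier the order follows the numeric values.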

Also, note that a transition from non-blank to blank characters is the default separator -- so if fields are already separated by blank characters, you may omit the -t option altogether. (Notice also that numbering of fields starts with 1.)

For even finer control, you can use keys and offsets. Offsets are separated from keys with a dot, as in -k 1.3,5.7, which means that the sort key should start on the third character of the first field, and end at the seventh character of the fifth field (offsets too are numbered from 1). When would you need this? Well, I use it from time to time for sorting Apache logs; the key and offset notation lets me skip the date fields.
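
A short sketch of the key-and-offset notation, using made-up date-stamped entries rather than a real Apache log (the data and the choice of a colon separator are assumptions for the example): the key covers characters 6 and 7 of the first field, that is, the month.

```shell
# Made-up entries: a date, a colon, and a one-letter label.
entries=$(printf '2003-03-06:c\n2001-12-25:a\n2002-07-04:b\n')

# Key starts at character 6 of field 1 and ends at character 7 of field 1,
# so only the month (03, 12, 07) is compared; the year is skipped.
by_month=$(printf '%s\n' "$entries" | LC_ALL=C sort -t : -k 1.6,1.7)

printf '%s\n' "$by_month"
```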

Another option to watch out for is -b, which tells sort to ignore leading blank characters (spaces, tabs, etc.) and treat the first non-blank character of the field as the start of the sort key. Also, if you use that option, offsets will be counted from the first non-blank character (useful when the field separator is not a blank character and when the fields may contain strings starting with blank characters).
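
The effect of -b is easiest to see on fields that start with a varying number of spaces. A minimal sketch, assuming made-up colon-separated data and the C locale (where a space sorts before any letter):

```shell
# Made-up data: field 2 starts with zero, one, or two spaces.
rows=$(printf '1: zebra\n2:apple\n3:  mango\n')

# Without -b the leading spaces are part of the key and decide the order.
plain=$(printf '%s\n' "$rows" | LC_ALL=C sort -t : -k 2,2 | cut -d : -f 1)

# With -b the key starts at the first non-blank character of the field.
no_blanks=$(printf '%s\n' "$rows" | LC_ALL=C sort -b -t : -k 2,2 | cut -d : -f 1)

echo "without -b:" $plain
echo "with -b:   " $no_blanks
```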

Further modifications of the sorting algorithm are possible with these options: -d (use only letters, digits, and blanks for sort keys), -f (turn off case recognition and treat lowercase and uppercase characters as identical), -i (ignore non-printing ASCII characters), -M (sort lines using three-letter abbreviations of month names: JAN, FEB, MAR, ...), -n (compare keys as numbers made of optional leading blanks, an optional minus sign, digits, and optionally a decimal point and thousands separators). These options, as well as -b and -r, can be appended to a key definition, in which case they apply to that key only and not globally, as they do when they are used outside key definitions.

As an example of the use of a key number, consider:

sort -t: -k 4g,4 -k 3gr,3 /etc/passwd

This will sort the passwd file by group ID and within groups by userid, backwards.
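
The same idea can be tried on a few made-up lines. Note that this sketch swaps the g modifier for n, its POSIX counterpart for plain integers; the four-field records are hypothetical and only mimic the first fields of /etc/passwd.

```shell
# Made-up records: login, password placeholder, user ID, group ID.
rows=$(printf 'a:x:10:100\nb:x:20:100\nc:x:5:50\nd:x:7:50\n')

# First key: group ID (field 4), numeric, ascending.
# Second key: user ID (field 3), numeric, reversed; r applies to this key only.
ordered=$(printf '%s\n' "$rows" | LC_ALL=C sort -t : -k 4n,4 -k 3nr,3 | cut -d : -f 1)

echo $ordered
```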

But that's not all that sort is capable of. It can also resolve ties that happen when the keys you used cannot be used to decide which line is first. To add hints for resolving ties, add another -k option and follow it with the field and (optional) offset, using the same notation as the one you used for defining keys; for example, sort -k 3.4,4.5 -k 7.3,9.4 /etc/passwd sorts lines using a key that begins at the fourth character of the third field and ends at the fifth character of the fourth field, and uses a second key, which runs from the third character of the seventh field to the fourth character of the ninth field, to resolve ties.
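
Here is a sketch of what the second key changes, on made-up name-and-number pairs (the data is illustrative; LC_ALL=C is assumed). With one key, lines that tie on the name fall back to a comparison of the whole line as text; with a second, numeric key, the tie is resolved by the number.

```shell
# Made-up data: a name and a number, separated by a comma.
rows=$(printf 'Smith,30\nJones,12\nSmith,4\n')

# One key: the two Smith lines tie, so sort compares the whole lines as text.
one_key=$(printf '%s\n' "$rows" | LC_ALL=C sort -t , -k 1,1)

# Two keys: ties on the name are resolved by field 2, compared numerically.
two_keys=$(printf '%s\n' "$rows" | LC_ALL=C sort -t , -k 1,1 -k 2,2n)

printf 'one key:\n%s\n' "$one_key"
printf 'two keys:\n%s\n' "$two_keys"
```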

The last group of options deals with input, output, and temporary files. For example, the -c option, when used in sort -c < file, checks whether the input file is already sorted (you can use other options as well), and if it is not, reports an error and exits with a nonzero status. This is handy for making checks before processing large files that may take a long time to sort. When you use the -u option together with the -c option, it will be interpreted as a request to also check that there are no two identical lines in the input file.
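
In a script, the exit status of sort -c is usually what you test. A minimal sketch with three made-up scratch files (the file contents and variable names are illustrative only):

```shell
# Three scratch files: sorted, unsorted, and sorted but with a duplicate.
sorted=$(mktemp); unsorted=$(mktemp); dupes=$(mktemp)
printf 'a\nb\nc\n' > "$sorted"
printf 'b\na\nc\n' > "$unsorted"
printf 'a\na\nb\n' > "$dupes"

# sort -c exits with 0 when the input is in order, nonzero otherwise.
if LC_ALL=C sort -c "$sorted" 2>/dev/null; then r1=sorted; else r1=unsorted; fi
if LC_ALL=C sort -c "$unsorted" 2>/dev/null; then r2=sorted; else r2=unsorted; fi

# With -u added, two identical neighboring lines also count as a failure.
if LC_ALL=C sort -c "$dupes" 2>/dev/null; then r3=sorted; else r3=unsorted; fi
if LC_ALL=C sort -c -u "$dupes" 2>/dev/null; then r4=unique; else r4=duplicates; fi

echo "$r1 $r2 $r3 $r4"
rm -f "$sorted" "$unsorted" "$dupes"
```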

Also important when you are processing large files is the -T option used to specify an alternative directory for temporary files (they are removed after sort finishes work) instead of the default /tmp.
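
A sketch of the -T option follows. The scratch directory is created with mktemp -d purely for the example; in practice you would point -T at a directory on a file system with plenty of free space. With input this small sort may never write a temporary file at all, so the example only shows where the option goes.

```shell
# Hypothetical scratch directory for sort's temporary files.
scratch=$(mktemp -d)

# -T tells sort where to put temporary files if it needs any.
result=$(printf '3\n1\n2\n' | LC_ALL=C sort -n -T "$scratch")

printf '%s\n' "$result"
rm -rf "$scratch"
```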

You can use sort to process more than one file at a time, and there are basically two ways to do it: you can use cat to concatenate them first, as in:

cat file1 file2 file3 | sort > outfile

Or, you could use this command:

sort -m file1 file2 file3 > outfile

There is one condition in the second case: each input file must be sorted before they are all sent to sort -m together. That may look like an unnecessary burden, but in fact it speeds up work and saves precious system resources. Oh, and don't forget the -m option. You can use the -u option here to suppress printing of identical lines.
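
A small sketch of merging, assuming two made-up files that are already sorted (the file contents are illustrative, and LC_ALL=C is assumed for a predictable order):

```shell
# Two scratch files, each already sorted on its own.
f1=$(mktemp); f2=$(mktemp)
printf 'apple\ncherry\n' > "$f1"
printf 'banana\ncherry\ndate\n' > "$f2"

# -m merges the sorted inputs without sorting them again.
merged=$(LC_ALL=C sort -m "$f1" "$f2")

# Adding -u drops the second copy of "cherry".
merged_unique=$(LC_ALL=C sort -m -u "$f1" "$f2")

printf 'merged:\n%s\n' "$merged"
printf 'merged, unique:\n%s\n' "$merged_unique"
rm -f "$f1" "$f2"
```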

If you need a more esoteric kind of sort routine, you might want to check out the tsort command, which performs a topological sort on a file. The difference between a topological and standard sort is shown in Listing 2 (you can download happybirthday.txt from Resources).

$ cat happybirthday.txt
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday Dear Tux!
Happy Birthday to You!

$ sort happybirthday.txt
Happy Birthday Dear Tux!
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday to You!

$ tsort happybirthday.txt
Dear
Happy
to
Tux!
Birthday
You!

Of course, that isn't a very useful demonstration of what you'd use tsort for -- just an illustration of how different the output of the two commands is.

tsort is generally used for solving a logic problem in which it's necessary to predict a total order from observed partial orders; for example, (from the tsort info page):

tsort <<EOF
a b c
d
e f
b c d e
EOF

will produce the output

      a
      b
      c
      d
      e
      f
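
A more down-to-earth sketch of the same idea: tsort reads its input as pairs of words, each pair meaning "the first comes before the second". The step names below are made up for illustration, and the pairs are deliberately given out of order.

```shell
# Each pair says: the first step must happen before the second one.
# The step names are hypothetical.
order=$(printf 'test install\nconfigure compile\ncompile test\n' | tsort)

printf '%s\n' "$order"
```

Because these pairs allow only one valid ordering, tsort prints configure, compile, test, and install, one per line. When several orderings are valid, tsort prints one of them.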

Questions or comments? I'd love to hear from you -- send mail to jacek@artymiak.com.

Next time, we'll delve into tr.


Resources
