After sorting a file, you may discover that some lines are duplicated. When this duplicate information is not needed, you can remove it with uniq to save space on disk. The lines of text do not have to be sorted, but remember that uniq compares each line only with the one before it, so it removes only duplicates that appear on consecutive lines; identical lines separated by other text are left alone. The following example illustrates how it works in practice:
Listing 1. Removing duplicate lines with uniq
$ cat happybirthday.txt
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday Dear Tux!
Happy Birthday to You!
$ sort happybirthday.txt
Happy Birthday Dear Tux!
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday to You!
$ sort happybirthday.txt | uniq
Happy Birthday Dear Tux!
Happy Birthday to You!
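To see why Listing 1 sorts the file first, it can help to run uniq on the unsorted text as well. The following is a minimal sketch, assuming a POSIX shell with mktemp available; the temporary file simply recreates the four lines of happybirthday.txt so the example is self-contained, and the variable names are illustrative only.

```shell
# Recreate the four lines of happybirthday.txt in a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  'Happy Birthday to You!' \
  'Happy Birthday to You!' \
  'Happy Birthday Dear Tux!' \
  'Happy Birthday to You!' > "$tmp"

# Without sorting, only the adjacent pair on lines 1 and 2 collapses;
# the last line survives because a different line separates it.
unsorted=$(uniq "$tmp")

# After sorting, all three copies are adjacent and collapse to one.
sorted=$(sort "$tmp" | uniq)

printf 'uniq alone:\n%s\n\nsort | uniq:\n%s\n' "$unsorted" "$sorted"
rm -f "$tmp"
```

Run on its own, uniq leaves three lines, while sort followed by uniq leaves two.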
Be warned that it is a bad idea to use uniq or any other tool to remove duplicate lines from files containing financial or other important data. In such cases, a duplicate line almost always means another transaction for the same amount, and removing it would cause a lot of trouble for the accounting department. Do not do it!
What if you want to display only the unique lines, or only the duplicated ones? The -u (unique) option prints just the lines that are not repeated, and the -d (duplicate) option prints one copy of each line that is repeated, like this:
Listing 2. Using the -u and -d options
$ sort happybirthday.txt | uniq -u
Happy Birthday Dear Tux!
$ sort happybirthday.txt | uniq -d
Happy Birthday to You!
You can also get some statistics out of uniq with the -c option, which prefixes each line with the number of times it occurred:
Listing 3. Using the -c option
$ sort happybirthday.txt | uniq -uc
1 Happy Birthday Dear Tux!
$ sort happybirthday.txt | uniq -dc
3 Happy Birthday to You!
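The counts from -c become more useful once they are ordered. The sketch below is one common way to build a simple frequency table: it feeds the output of uniq -c back into sort -rn so the most frequent line comes first. The sample lines are the same ones used in the listings above; the variable name counts is just for illustration.

```shell
# Count every distinct line, then order by count, highest first.
# sort -rn sorts numerically (in reverse) on the leading number
# that uniq -c places in front of each line.
counts=$(printf '%s\n' \
  'Happy Birthday to You!' \
  'Happy Birthday to You!' \
  'Happy Birthday Dear Tux!' \
  'Happy Birthday to You!' | sort | uniq -c | sort -rn)

printf '%s\n' "$counts"
```

The exact width of the padding in front of each count varies between uniq implementations, so scripts that parse this output should not rely on a fixed column position.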
If uniq compared only full lines it would still be useful, but that is not the end of this command's functionality. Especially handy is its ability to skip a given number of fields (runs of non-blank characters separated by spaces or tabs), using the -f option followed by the number of fields to skip. This is very useful when you are looking at system logs. Quite often, some entries are repeated many times, which makes the logs hard to read. Plain uniq won't help, because every entry begins with a different timestamp. But if you tell it to skip the timestamp fields, your logs suddenly become more manageable. Try uniq -f 3 /var/log/messages and see for yourself.
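If you would rather not experiment on a real log, the effect of -f can be reproduced on a few made-up lines. This is a minimal sketch: the log entries below are hypothetical and only assume the traditional syslog layout, in which the timestamp occupies the first three fields (month, day, and time).

```shell
# Hypothetical syslog-style entries: three timestamp fields,
# followed by the host name and the message.
log=$(mktemp)
printf '%s\n' \
  'Jan 12 10:15:01 tux kernel: eth0: link down' \
  'Jan 12 10:15:02 tux kernel: eth0: link down' \
  'Jan 12 10:15:03 tux kernel: eth0: link down' \
  'Jan 12 10:15:09 tux kernel: eth0: link up' > "$log"

# The timestamps differ, so plain uniq keeps all four lines.
plain=$(uniq "$log")

# Skipping three fields compares only the text after the timestamp,
# so the three "link down" entries collapse into the first one.
skipped=$(uniq -f 3 "$log")

printf '%s\n' "$skipped"
rm -f "$log"
```

Note that uniq keeps the first line of each run, so the timestamp shown is that of the earliest repeated entry.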
Another option, -s, works just like -f but skips a given number of characters instead of fields. You can use -f and -s together; uniq skips fields first, then characters. And what if you want to compare only a fixed number of characters? Try the -w option, which limits the comparison to the given number of characters.
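The -s and -w options are easiest to understand on small inputs. The sketch below uses hypothetical data invented for illustration: a list with a four-character ID prefix for -s, and a list of status lines for -w. It assumes the GNU version of uniq, since -w is not available in every implementation.

```shell
# Hypothetical data: a four-character ID prefix followed by a name.
ids=$(printf '%s\n' 'A01:alice' 'B02:alice' 'C03:bob')

# -s 4 skips the first four characters ("A01:", "B02:", ...), so the
# two alice lines compare equal and only the first one is kept.
by_name=$(printf '%s\n' "$ids" | uniq -s 4)

# Hypothetical status lines that share a three-character code.
codes=$(printf '%s\n' 'ERR disk full' 'ERR net down' 'OK  all good')

# -w 3 compares at most the first three characters of each line,
# so the two ERR lines count as duplicates.
by_prefix=$(printf '%s\n' "$codes" | uniq -w 3)

printf '%s\n\n%s\n' "$by_name" "$by_prefix"
```

In both cases uniq prints the whole of the first line in each run; the skipped or ignored characters affect only the comparison, not the output.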
Questions or comments? I'd love to hear from you -- send mail to jacek@artymiak.com.
Next time, we'll take a look at nl. See you then!
- Download the example file for the code listings above, happybirthday.txt.
- Read the other tips in this series:
  - Get to know your textutils
  - Concatenating files with cat
  - Reading text streams in chunks with head and tail
  - Sorting files with sort and tsort
  - Filtering files with tr
- Find even more info on these useful tools in the GNU text utilities manual (an expanded view of the same TOC lives at MIT, where you can also find a great list of even more useful GNU tools).
- Windows users can find these tools in the Cygwin package.
- Mac OS X users may want to try Fink, which installs a rich UNIX environment under the new Mac OS X.
- Something just not working for you? Try checking the frequently asked questions for GNU textutils.
- Need more introductory info before delving into the tools we've covered here? Try starting with UNIXhelp for users.
- Of course, the classic work in this field is Unix Power Tools, from O'Reilly and Associates (Jerry Peek, Tim O'Reilly and Mike Loukides 1997; ISBN 1-56592-260-3).
- Enchanted with UNIX text utilities (and with UNIX as such)? Then you'll love the classic article by Thomas Scoville, UNIX as Literature.
- A developerWorks article on UNIX utilities as component architecture looks at UNIX from a different perspective.
- Learn how to use two of the most popular text editors available for Linux in these developerWorks tutorials, vi intro -- the cheat sheet method and Living in emacs.
- Find the Linux resource you're looking for in the developerWorks Linux zone.
Jacek Artymiak works as a freelance consultant, developer, and writer. Since 1991 he's been developing software for many commercial and free variants of UNIX and BSD operating systems (AIX, HP-UX, IRIX, Solaris, Linux, FreeBSD, NetBSD, OpenBSD, and others), as well as MS-DOS, Microsoft Windows, Mac OS, and Mac OS X. Jacek specializes in business and financial application development, Web design, network security, computer graphics, animation, and multimedia. He's a prolific writer on technology subjects and the coauthor of "Install, Configure, and Customize Slackware Linux" (Prima Tech, 2000) and "StarOffice for Linux Bible" (IDG Books, 2000). Many of Jacek's software projects can be found at SourceForge. You can learn more about him at his personal Web site and contact him at jacek@artymiak.com.