Tip: Remove duplicate lines with uniq
Get to know your textutils
After sorting, you may discover that some lines are duplicated. Sometimes
this duplicate information is not needed and can be removed to save space
on disk. The lines of text do not have to be sorted, but keep in mind that
uniq compares each line only with the one before it, so it removes
duplicates only when they are consecutive. The following example
illustrates how it works in practice:
Listing 1. Removing duplicate lines with uniq
$ cat happybirthday.txt
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday Dear Tux!
Happy Birthday to You!

$ sort happybirthday.txt
Happy Birthday Dear Tux!
Happy Birthday to You!
Happy Birthday to You!
Happy Birthday to You!

$ sort happybirthday.txt | uniq
Happy Birthday Dear Tux!
Happy Birthday to You!
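To see why the sort step matters, here is a minimal sketch that runs uniq on the unsorted file. The temporary directory and the variable name unsorted are only illustrative; the file contents are the ones from Listing 1.

```shell
# Recreate the sample file from Listing 1 in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'Happy Birthday to You!' \
  'Happy Birthday to You!' \
  'Happy Birthday Dear Tux!' \
  'Happy Birthday to You!' > "$tmpdir/happybirthday.txt"

# Without sorting, only the adjacent pair (lines 1 and 2) is collapsed.
# The last line survives because "Dear Tux!" separates it from the others.
unsorted=$(uniq "$tmpdir/happybirthday.txt")
printf '%s\n' "$unsorted"

rm -r "$tmpdir"
```

Three lines come out instead of two, because the final "to You!" line is not next to the other copies.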
Be warned that it is a bad idea to use
uniq or any other tool
to remove duplicate lines from files containing financial or other
important data. In such cases, a duplicate line almost always means
another transaction for the same amount, and removing it would cause a lot
of trouble for the accounting department. Do not do it!
What if you wanted to make your job a little easier and display, say, only
the unique or duplicate lines? You can do this with the
-u (unique) and -d (duplicate) options, like this:
Listing 2. Using the -u and -d options
$ sort happybirthday.txt | uniq -u
Happy Birthday Dear Tux!

$ sort happybirthday.txt | uniq -d
Happy Birthday to You!
You can also get some statistics out of
uniq with the -c (count) option, which prefixes each line with the number
of times it occurred:
Listing 3. Using the -c option
$ sort happybirthday.txt | uniq -uc
      1 Happy Birthday Dear Tux!

$ sort happybirthday.txt | uniq -dc
      3 Happy Birthday to You!
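A common way to use these counts is to feed them into a second, numeric sort so the most frequent lines come first. The sketch below assumes the same four lines as the sample file and passes them on standard input; the variable name counts is only illustrative.

```shell
# Count how often each distinct line occurs, then order by count, highest first.
counts=$(printf '%s\n' \
  'Happy Birthday to You!' \
  'Happy Birthday to You!' \
  'Happy Birthday Dear Tux!' \
  'Happy Birthday to You!' | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"
```

The width of the padding in front of each count differs between uniq implementations, so strip leading blanks before comparing or parsing the output.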
If uniq compared just full lines, it would still be useful,
but that's not the end of this command's functionality. Especially handy
is its ability to skip the given number of fields, using the
-f option followed by the number of fields to skip. This is
very useful when you are looking at system logs. Quite often, some entries
are duplicated many times, which makes it difficult to look at logs. Using
uniq won't work, because every entry begins with a
different timestamp. But if you tell it to skip all time fields, suddenly
your logs will become more manageable. Try
uniq -f 3 /var/log/messages and see for yourself.
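If you would rather not experiment on a real log, the sketch below shows the same idea on a few made-up entries. The host name tux and the messages are hypothetical; the only assumption that matters is that the timestamp occupies the first three blank-separated fields, as in the traditional syslog format.

```shell
# Four hypothetical syslog-style entries; only the timestamps differ in the first three.
deduped=$(printf '%s\n' \
  'Jan  5 10:15:01 tux kernel: eth0: link down' \
  'Jan  5 10:15:02 tux kernel: eth0: link down' \
  'Jan  5 10:15:04 tux kernel: eth0: link down' \
  'Jan  5 10:15:09 tux kernel: eth0: link up' | uniq -f 3)
# -f 3 ignores month, day, and time, so the three "link down" entries compare equal.
printf '%s\n' "$deduped"
```

uniq keeps the first line of each run, so the surviving "link down" entry carries the earliest timestamp.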
There is also another option,
-s, which works just like
-f but skips the given number of characters. You can use
-f and -s together; uniq skips fields first, then characters. And what if
you wanted to use only a preset number of characters for comparison? Try the
-w option, followed by the number of characters to compare.
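Character-based comparison can be sketched on two small made-up inputs. The sample lines and variable names are hypothetical. Note that -s is standard, while -w (long form --check-chars) is a GNU extension, so the second command assumes GNU uniq or a compatible implementation.

```shell
# -s 3 skips the three-character prefix ("01 ", "02 ", ...) before comparing.
skipped=$(printf '%s\n' \
  '01 apple' \
  '02 apple' \
  '03 pear' | uniq -s 3)
printf '%s\n' "$skipped"

# -w 5 compares only the first five characters of each line (GNU uniq).
limited=$(printf '%s\n' \
  'error: disk full' \
  'error: disk almost full' \
  'warn: fan slow' | uniq -w 5)
printf '%s\n' "$limited"
```

In both cases the first line of each run is the one that is kept, even though the lines in a run are not identical in full.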
Questions or comments? I'd love to hear from you -- send mail to firstname.lastname@example.org.
Next time, we'll take a look at
nl. See you then.