Text processing with UNIX

Tools for translating and manipulating text

The origin of UNIX® lies in simple text processing, and its command-line environment remains one of the most powerful text processing tools available. By combining a series of simple commands to make up a complex text transformation, the tools available from UNIX let you build nearly any text processing engine you could need.

Chris Herborth (chrish@pobox.com), Freelance, Freelance Writer

Chris Herborth is an award-winning senior technical writer with more than 10 years of experience writing about operating systems and programming. When he's not playing with his son Alex or hanging out with his wife Lynette, Chris spends his spare time designing, writing, and researching (that is, playing) video games.



01 August 2006


Introduction

Back in the early days of UNIX®, the folks messing around with this new operating system quickly found a niche to fill: people at universities needed a decent text processing environment. Because of the limited processing speed and memory of computers in those days, programs had to be small and relatively simple. This led to UNIX's famous design philosophy: "a suite of tools working together to get a job done." By combining several small but powerful text processing tools through UNIX pipes, you can transform and manipulate text in a myriad of ways.

In this article, you'll take a quick look at getting text from files and programs, simple transliterations using the tr command, and complex search and replace actions using the sed command. Then you'll do it all again using the Perl programming and scripting language so that you can see how Perl can act as a powerful replacement for both the tr and sed commands.

Before you start

If you'd like to follow along and experiment with the examples in this article, make sure you've got access to a UNIX command-line environment. This could be on your local machine through a terminal emulator (often called Terminal on modern desktops; if you're stuck using Windows®, Cygwin will do nicely), or on a remote system accessed through SSH.

The shell syntax used in the examples here is suitable for GNU Bash; please refer to your shell's manual for the specific syntax you'll need to use (or consider switching to Bash, I suppose).

Getting the text rolling

Before you can start manipulating text using a few of UNIX's plethora of text utilities, you need to know how to get at some text. And before you do that, you'll need to understand UNIX's standard input/output (I/O) streams.

The standard C library (and thus, every UNIX program) defines three standard streams: input, output, and error. These are sometimes called stdin, stdout, and stderr after the global variables that represent them in every C program.

When you redirect a program's output to a file using the > operator in a shell, you're sending its standard output (stdout) stream to the file. For example: ls > this-dir sends the output from ls to a file named this-dir.

When you redirect a program's input from a file using the < operator in a shell, you're pulling the file's contents into its standard input (stdin) stream. For example: sort < this-dir reads the contents of a file named this-dir and provides it as input to the sort command.

The other common operator for redirecting the standard streams is the | (pipe) operator, which connects the standard output stream of the program on the left hand side to the standard input stream of the program on the right. For example: ls | sort does the same thing as the previous two examples without requiring a temporary file; the output from ls goes straight through the sort command.
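
The three operators can be tried side by side in a scratch directory. This is only a minimal sketch: the directory comes from mktemp, the file names a.txt and b.txt are made up for illustration, and LC_ALL=C is set just to pin down the sort order.

```shell
# Work in a throwaway directory so the listing is predictable.
workdir=$(mktemp -d)
touch "$workdir/a.txt" "$workdir/b.txt"

# > sends ls's standard output to a file. The shell creates this-dir
# before ls runs, so the file ends up listing itself as well.
ls "$workdir" > "$workdir/this-dir"

# < feeds the file's contents to sort's standard input.
from_file=$(LC_ALL=C sort -r < "$workdir/this-dir")

# | connects ls straight to sort, with no temporary file in between.
from_pipe=$(ls "$workdir" | LC_ALL=C sort -r)

echo "$from_file"
rm -r "$workdir"
```

Both variables should hold the same reverse-sorted listing, which is the point of the pipe: same result, one less file to clean up.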

If you've been paying attention, you've probably noticed that the standard error (stderr) stream isn't represented in any of those examples. Like the standard output stream, stderr can be redirected or piped, but you need to tell the shell that you want to deal with stderr instead of stdout.

Redirect the standard error stream to a file using the 2> operator. You'll see this most often when dealing with commands that have useful error output, such as the make tool used to build UNIX programs: make 2> build-errors.

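That command runs make and sends any error messages to the build-errors file. There is no matching 2| operator, though; to pipe stderr to another program, first tie it to stdout with 2>&1 (described next) and then use an ordinary pipe, as in make 2>&1 | less, which sends both streams through the pipe.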

If you're interested in the gritty details, the other streams have numbers as well, although they're almost never used (0 is standard input, and 1 is standard output), except in one surprisingly common operator. In the example shown in Listing 1, the 2>&1 operator ties the standard error stream to the standard output stream. Combined with the > operator, you'll get stderr and stdout in the same file.

Listing 1. Tying the standard error stream to the standard output stream
make > build-output 2>&1
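
To watch the streams being separated and recombined without needing a real build, here is a small sketch. The noisy function is a hypothetical stand-in for a command such as make, and the file names under the mktemp directory are likewise just for illustration.

```shell
# noisy is a hypothetical stand-in for a command such as make:
# it writes one line to stdout and one line to stderr.
noisy() {
    echo "normal output"
    echo "error output" >&2
}

workdir=$(mktemp -d)

# > captures stdout and 2> captures stderr, each in its own file.
noisy > "$workdir/out" 2> "$workdir/err"

# > file 2>&1 sends both streams to the same file.
noisy > "$workdir/both" 2>&1

# 2>&1 in front of a pipe sends both streams down the pipe.
piped_lines=$(noisy 2>&1 | wc -l)

cat "$workdir/err"    # only the error line
cat "$workdir/both"   # both lines
```

Note the order in the second command: the > redirection comes first, and 2>&1 then points stderr at wherever stdout is already going.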

Commands, finally

There are two standard UNIX commands that are often used to generate some textual output: cat and echo.

The cat command reads each of the files specified in its arguments and writes the content of the files to stdout. The echo command writes its arguments to stdout. You'll often find them as part of a more complex command pipeline (see Listing 2).

Listing 2. Using cat and echo
cat file1 file2 ... filen
echo arguments...

But what if you just want the first part of a file, or the last part? There are cat variants called head and tail (see Listing 3) that will do what you want, printing the first ten lines or last ten lines, respectively (you can specify a different number of lines using the -n option to either one).

Listing 3. Using head and tail
head file1 file2 ... filen
tail file1 file2 ... filen

The tail command has another handy option, -f (follow). This tells tail to print the last ten lines of the specified file but, instead of exiting, it waits for more text to appear in the file and prints it as it appears. You can use this to follow the output in an error log, for example, to see what errors are appearing as they're written to the log.
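
The -n option is easy to try on a generated file. In this sketch the 20-line sample file is an assumption made up for the example; any text file would do.

```shell
# Generate a 20-line sample file with a plain loop.
sample=$(mktemp)
i=1
while [ "$i" -le 20 ]; do
    echo "line $i"
    i=$((i + 1))
done > "$sample"

first_three=$(head -n 3 "$sample")     # lines 1 through 3
last_two=$(tail -n 2 "$sample")        # lines 19 and 20
default_count=$(head "$sample" | wc -l) # ten lines when -n is omitted

echo "$first_three"
echo "$last_two"
```

The tail -f form is left out here on purpose, since it never exits on its own; you stop it with Ctrl-C.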

Translating text

Now that you know at least five different ways of generating some text, let's look at doing some simple translations on it.

The tr command lets you translate characters in one set to the corresponding characters in a second set. Let's take a look at a few examples (Listing 4) to see how it works.

Listing 4. Using tr to translate characters
echo "a test" | tr t p
echo "a test" | tr aest 1234
echo "a test" | tr -d t
echo "a test" | tr '[:lower:]' '[:upper:]'

Looking at the output of these commands (see Listing 5) gives you a clue about how tr works (here's a hint: it's a direct replacement of characters in the first set with the corresponding characters from the second set).

Listing 5. What has tr done?
chrish@dhcp3 [199]$ echo "a test" | tr t p
a pesp

chrish@dhcp3 [200]$ echo "a test" | tr aest 1234
1 4234

chrish@dhcp3 [201]$ echo "a test" | tr -d t
a es

chrish@dhcp3 [202]$ echo "a test" | tr '[:lower:]' '[:upper:]'
A TEST

The first and second examples are simple enough, replacing one character with another. The third example, with the -d option (delete), removes the specified characters completely from the output. This is often used to remove carriage returns from DOS text files to turn them into UNIX text files (see Listing 6). Finally, the last example uses character classes (those names inside of [: :]) to convert all lower-case letters into upper-case letters. Standard Portable Operating System Interface (POSIX) character classes include:

  • alnum: alphanumeric characters
  • alpha: alphabetic characters
  • cntrl: control (non-printing) characters
  • digit: numeric characters
  • graph: graphic characters
  • lower: lower-case alphabetic characters
  • print: printable characters
  • punct: punctuation characters
  • space: whitespace characters
  • upper: upper-case characters
  • xdigit: hexadecimal digits

Listing 6. Converting DOS text files into UNIX text files
tr -d '\r' < input_dos_file.txt > output_unix_file.txt

Although the tr command respects C locale environment variables (try man locale for more information about these), don't expect it to do anything sensible with UTF-8 documents, such as being able to replace lower-case accented characters with appropriate upper-case characters. The tr command works best with ASCII and the other standard C locales.
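
To convince yourself that the DOS conversion really removes the carriage returns, you can count bytes before and after. This sketch uses printf to fake two DOS-style lines instead of a real file, and adds one more character class for comparison; the sample strings are made up.

```shell
# Two DOS-style lines: each ends in carriage return + line feed.
dos_bytes=$(printf 'first line\r\nsecond line\r\n' | wc -c)
unix_bytes=$(printf 'first line\r\nsecond line\r\n' | tr -d '\r' | wc -c)
unix_text=$(printf 'first line\r\nsecond line\r\n' | tr -d '\r')

echo "$unix_text"
echo "$((dos_bytes - unix_bytes)) carriage returns removed"

# -d works with the named classes too; this strips punctuation.
no_punct=$(echo "Hello, world!" | tr -d '[:punct:]')
echo "$no_punct"   # Hello world
```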

Complex search and replace with sed

The single-character replacement (or removal) abilities provided by the tr command are great in specific situations, but aren't tremendously flexible. What if you need to replace one word with another, or a series of spaces and tabs with a single space?

Luckily, you have the sed command (Stream EDitor), which provides powerful regular expression matching and replacement. Regular expressions are complex pattern specifications built using building blocks that end up looking more like modem line noise as the pattern gets more complex. A detailed tutorial on regular expressions is something for another article, but you'll take a quick look at some handy patterns for use with sed here.

You can see the basic format of a sed command in Listing 7. Pattern is the regular expression used to match against the input (usually either piped in from another program, or redirected from a text file), and replacement is the text to insert in place of the text matched by the pattern. The flags are single characters that control the substitution's behavior. The most commonly-used flag is g (apply the replacement to all non-overlapping instances that match the pattern instead of just the first match).

The pattern and replacement can be practically anything, and they don't need to have a 1:1 relationship like they do with the tr command.

Listing 7. The sed command
sed -e s/pattern/replacement/flags

The simplest pattern is just a string of one or more characters. Listing 8, for example, replaces one word with another.

Listing 8. The easiest regular expression
chrish@dhcp3 [334]$ echo "Replace one word" | sed -e s/one/another/
Replace another word

You can enclose one or more characters in square brackets to create a set; any character in the set will match. Let's change all the vowels into underscores in Listing 9.

Listing 9. Matching any of a set
chrish@dhcp3 [338]$ echo "This is a test" | sed -e s/[aeiouy]/_/g
Th_s _s _ t_st

Note the use of the g flag so that you apply the pattern/replacement to every match instead of just the first one.
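
Running the same substitution with and without the flag makes the difference obvious. This is a small sketch using the same input string; the pattern is quoted here so the shell leaves the brackets alone.

```shell
# Without g, only the first vowel on the line is replaced.
first_only=$(echo "This is a test" | sed -e 's/[aeiouy]/_/')

# With g, every vowel on the line is replaced.
every_match=$(echo "This is a test" | sed -e 's/[aeiouy]/_/g')

echo "$first_only"    # Th_s is a test
echo "$every_match"   # Th_s _s _ t_st
```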

The sed command also knows about the named character classes that the tr command supports; these are defined by POSIX, but the syntax here is a little different. Listing 10 shows you how to replace any whitespace (tabs, spaces, and so forth):

Listing 10. Matching anything from a named character class
chrish@dhcp3 [345]$ echo -e 'hello\tthere'   
hello   there
chrish@dhcp3 [346]$ echo -e 'hello\tthere' | sed -e 's/[[:space:]]/, /'
hello, there

The -e flag to the echo command tells it to expand C-style escaped characters; in this case, it's going to turn \t into a tab character for you.

You can also use "." (a period) to match any single character. This is really handy if you're dealing with data that varies slightly, or data that has special characters that would be awkward to escape. For example, I often use a . when I'm matching quotes, so I don't have to escape the quotes in the shell. Listing 11 shows you an accident a new regular expression user could make using this pattern.

Listing 11. This probably isn't what you wanted
chrish@dhcp3 [339]$ echo "This is a test" | sed -e s/./_/g
______________

Now that you've seen the very basics, there are a few additional pattern modifiers. You're also going to start using the -E option now (the examples drop -e, which is optional when there is only one expression) in order to use extended regular expressions; some older versions of GNU sed spell this option -r. The ? character means to match zero or one instance of the previous pattern element; the * character means to match zero or more of the previous element. The + character means to match one or more of the previous element. The ^ character matches the start of a line and $ matches the end of a line. You can see this in action, as shown in Listing 12.

Listing 12. Multiple matches in action
chrish@dhcp3 [356]$ echo "hellooooo" | sed -E 's/o?$/_/g'
helloooo_
chrish@dhcp3 [357]$ echo "hellooooo" | sed -E 's/o*$/_/g'
hell_
chrish@dhcp3 [358]$ echo "hellooooo" | sed -E 's/o+$/_/g'
hell_
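
These modifiers also answer the earlier question about turning a series of spaces and tabs into a single space. The following is a sketch, assuming a sed that accepts -E; the sample strings are made up for the example.

```shell
# A line with a mix of spaces and tabs between the words.
messy=$(printf 'one \t  two\t\tthree')

# + matches each whole run of whitespace, so every run becomes one space.
collapsed=$(printf '%s\n' "$messy" | sed -E 's/[[:space:]]+/ /g')
echo "$collapsed"   # one two three

# ^ and $ anchor the match, which is handy for trimming both ends.
trimmed=$(echo "   padded text   " | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//')
echo "[$trimmed]"   # [padded text]
```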

If you wrap pattern elements in parentheses, you can use the matched contents in the replacement string. These are called groups, and they make regular expression search and replace operations very powerful, and rather hard to read. For example, in Listing 13, you match one or more l (el) characters followed by zero or more o characters. They're replaced with the contents of the second group and then the first group, effectively swapping them. Note how you refer to the groups as a backslash followed by the number of the group in the pattern.

Listing 13. Match groups
chrish@dhcp3 [361]$ echo "hellooooo" | sed -E 's/(l+)(o*)$/\2\1/g'
heoooooll

You can match a specific number of patterns by specifying the number of matches in braces. For example, the pattern o{2} would match two (and only two) o characters.
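
Here is a quick sketch of the brace syntax on the same test word, again assuming a sed that accepts -E. It shows how the count interacts with the g flag and with the $ anchor.

```shell
# Replace the first pair of o characters.
first_pair=$(echo "hellooooo" | sed -E 's/o{2}/_/')

# Replace every non-overlapping pair; the fifth o has no partner and stays.
every_pair=$(echo "hellooooo" | sed -E 's/o{2}/_/g')

# Anchor the pair to the end of the line.
last_pair=$(echo "hellooooo" | sed -E 's/o{2}$/_/')

echo "$first_pair"   # hell_ooo
echo "$every_pair"   # hell__o
echo "$last_pair"    # hellooo_
```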

Oh, and one last thing; you can use any of these special characters literally (that is, as themselves) in a pattern by escaping them using the \ character.
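
A short sketch shows why the escape matters; the input strings are made up for the example. Without the backslash, the period and the square brackets keep their special meanings.

```shell
# Unescaped, . matches any character, so the first character is replaced.
unescaped=$(echo "3.14" | sed -e 's/./,/')

# Escaped, \. matches only a literal period.
escaped=$(echo "3.14" | sed -e 's/\./,/')

# The same goes for square brackets, which would otherwise start a set.
indexed=$(echo "a[0]" | sed -e 's/\[0\]/[1]/')

echo "$unescaped"   # ,.14
echo "$escaped"     # 3,14
echo "$indexed"     # a[1]
```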

Putting it together

Now that you've been exposed to some really simple regular expressions, let's try something useful. Given the output of ls -l (the long listing of files), you'll pull out the permission information, size, and name. Listing 14 shows some sample ls -l output for you to work with.

Listing 14. Typical ls -l output
chrish@dhcp3 [365]$ ls -l | tail
drwx------   3 chrish    wheel   102 Jun 14 21:38 gsrvdir501
drwxr-xr-x   2 chrish    wheel    68 Jun 16 16:01 hsperfdata_chrish
drwxr-xr-x   3 root      wheel   102 Jun 14 23:38 hsperfdata_root
-rw-r--r--   1 root      wheel   531 Jun 14 10:17 illustrator_activation.plist
-rw-r--r--   1 root      wheel   531 Jun 14 10:10 indesign_activation.plist
-rw-------   1 nobody    wheel    24 Jun 16 16:01 objc_sharing_ppc_4294967294
-rw-------   1 chrish    wheel   132 Jun 16 23:50 objc_sharing_ppc_501
-rw-------   1 security  wheel    24 Jun 16 10:04 objc_sharing_ppc_92
-rw-r--r--   1 root      wheel   531 Jun 14 10:05 photoshop_activation.plist
-rw-r--r--   1 root      wheel   928 Jun 14 10:17 serialinfo.plist

As you can see, there are seven columns here:

  • Permissions
  • Number of links
  • Owner
  • Group
  • Size
  • Last modification time
  • Name

Let's make up some regular expressions to match each of these:

  • .([r-][w-][x-]){3} -- permissions (Use . to match the first character, because it can be any of several different special characters.)
  • [[:digit:]]+ -- number of links
  • [A-Za-z0-9_\-\.]+ -- owner (You can also use this for matching the group.)
  • [[:digit:]]+ -- size
  • .{3} [0-9 ]{2} [0-9 ][0-9]:[0-9][0-9] -- modification time (You could simplify this a bit, since all of the files were modified in June, and you could make it more exact by specifying the month names.)
  • .+$ -- name (After everything else, you'll match all the characters up to the end of the line.)

In between, you'll have to join these patterns with [[:space:]]+, since you have no idea whether the columns are separated by spaces, tabs, or a combination. You'll also want to put the permissions, size, and name into groups so that you can use them in the replacement. As you can see in Listing 15, regular expressions quickly become difficult to read.

Listing 15. The completed regular expression. Shield your eyes!
(.([r-][w-][x-]){3})[[:space:]]+[[:digit:]]+[[:space:]]+([A-Za-z0-9_\-\.]+[[:space:]]+){2}([[:digit:]]+)[[:space:]]+.{3} [0-9 ]{2} [0-9 ][0-9]:[0-9][0-9][[:space:]]+(.+)$

If you look carefully at that monster regular expression pattern, you'll discover five groups:

  1. The entire permissions block
  2. The last-matched rwx group in the permissions block
  3. Group (the last thing matched in the owner/group part of the pattern)
  4. Size
  5. Name

In Listing 16, you'll change the ls -l output to show the filename, permissions, and size.

Listing 16. Rearranged output
chrish@dhcp3 [382]$ ls -l | tail | sed -E 's/(.([r-][w-][x-]){3})[[:space:]]+[[:digit:]]+[[:space:]]+([A-Za-z0-9_\-\.]+[[:space:]]+){2}([[:digit:]]+)[[:space:]]+.{3} [0-9 ]{2} [0-9 ][0-9]:[0-9][0-9][[:space:]]+(.+)$/\5 (\1) has \4 bytes of data/'
gsrvdir501 (drwx------) has 102 bytes of data
hsperfdata_chrish (drwxr-xr-x) has 68 bytes of data
hsperfdata_root (drwxr-xr-x) has 102 bytes of data
illustrator_activation.plist (-rw-r--r--) has 531 bytes of data
indesign_activation.plist (-rw-r--r--) has 531 bytes of data
objc_sharing_ppc_4294967294 (-rw-------) has 24 bytes of data
objc_sharing_ppc_501 (-rw-------) has 132 bytes of data
objc_sharing_ppc_92 (-rw-------) has 24 bytes of data
photoshop_activation.plist (-rw-r--r--) has 531 bytes of data
serialinfo.plist (-rw-r--r--) has 928 bytes of data

Victory! You've completely transformed the output.
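
One way to keep an expression like this readable is to assemble it from named pieces in shell variables. The following is only a sketch of that idea, not how the listing above was built: the variable names are made up, the input is a single line copied from the sample ls -l output, and the owner/group set is written in the simpler form [A-Za-z0-9_.-], which matches the same names in the sample.

```shell
# One line of sample input, copied from the ls -l output shown earlier.
line='drwxr-xr-x   2 chrish    wheel    68 Jun 16 16:01 hsperfdata_chrish'

# One variable per column pattern (the names are illustrative).
ws='[[:space:]]+'
perms='(.([r-][w-][x-]){3})'                    # groups 1 and 2
links='[[:digit:]]+'
owner_group="([A-Za-z0-9_.-]+${ws}){2}"         # group 3, matched twice
size='([[:digit:]]+)'                           # group 4
mtime='.{3} [0-9 ]{2} [0-9 ][0-9]:[0-9][0-9]'
name='(.+)$'                                    # group 5

# Join the columns with the whitespace pattern between them.
pattern="${perms}${ws}${links}${ws}${owner_group}${size}${ws}${mtime}${ws}${name}"

summary=$(echo "$line" | sed -E "s/${pattern}/\5 (\1) has \4 bytes of data/")
echo "$summary"   # hsperfdata_chrish (drwxr-xr-x) has 68 bytes of data
```

The owner/group piece already consumes its trailing whitespace, so no separate separator is needed between it and the size.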

Doing it with Perl

The Perl programming and scripting language (see Resources) is often used as a supremely powerful replacement for the tr and sed commands you've just looked at. A short Perl program, often typed directly on the command line, can sometimes do more than the equivalent tr or sed command line.

Perl's -p option tells it to read and process each line from standard input and print the results to standard output. The -e option lets you specify a Perl expression (a program, actually) on the command line.

Listing 17 shows you how to duplicate the examples in Listing 5 from within Perl.

Listing 17. Using Perl to do tr's job
chrish@dhcp3 [248]$ echo a test | perl -p -e 'tr/t/p/;'
a pesp

chrish@dhcp3 [249]$ echo a test | perl -p -e 'tr/aest/1234/;'
1 4234

chrish@dhcp3 [250]$ echo a test | perl -p -e 'tr/t//d;'
a es

chrish@dhcp3 [251]$ echo a test | perl -p -e 'tr/a-z/A-Z/;'
A TEST

Perl's tr statement has a slightly different syntax, more like sed's search and replace expressions. Note also that you specified the range of lower-case and upper-case characters in the last example.
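
The same -p -e pattern works for substitutions as well as transliteration. This sketch assumes perl is installed and on your PATH; it reuses the group-swapping example from the sed section, written with Perl's $1 and $2 variables in the replacement.

```shell
# Transliterate with a range, as in the last tr example.
upper=$(echo "a test" | perl -p -e 'tr/a-z/A-Z/;')

# Swap two matched groups; $1 and $2 hold the group contents.
swapped=$(echo "hellooooo" | perl -p -e 's/(l+)(o*)$/$2$1/;')

echo "$upper"     # A TEST
echo "$swapped"   # heoooooll
```

The single quotes matter here: they stop the shell from expanding $1 and $2 before Perl sees them.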

The regular expression support in Perl is excellent, and the sed examples above will all work as valid Perl statements. Listing 18 shows you the ls -l example from Listing 16 in Perl; no changes, other than the Perl command-line syntax, were required.

Listing 18. Rearranging ls output with Perl
chrish@dhcp3 [384]$ ls -l | tail | perl -p -e 's/(.([r-][w-][x-]){3})[[:space:]]+[[:digit:]]+[[:space:]]+([A-Za-z0-9_\-\.]+[[:space:]]+){2}([[:digit:]]+)[[:space:]]+.{3} [0-9 ]{2} [0-9 ][0-9]:[0-9][0-9][[:space:]]+(.+)$/\5 (\1) has \4 bytes of data/'
gsrvdir501 (drwx------) has 102 bytes of data
hsperfdata_chrish (drwxr-xr-x) has 68 bytes of data
hsperfdata_root (drwxr-xr-x) has 102 bytes of data
illustrator_activation.plist (-rw-r--r--) has 531 bytes of data
indesign_activation.plist (-rw-r--r--) has 531 bytes of data
objc_sharing_ppc_4294967294 (-rw-------) has 24 bytes of data
objc_sharing_ppc_501 (-rw-------) has 132 bytes of data
objc_sharing_ppc_92 (-rw-------) has 24 bytes of data
photoshop_activation.plist (-rw-r--r--) has 531 bytes of data
serialinfo.plist (-rw-r--r--) has 928 bytes of data

The nice thing about this is that you can perfect your regular expressions using sed or Perl, and you can still use them on systems where only one, or the other, is available. And, with Perl, you've got a full range of programming constructs you can take advantage of for doing even more complex text processing.

Summary

Using powerful tools like sed and Perl, and the magic of regular expressions, you can easily do complex text processing tasks directly on the UNIX command line. This lets you efficiently combine several commands to get your text processing jobs done correctly.

Resources

Learn

  • Cygwin: Learn more about this UNIX environment for Windows.
  • iTerm: iTerm is a good replacement for the Mac OS X Terminal.
  • Perl: Learn more about Perl.
  • Regular-Expressions.info: This site provides regular expressions tutorials, examples, and references.
  • Regular Expression Library: This library has a large repository of regexes.
  • "Regular Expressions in PHP" (developerWorks, January 2006): This tutorial discusses the differences between POSIX and PCRE, and how you can use regular expressions and PHP V5.
  • AIX and UNIX: Want more? The developerWorks AIX and UNIX zone hosts hundreds of informative articles and introductory, intermediate, and advanced tutorials.
  • developerWorks technical events and webcasts: Stay current with developerWorks technical events and webcasts.
  • Podcasts: Tune in and catch up with IBM technical experts.

Get products and technologies

  • IBM trial software: Build your next development project with software for download directly from developerWorks.
