Hone your regexp pattern-building skills

Handy regular expressions for system administration

Michael Stutz, Author, Freelance Developer
Michael Stutz is author of the book The Linux Cookbook, which he also designed and typeset -- using only open source software. His research interests include digital publishing and the future of the book. He has used various UNIX operating systems for 20 years. You can reach him at stutz@dsl.org.

Summary:  Add to your bag of tricks several handy techniques for crafting real-world regular expressions (regexps). Building regexps is a part of the daily life of any administrator. Learning to think in terms of pattern matching, in order to construct successful regexps that return the desired criteria, is a skill that takes both time and practice.

Date:  18 Jul 2006
Level:  Intermediate

Introduction

Every day, a UNIX® administrator needs to build and use regular expressions (regexps) to match patterns of text. Most languages support some implementation of regexps. Some applications (like EMACS) have regexp search capabilities, and you can use regexps with many command-line tools. Whatever your application, the key to building good regexps is to recognize a pattern that fits only the data you need to match, so that nothing else from the input gets in the way.

Toward that end, this article steps through several regexp pattern-building techniques and shows how they can be helpful in various common situations.

Using regular expressions (regexps)

Unless noted, the examples in this article are Portable Operating System Interface-extended (POSIX-extended) regexps. If you use them on the command line (such as with the egrep utility), you should quote them as you would most any regexp. Remember, there are differences between regexp implementations, and you might have to adapt these to the particular tool, application, or language you're using.

Match whole lines

You know that the ^ metacharacter matches the beginning of the line and the $ matches the end of the line -- working together (as in ^$), they match blank lines. (The mirror image of this expression, $^, is an impossibility and it will never match a valid line.) This basic regexp is the basis for many complex regexps, and you should get in the habit of using it if you don't already. Use it to build patterns that find matches based on the entire contents of a line.

This is a good base pattern for searching through the user dictionary file (/usr/dict/words). (Some flavors of UNIX put it in /usr/share/dict/words.)

For example, say you forgot how the word fuchsia is spelled. Does it have an sh or cs in it? All you know for sure is that it begins with fu and ends in ia.

Try searching on this pattern:

$ egrep -i '^fu.*ia$' /usr/dict/words

The -i flag searches for matches regardless of case. In this example, fuchsia, since it is properly spelled, is one of the words returned.
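
The effect of the two anchors is easiest to see on a small word list. The following is a minimal sketch, not part of the original listing: the word list is a hypothetical stand-in for the dictionary file, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical sample list, standing in for the system dictionary file.
words=$(printf '%s\n' fuchsia Fuchsia fuschia Confucian mafia fun)

# Unanchored: counts every line that merely contains "fu...ia" somewhere.
loose=$(printf '%s\n' "$words" | grep -E -i -c 'fu.*ia')

# Anchored with ^ and $: the whole line must begin with "fu" and end with "ia".
strict=$(printf '%s\n' "$words" | grep -E -i '^fu.*ia$')

echo "unanchored matches: $loose"
printf '%s\n' "$strict"
```

The unanchored pattern also counts Confucian, because fu and ia both occur inside it; the anchored pattern keeps only the three spellings that start with fu and end with ia.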


Match lines based on length

Use the brace metacharacters ({}) to specify a certain number of matches to the regexp immediately preceding them, as described in Table 1. When you add them to the whole-line searches just described, you can specify line length.


Table 1. Meaning of brace metacharacters
Example   Description
{X}       Matches the preceding regexp exactly X times.
{X,}      Matches the preceding regexp X or more times.
{X,Y}     Matches the preceding regexp at least X, but not more than Y, times.

Not all implementations of extended regexps support braces. And again, depending on your implementation, you might have to escape them with a backslash first.
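
As a rough illustration of the three brace forms, here is a short sketch that applies each one to whole lines. The sample words are made up for the example, and grep -E is assumed to support braces, as GNU and BSD grep do.

```shell
# Hypothetical sample words, one per line.
sample=$(printf '%s\n' a to cat door horse planet)

# {X}: lines of exactly three characters.
exactly3=$(printf '%s\n' "$sample" | grep -E '^.{3}$')

# {X,}: count of lines with four or more characters.
four_plus=$(printf '%s\n' "$sample" | grep -E -c '^.{4,}$')

# {X,Y}: lines of at least two but not more than four characters.
two_to_four=$(printf '%s\n' "$sample" | grep -E '^.{2,4}$')

echo "exactly 3: $exactly3"
echo "4 or more: $four_plus lines"
printf '%s\n' "$two_to_four"
```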

You can use this regexp to get a rundown of dictionary words sorted by their length. The exact numbers you get depend on the number of words in your local system's dictionary file but, nonetheless, the output will look something like Listing 1. In this example, the most popular word length was nine letters, for which the dictionary has 32,380 matching words. The dictionary contains no words that are 25 letters or longer, and the longest word isn't the 21-letter disestablishmentarian that you might think it would be (there are 81 others equally long, including superincomprehensible and phoneticohieroglyphic); the honor for longest word in the UNIX dictionary is shared among five, including pathologicopsychological.


Listing 1. Counting the number of X-letter words in the dictionary
$ for i in $(seq 1 32)
> do
>   echo "There are" $(egrep '^.{'$i'}$' /usr/dict/words | wc -l) \
>     "$i-letter words in the dictionary."
> done
There are 52 1-letter words in the dictionary.
There are 155 2-letter words in the dictionary.
There are 1351 3-letter words in the dictionary.
There are 5110 4-letter words in the dictionary.
There are 9987 5-letter words in the dictionary.
There are 17477 6-letter words in the dictionary.
There are 23734 7-letter words in the dictionary.
There are 29926 8-letter words in the dictionary.
There are 32380 9-letter words in the dictionary.
There are 30867 10-letter words in the dictionary.
There are 26011 11-letter words in the dictionary.
There are 20460 12-letter words in the dictionary.
There are 14938 13-letter words in the dictionary.
There are 9762 14-letter words in the dictionary.
There are 5924 15-letter words in the dictionary.
There are 3377 16-letter words in the dictionary.
There are 1813 17-letter words in the dictionary.
There are 842 18-letter words in the dictionary.
There are 428 19-letter words in the dictionary.
There are 198 20-letter words in the dictionary.
There are 82 21-letter words in the dictionary.
There are 41 22-letter words in the dictionary.
There are 17 23-letter words in the dictionary.
There are 5 24-letter words in the dictionary.
There are 0 25-letter words in the dictionary.
There are 0 26-letter words in the dictionary.
There are 0 27-letter words in the dictionary.
There are 0 28-letter words in the dictionary.
There are 0 29-letter words in the dictionary.
There are 0 30-letter words in the dictionary.
There are 0 31-letter words in the dictionary.
There are 0 32-letter words in the dictionary.
$ 


Match words

The \< and \> enclosures are useful pattern builders: They enclose a whole word to be matched -- they won't match an enclosed pattern unless that pattern is a word of its own. A word is defined as any number of word-forming characters (numbers, letters, and the underscore characters) that are delineated by a nonword character on both sides. Nonword characters include any of the following:

  • The beginning of the line
  • A whitespace character
  • A punctuation character
  • The end of the line
  • Any other character excluding letters, numbers, or the underscore

These enclosures can be a great timesaver, but they're often underutilized -- probably because not every regexp implementation supports them. If yours does, get in the habit of using them.

Enclose a word to match that word alone, like so:

\<system\>

The regexp in this example won't match the words ecosystem or systemic, nor will it match lines where the pattern system merely appears somewhere inside a longer word -- it matches only those lines where system exists as a word of its own. (It does match the system in system/70, because the slash is a nonword character.)

Combine the enclosures with groupings in parentheses to match parts of words.

To match lines containing words that begin with pre, use:

\<(pre).*\>

The preceding example matches lines containing the words preface and preposterous, but not spread or Dupre.
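
The word enclosures can be tried out on a few sample lines. This is a sketch under some assumptions: the sample text is invented, the \< and \> enclosures and the -o option are extensions found in GNU and BSD grep rather than in POSIX, and the second pattern is a tighter variant of the one above that stops at the end of the word instead of using .* to run on.

```shell
# Hypothetical sample lines.
lines=$(printf '%s\n' \
  'the system is down' \
  'a healthy ecosystem' \
  'systemic risk' \
  'read the preface' \
  'a spread from Dupre')

# Whole-word match: only the first line contains "system" as its own word.
whole=$(printf '%s\n' "$lines" | grep -E '\<system\>')

# Words beginning with "pre": \< pins the match to the start of a word,
# and -o prints only the matched word rather than the whole line.
pre_words=$(printf '%s\n' "$lines" | grep -E -o '\<pre[[:alnum:]_]*')

printf '%s\n' "$whole" "$pre_words"
```

Neither spread nor Dupre is reported, because in both of them pre is preceded by a word-forming character.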


Match doubled words

Here's a quick way to use the word enclosure to match doubled words -- a word followed by some space and the same word again. You can also use a backreference, which is a recursive feature of most contemporary regexp implementations that matches a part of the pattern itself. (Enclose the part of the pattern you want to reference in parentheses, and call the backreference with a backslash followed by the number of the enclosure you're referencing: 1 for the first parentheses grouping, 2 for the second, and so on.)

To find doubled words, search for a word followed by any number of spaces and the exact word again specified by a backreference to the first enclosed parenthetical:

(\<.*\>)( )+\1

This example matches contractions and any kind of word, but it won't match doubled words that are separated by punctuation, such as It's been a long, long time.

To match all doubled words, including those separated by spaces and an optional punctuation character, use this:

(\<.*\>).?( )+\1

If you're grepping these regexps, it's important that you use the -i flag so that you'll find matches regardless of case.
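
Here is one way the doubled-word search might look on the command line. It is a sketch, not the exact regexp above: it restricts the word to letters and adds a closing \> after the backreference so that "the theory" is not reported as a doubled "the". Backreferences in extended regexps are an extension (supported by GNU grep, for example), and the sample text is made up.

```shell
# Hypothetical sample text.
text=$(printf '%s\n' \
  'It was the the best of times' \
  'He explained the theory' \
  'This is is a test' \
  'Nothing doubled here')

# \1 must repeat exactly what the first parenthesized group matched,
# and the trailing \> requires the repeat to be a whole word.
doubled=$(printf '%s\n' "$text" | grep -E -i '\<([[:alpha:]]+)( )+\1\>')

printf '%s\n' "$doubled"
```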


Match hours

Let's move on to another category that you'll constantly encounter: the time and date. There are certain considerations for making regexps that pull out the right patterns.

You can't just search for any two-digit numbers to match minutes and seconds, since they're only counted from 0-59; to match them, you have to bracket the appropriate ranges for the tens and ones columns:

  • To match hours in either of the standard 12- or 24-hour formats, use this:
    (([0-1]?[0-9])|([2][0-3])):([0-5][0-9])(:[0-5][0-9])?
    

  • To match time in 12-hour AM/PM format, with or without seconds, and even match times that don't have a trailing AM or PM identifier, in upper- or lower-case, use this:
    (^|[^0-9])([0-1]?[0-9])((:[0-5][0-9]){1,2}( ?([AP]M|[ap]m))?|( ?([AP]M|[ap]m)))
    

Without the beginning negation statement in the last example, it will match times without colons, which -- depending on your input data -- could be references to mediumwave broadcast stations (known as AM radio in the US), such as 1450 AM.
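
To check the bracketed ranges, the first regexp in the list above can be run against a handful of candidate strings. This sketch assumes each candidate sits on its own line and uses the -x option so that the whole line must match; the sample values are invented.

```shell
# Hypothetical candidates, one per line.
times=$(printf '%s\n' 9:05 23:59:07 24:00 12:60 7:5)

# Hours, minutes, and optional seconds, as in the first regexp above.
pattern='(([0-1]?[0-9])|([2][0-3])):([0-5][0-9])(:[0-5][0-9])?'

# -x accepts a line only if the pattern matches all of it.
valid=$(printf '%s\n' "$times" | grep -E -x "$pattern")

printf '%s\n' "$valid"
```

Only 9:05 and 23:59:07 survive: 24:00 has an out-of-range hour, 12:60 an out-of-range minute, and 7:5 a one-digit minute field.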


Match months

Matching any of the 12 months requires a list separated by the | operator, but dates are sometimes abbreviated in various ways:

  • To find any one of the 12 months in either its full spelling or three-letter abbreviation, use the following (on a single line):
    Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|
    Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?
    

  • You can get fancy and search for variations -- the full spelling or the three-letter abbreviation -- only when it's followed by a space or a period with the following (on a single line):
    Jan(uary| |\.)|Feb(ruary| |\.)|Mar(ch| |\.)|Apr(il| |\.)|May( |\.)|Jun(e| |\.)|
    Jul(y| |\.)|Aug(ust| |\.)|Sep(tember| |\.)|Oct(ober| |\.)|Nov(ember| |\.)|Dec(ember| |\.)
    

Notice that in both of these examples, May is a special exception. It's the only month whose full spelling is the same as its three-letter abbreviation, so a successful match must contain either one of the two variations for the abbreviation -- this is so that words like "Mayflower" won't cause a false positive.

These examples will also fail (by returning false positives) when a matching pattern is preceded by characters other than a space or the beginning of the line. This isn't likely to occur in English prose, but it could happen in, say, program source code where a variable name such as NumOct is in use.

To fix these problems, do the following:

  • Enclose the whole regexp in parentheses and precede it with another qualifier, matching at either the beginning of a line or after a space character, like so (on a single line):
    (^| )(Jan(uary| |\.)|Feb(ruary| |\.)|Mar(ch| |\.)|Apr(il| |\.)|
    May( |\.)|Jun(e| |\.)|Jul(y| |\.)|Aug(ust| |\.)|Sep(tember| |\.)|
    Oct(ober| |\.)|Nov(ember| |\.)|Dec(ember| |\.))
    

  • Another way to do it is to precede the regexp with a qualifier to match non-alphanumeric characters, like this (on a single line):
    ([^A-Za-z0-9])(Jan(uary| |\.)|Feb(ruary| |\.)|Mar(ch| |\.)|
    Apr(il| |\.)|May( |\.)|Jun(e| |\.)|Jul(y| |\.)|Aug(ust| |\.)|
    Sep(tember| |\.)|Oct(ober| |\.)|Nov(ember| |\.)|Dec(ember| |\.))
    

But there's still a potential gotcha -- none of these examples are reliable for searching through passages of English prose, because there's a chance that they'll return false matches, such as with words like "Janelle" or "Augury". To fix that, you must enclose each month in word enclosures.
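
A sketch of the word-enclosure approach follows. Instead of enclosing each month separately, it wraps the whole alternation in one pair of enclosures, which should have the same effect; that shortcut, the variable names, and the sample text are assumptions of this example, and \< and \> require a grep that supports them (GNU grep, for instance).

```shell
# Month names and three-letter abbreviations inside one pair of word enclosures.
months='\<(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\>'

# Hypothetical sample text with several near misses.
text=$(printf '%s\n' \
  'Due in Feb. or March' \
  'Janelle studied Augury' \
  'NumOct = 8' \
  'The Mayflower sailed' \
  'See you in May')

hits=$(printf '%s\n' "$text" | grep -E "$months")

printf '%s\n' "$hits"
```

Janelle, Augury, NumOct, and Mayflower are all rejected, because in each case the month string runs into another word-forming character on one side.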

The beginning of the article said that a good regexp is one that gives you only the data you need to match so that nothing else from the input gets in the way. This wording is deliberate, because when it comes to building regexps, it's all about context. The preceding examples are perfect for some cases, without the additional step of adding word enclosures. In other cases, they can be greatly simplified -- for example, if you're searching log files that contain only numeric data with dates using uppercase letters, a regexp like [A-S] might be all you need to match lines containing the month names.


Match dates

You can combine a few of the quantity matches, as described in Table 1, to match dates.

To match dates in "month, day, year" form, use this regexp (because an apostrophe character is part of the regexp, you have to quote it with double quotes, as shown):

"[A-Za-z]{3,10}\.? [0-9]{1,2}, ([0-9]{4}|'?[0-9]{2})"

This regexp matches nine different date formats:

  1. MONTH [D]D, YY
  2. MONTH [D]D, 'YY
  3. MONTH [D]D, YYYY
  4. MON. [D]D, YY
  5. MON. [D]D, 'YY
  6. MON. [D]D, YYYY
  7. MON [D]D, YY
  8. MON [D]D, 'YY
  9. MON [D]D, YYYY

False positives in this regexp include "Order 99, 99"; to eliminate these, you can combine this regexp with the regexp for months, as described above, so that it matches only real month names. Also, change the numeric ranges to avoid false matches and double the possible formats to 18 by making the comma optional.

This makes for a long regexp. Try it (on a single line):

"([^A-Za-z0-9])(Jan(uary|\.)?|Feb(ruary|\.)?|Mar(ch|\.)?|
Apr(il|\.)?|May|Jun(e|\.)?|Jul(y|\.)?|Aug(ust|\.)?|
Sep(tember|\.)?|Oct(ober|\.)?|Nov(ember|\.)?|
Dec(ember|\.)?) [0-3]?[0-9]{1}(,)? ([0-9]{4}|'?[0-9]{2})"

Again, craft the regexp to meet your needs. It's always easier to match a pattern as it exists in context of the particular input you have -- not as it might exist independent of a data set. Future generations might note that the previous long regexp still has a Y10K bug in it, because the highest possible year it will match is 9999.
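
The difference between the generic date regexp and a month-aware one can be seen on a few sample lines. This is only a sketch: the month alternation here makes the ending or period optional so that a bare abbreviation followed by a space still matches, the leading qualifier also accepts the beginning of the line, and the sample data and variable names are invented.

```shell
# Month alternation with an optional ending or period (an assumption of this sketch).
mon='(Jan(uary|\.)?|Feb(ruary|\.)?|Mar(ch|\.)?|Apr(il|\.)?|May|Jun(e|\.)?|Jul(y|\.)?|Aug(ust|\.)?|Sep(tember|\.)?|Oct(ober|\.)?|Nov(ember|\.)?|Dec(ember|\.)?)'

# Hypothetical sample lines.
dates=$(printf '%s\n' 'Jul 18, 2006' "Sep. 3, '99" 'May 5 2006' 'Order 99, 99' 'no date here')

# Generic form: any 3-10 letters, a day, a comma, and a year.
loose=$(printf '%s\n' "$dates" | grep -E "[A-Za-z]{3,10}\.? [0-9]{1,2}, ([0-9]{4}|'?[0-9]{2})")

# Month-aware form: real month names only, comma optional.
strict=$(printf '%s\n' "$dates" | grep -E "(^|[^A-Za-z0-9])$mon [0-3]?[0-9],? ([0-9]{4}|'?[0-9]{2})")

echo "generic:";     printf '%s\n' "$loose"
echo "month-aware:"; printf '%s\n' "$strict"
```

The generic form reports "Order 99, 99" and misses the comma-less "May 5 2006"; the month-aware form does the reverse.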


Match integers

As you've seen in the last few examples, a range enclosed in brackets is good for matching numbers.

To match an integer of any length, follow a numeric range with +; to include negative values, precede it with an optional negative-sign match (a hyphen):

-?[0-9]+

The preceding example matches zero as well, because 0 is one of the possible characters in the specified range.

Enclosures in parentheses are also good for matching numbers. To match any decimal number, extend the previous regexp with an optional enclosure containing a decimal point followed by one or more numbers:

-?[0-9]+(\.[0-9]+)?

Use braces to specify a decimal number of a certain scale. For example, to match positive numbers only with a scale of five or more decimal places, use this:

(^|[^-0-9.])[0-9]+\.[0-9]{5,}
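
When you try the integer and decimal regexps with grep, note that they begin with a hyphen, which grep would otherwise read as an option. The sketch below passes them with -e to avoid that; the sample values are made up, and -x is used so that each line must match in full.

```shell
# Hypothetical candidates, one per line.
nums=$(printf '%s\n' 42 -7 0 3.14 -0.5 1. .5 abc)

# -e marks the next argument as the pattern, so its leading hyphen
# is not parsed as a command-line option.
ints=$(printf '%s\n' "$nums" | grep -E -x -e '-?[0-9]+')

# Count of lines that are integers or decimal numbers.
decs=$(printf '%s\n' "$nums" | grep -E -x -c -e '-?[0-9]+(\.[0-9]+)?')

printf '%s\n' "$ints"
echo "integers or decimals: $decs"
```

The strings "1." and ".5" are rejected by both patterns, because the optional group requires digits on both sides of the decimal point.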


More real-world matches

Ranges followed by bracketed metacharacters are useful in finding numbers that fit any specific format. Putting together some of the techniques described thus far, you can build regexps to match all kinds of data:

  • To match a US telephone number, use:
    ((\([2-9][0-9]{2}\))? ?|[2-9][0-9]{2}[- ]?)[2-9][0-9]{2}[- ]?[0-9]{4}
    

    This regexp matches US telephone numbers in any of 15 formats:

    1. (NPA) PRE-SUFF
    2. (NPA) PRE SUFF
    3. (NPA) PRESUFF
    4. (NPA)PRE-SUFF
    5. (NPA)PRE SUFF
    6. (NPA)PRESUFF
    7. NPA PRE-SUFF
    8. NPA PRE SUFF
    9. NPA PRESUFF
    10. NPAPRE-SUFF
    11. NPAPRE SUFF
    12. NPAPRESUFF
    13. PRE-SUFF
    14. PRE SUFF
    15. PRESUFF

    It also matches US toll-free WATS numbers; although the "1-" prefix of a 1-800 or other toll-free number isn't part of the match, it matches the 10 digits of the number itself. The same is true for US numbers preceded by a 1 or 1+ and any number of spaces -- the long-distance dialing prefix itself isn't matched, but as long as an actual number follows it, this regexp will pull it out.

  • To match an e-mail address with a two- or three-letter top-level domain, try the following:
    \<[^@[:space:]]+\>@[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,3}
    

  • To match just about all modern-day URLs, use this regexp:
    (((http(s)?|ftp|telnet|news)://|mailto:)[^\(\)[:space:]]+)
    

    This works pretty well, but matching URLs isn't as easy as you might think. A regexp to match any possible URL, as defined in RFC 1738, has been published in "Regexp for URLs" (see the Resources section), and it's huge and intimidating. It should be tucked into a [:url:] class by now (it would be nice to have all kinds of new classes for dealing with similar data categories, such as [:email:]).
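
As a quick check of the telephone-number formats, the following sketch runs a POSIX-extended form of the phone regexp (no Perl-style (?: ) group, plain spaces instead of escaped ones) against a few sample strings. The numbers are fictitious, and -x is used so that each line must consist of nothing but the number.

```shell
# POSIX-extended phone pattern: optional area code, prefix, and suffix.
phone='((\([2-9][0-9]{2}\))? ?|[2-9][0-9]{2}[- ]?)[2-9][0-9]{2}[- ]?[0-9]{4}'

# Fictitious sample numbers, one per line.
numbers=$(printf '%s\n' \
  '(212) 555-1212' \
  '212 555 1212' \
  '2125551212' \
  '555-1212' \
  '123-4567' \
  '(012) 555-1212')

valid=$(printf '%s\n' "$numbers" | grep -E -x "$phone")

printf '%s\n' "$valid"
```

The last two entries are rejected because neither a prefix nor an area code may begin with 0 or 1 in this pattern.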


Conclusion

This article touched on some of the pattern-building techniques for writing regexps and the ways they can be used for matching particular types of data that administrators encounter all the time. In the process, you've been shown a number of handy, real-world regexps that you can add to your administrative arsenal.


Resources

Learn

