Mastering regular expressions in PHP, Part 2: How to process text in PHP

Mastering regular expressions in PHP is easy: just apply greed, sloth, and envy

Here in Part 2 of this "Mastering regular expressions in PHP" series, learn how to solve a variety of difficult text processing problems with a few advanced regular expression (regex) operators.

Martin Streicher (martin.streicher@gmail.com), Editor in Chief, McClatchy Interactive

Martin Streicher is chief technology officer for McClatchy Interactive, editor-in-chief of Linux Magazine, a Web developer, and a regular contributor to developerWorks. He earned a master's degree in computer science from Purdue University and has been programming UNIX-like systems since 1986.



08 January 2008


Although the terms data and information are used interchangeably, there's a big difference between the two. Data is factual. Data is a list of temperatures, a litany of recent sales, or an inventory of parts on hand. Information is insightful. Information is a prediction of the weather, a profit-and-loss statement, and a sales trend. Data is recorded as ones and zeros. Information is coalesced by synapses.

Between data and information lies the software application: the engine that transforms the former to the latter and vice versa. For example, if you buy a book online, a shopping application permutes your information — the book title, your identity, your bank account information — into data, such as an order number, a sales price, details of the credit card transaction, and an adjustment to on-hand inventory. Similarly, the shopping application remakes the data into a pick request for the warehouse and a shipping label and tracking number — information required to effectuate the sale.

Indeed, the complexity of creating an application is directly proportional to the transformations it effects. A Web-site guest book, which translates a name and address to fields in a database, is simple. Meanwhile, an online store, which translates many kinds of information into the data model of the business and converts the data to information to inspire decision-making, is quite elaborate. The art of programming is the adept manipulation of data and information, a skill akin to capturing light in chiaroscuro.

As introduced in Part 1, regexes are one of the most powerful tools for manipulating data. Using a concise shorthand, regexes describe the form of data and decompose it. For example, you could use the following regex to process any temperature in Celsius or Fahrenheit: /^([+-]?[0-9]+)([CF])$/.

The regex matches the beginning of the line (represented by the caret, ^), followed by a positive sign, a negative sign, or neither ([+-]?), followed by an integer ([0-9]+), a scale qualifier — one of Celsius or Fahrenheit ([CF]) — and terminated by the end of line (represented by the dollar sign, $).

In the temperature regex, the beginning-of-line and end-of-line operators are two examples of zero-width assertions, or matches that are positional, not literal. The parentheses aren't literal, either. Instead, embedding a pattern within parentheses captures the text that matches the pattern. Hence, if text matched the entire pattern, the first set of parentheses would yield a string representing a positive or negative integer, such as +49. The second set of parentheses would yield either the letter C or F.
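As a minimal sketch of the captures described above, the following applies the temperature regex with preg_match(). The function name parse_temperature() is a hypothetical helper introduced here for illustration; it is not part of the original article or of PHP.

```php
<?php
// Hypothetical helper: split a reading such as "+49F" into the two
// captured pieces, or return null when the text does not fit the pattern.
function parse_temperature( $reading ) {
    if ( preg_match( '/^([+-]?[0-9]+)([CF])$/', $reading, $matches ) ) {
        // $matches[0] holds the whole match;
        // $matches[1] and $matches[2] hold the two parenthesized captures
        return array( $matches[1], $matches[2] );
    }
    return null;
}

print_r( parse_temperature( '+49F' ) );        // Array ( [0] => +49 [1] => F )
var_dump( parse_temperature( '49 degrees' ) ); // NULL
```

Note that the captures are strings; cast the first one with (int) if you need to do arithmetic on it.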

Part 1 introduces the notion of the regex and the PHP functions available to compare text to patterns and extract matches. Now I delve more deeply into regexes and look at a handful of advanced operators and recipes.

Parentheses to the rescue (again)

Most of the time, you use a set of parentheses to define a subpattern and to capture the text that matches the subpattern. However, parentheses need not capture the subpattern. As in a complex arithmetic formula, you can use parentheses simply to group terms.

Here's an example. Can you tell what kind of data it matches?

/[-a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info)/i

This regex matches host names (albeit solely within the .com, .edu, and .info domains), as you may have predicted. What's different is the addition of ?:. The subpattern qualifier ?: disables capture, leaving the parentheses to clarify the precedence of operation. Here, for example, the phrase (?:\.[-a-z0-9]+)* matches zero or more instances of a period followed by a name, such as ".ibm". Similarly, the phrase \.(?:com|edu|info) expresses a literal period followed by any of the strings com, edu, or info.

Disabling capture may seem pointless, until you realize that capture requires extra processing. If your code processes a lot of data, omitting capture might be worthwhile. Additionally, if your regex is particularly intricate, disabling capture in certain subpatterns can make it easier to extract the subpatterns you're truly interested in.
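To see the difference in practice, this sketch runs the host name regex twice against the same string, once with ordinary parentheses and once with ?:. The sample host name and the variable names are illustrative assumptions.

```php
<?php
$host = 'www.ibm.com';

// Capturing groups: each parenthesized subpattern adds an entry to the result
preg_match( '/[-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info)/i', $host, $with_capture );

// Non-capturing groups: only the overall match is recorded
preg_match( '/[-a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info)/i', $host, $without_capture );

echo count( $with_capture ), "\n";     // 3: the whole match plus two captures
echo count( $without_capture ), "\n";  // 1: the whole match only
print_r( $without_capture );           // Array ( [0] => www.ibm.com )
```

Both patterns accept exactly the same text; only the bookkeeping differs.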

Note: The i modifier at the end of the regex makes all matches within the pattern case-insensitive. The subset a-z, therefore, matches all letters, independent of case.

PHP offers other subpattern modifiers. Using the regex test jig provided in Part 1 (repeated here in Listing 1), match the regex ((?i)edu) against candidate strings "EDU," "edu," and "Edu." If you begin a subpattern with the modifier (?i), matching in the subpattern is case-insensitive. Case-sensitivity is re-enabled as soon as the subpattern ends. (Compare this to the /.../i modifier above, which applies to the entire pattern.)
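If you'd rather try the (?i) modifier without the test jig, here is a small stand-alone sketch. The host names are made-up examples; the point is that only the text inside the subpattern is matched without regard to case.

```php
<?php
// (?i) affects only the parenthesized subpattern that contains it;
// "www.example." must still match in lowercase
$pattern = '/^www\.example\.((?i)edu)$/';

$candidates = array( 'www.example.edu', 'www.example.EDU', 'WWW.example.edu' );

foreach ( $candidates as $candidate ) {
    $result = preg_match( $pattern, $candidate ) ? 'match' : 'no match';
    echo "$candidate: $result\n";
}
// www.example.edu: match
// www.example.EDU: match
// WWW.example.edu: no match
```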

Listing 1. A simple regex test utility
<?php
    //
    // divide the comma-separated list into individual words
    //   the third parameter, -1, permits a limitless number of matches
    //   the fourth parameter, PREG_SPLIT_NO_EMPTY, ignores empty matches
    //
    $words = preg_split( '/,/',  $_REQUEST[ 'words' ], -1, PREG_SPLIT_NO_EMPTY );

    //
    // remove the leading and trailing spaces from each element
    //
    foreach ( $words as $key => $value ) { 
        $words[ $key ] = trim( $value ); 
    }

    //
    // find the words that match the regular expression
    //
    $matches = preg_grep( "/{$_REQUEST[ 'regex' ]}/", $words );

    print_r( $_REQUEST['regex' ] ); 
    echo( '<br /><br />' );
    
    print_r( $words ); 
    echo( '<br /><br />' );
    
    print_r( $matches );
    
    exit;
?>

Another useful subpattern modifier is (?x). It lets you embed whitespace in a subpattern, making the regex easier to read. Thus, the subpattern ((?x) edu | com | info) (notice the spaces between the alternation operators, added for legibility) is the same as (edu|com|info). You can use the global modifier /.../x to embed whitespace and comments in the entire regex, as shown below.

Listing 2. Embed whitespace and comments
$matches = preg_grep( 
            "/
              # note: spaces inside [...] stay literal, even with x
              [-a-z0-9]+              # machine name
              (?: \. [-a-z0-9]+ )*    # subdomains
              \. (?: com | edu | info)# domain
             /xi", $words );

As you can see, you can also combine modifiers as needed. If you need to match a literal space while using (?x) or /.../x, use the metacharacter \s to match any whitespace character or \ (the backslash followed by a space) to match a single space, as in ((?x) hello \ there). Keep in mind that whitespace inside a character class is never ignored: within square brackets, a space is a literal member of the set.
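The following sketch checks the two ways of matching a space under (?x). The ^ and $ anchors and the variable names are additions for illustration, so that each test string must match in full.

```php
<?php
// Inside (?x), unescaped spaces in the pattern are ignored,
// so "\ " (backslash, space) stands for one literal space
$literal_space  = '/^((?x) hello \ there)$/';

// \s accepts any single whitespace character instead
$any_whitespace = '/^((?x) hello \s there)$/';

var_dump( preg_match( $literal_space,  'hello there' ) );    // int(1)
var_dump( preg_match( $literal_space,  "hello\tthere" ) );   // int(0)
var_dump( preg_match( $any_whitespace, "hello\tthere" ) );   // int(1)
var_dump( preg_match( $literal_space,  'hellothere' ) );     // int(0)
```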


Peeking around

The vast majority of regex use validates or decomposes input into individual tidbits stored as data in a repository or acted upon immediately by the application. Processing fields in a form, parsing XML code, and interpreting a protocol are canonical uses.

Another use of regex is formatting: normalizing data or improving its readability. Rather than use regex to find and extract text, formatting uses regex to find and insert text at the proper position.

Here's a useful application of formatting. Assume that a Web form submits a salary in whole dollars to your application. Because you store salary as an integer, your application must strip punctuation from the posted data before it's persisted. However, when the data is retrieved from the repository, you want to reformat it to be readable, using commas. A simple PHP call to convert the dollar amount to a number is shown below.

Listing 3. Convert dollar amounts to a number
$salary = preg_replace( "/[\$\s,]/", '', $_REQUEST[ 'salary' ] );

if ( is_numeric( $salary ) ) {
    // persist the data
}
else {
    // error
}

The call to the preg_replace() function replaces the dollar sign, any whitespace, and every comma with the empty string, yielding what's supposed to be an integer. If the call to is_numeric() validates the input, the data can be stored.
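One caveat: is_numeric() also accepts strings such as "1e5" and "12.5", which are numeric but not whole dollar amounts. The sketch below swaps in ctype_digit(), which accepts only strings made up entirely of digits; that substitution is a suggestion, not the article's method, and salary_to_int() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: strip currency punctuation, then accept the
// result only if every remaining character is a digit.
function salary_to_int( $posted ) {
    $salary = preg_replace( '/[$\s,]/', '', $posted );
    return ctype_digit( $salary ) ? (int) $salary : null;
}

var_dump( salary_to_int( '$ 85,000' ) );  // int(85000)
var_dump( salary_to_int( '85k' ) );       // NULL
var_dump( salary_to_int( '1e5' ) );       // NULL, although is_numeric( '1e5' ) is true
```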

Next, let's reverse the operation to emit the number with a currency symbol and commas to separate hundreds, thousands, and millions. You could write code to find those units, or you can use look ahead and look behind to insert commas at the proper position. The subpattern modifier ?<= denotes look behind (that is, look left) of the current position. The modifier ?= stands for look ahead (look right) of the current position.

So, what's the proper position? Anyplace in the string where there's at least one digit to the left and one or more groups of three digits to the right, excluding the decimal point and the number of cents. Given that rule and the two look-around modifiers, both of which are zero-width assertions, this statement does the trick:

$pretty_print = preg_replace( '/(?<=\d)(?=(?:\d\d\d)+$)/', ',', $salary );

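How does this regex work? Beginning at the start of the string and proceeding through each position, the regex asserts, "Is there at least one digit to the left and one or more groups of three digits, and nothing else, between here and the end of the string?" If so, a comma "replaces" the zero-width assertion. Because the match itself is empty, nothing is removed; the comma is simply inserted at that position.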
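Here is the same idea as a runnable sketch, using a look-behind for one digit and a look-ahead for one or more complete groups of three digits up to the end of the string. The helper name add_commas() and the sample values are illustrative assumptions.

```php
<?php
// Hypothetical helper: insert thousands separators into a string of digits
function add_commas( $digits ) {
    // The pattern matches an empty position, not any characters:
    //   (?<=\d)          a digit sits immediately to the left
    //   (?=(?:\d\d\d)+$) only complete groups of three digits remain
    return preg_replace( '/(?<=\d)(?=(?:\d\d\d)+$)/', ',', $digits );
}

foreach ( array( '999', '1000', '1234567' ) as $salary ) {
    echo '$' . add_commas( $salary ) . "\n";
}
// $999
// $1,000
// $1,234,567
```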

Many complex matches can be dispensed with easily using a strategy similar to the one above. For instance, here's another use of look-ahead that readily solves a common dilemma.

Listing 4. Look-ahead example
$tab_data = preg_replace( '/
    ,                               # look for a comma
    (?=                             # then look ahead for
        (?:[^"]*$)                  # a string with no quotes and eol
        |                           #  -or-
        (?:[^"]*"[^"]*"[^"]*)*$     # a string with balanced quotes
    )                               # 
    /x', "\t", $csv_data );

This preg_replace() instruction transforms a line of comma-separated data into a line of tab-separated data. Wisely, it doesn't replace a comma found within a quoted string.

The regex makes an assertion at every occurrence of a comma (that's the comma at the very beginning of the regex): "Are no quotes ahead or is an even number of quotes ahead?" If the assertion is true, the comma may be replaced with a tab (the \t).
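To watch the assertion at work, this sketch applies a compact form of the same pattern (whitespace and comments removed) to one sample line. The data and the function name csv_to_tabs() are made up for illustration.

```php
<?php
// Hypothetical helper: replace commas that lie outside quoted strings
function csv_to_tabs( $csv_line ) {
    // A comma qualifies only if the rest of the line holds no quotes
    // or holds quotes in balanced pairs
    return preg_replace( '/,(?=(?:[^"]*$)|(?:[^"]*"[^"]*"[^"]*)*$)/', "\t", $csv_line );
}

$csv_data = 'widget,"bolt, 10mm",42';
$tab_data = csv_to_tabs( $csv_data );

// The comma inside "bolt, 10mm" survives; the other two become tabs
var_dump( $tab_data === "widget\t\"bolt, 10mm\"\t42" );  // bool(true)
```

The comma after bolt is followed by exactly one quote, so neither alternative in the look-ahead can succeed and the comma is left alone. This sketch assumes one record per line and no escaped quotes inside fields; for real CSV files, PHP's str_getcsv() or fgetcsv() is usually the safer choice.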

If you don't like the look-around operators, or if you're working in a language that does not provide them, you can embed commas in a number using a traditional regex, although doing so requires many iterations to accomplish. Following is one possible solution.

Listing 5. Embedding commas
$pretty_print = preg_replace( "/[\$\s,]/", '', $_REQUEST[ 'salary' ] );

do {
    $old = $pretty_print;
    $pretty_print = preg_replace( "/(\d)(\d\d\d\b)/", "$1,$2", $pretty_print );
} while ( $old != $pretty_print );

Let's step through the code. First, a salary parameter is stripped of punctuation to simulate a read of an integer from the database. Next, the loop repeats, finding positions where a single digit, (\d), is followed by three digits, (\d\d\d), terminated immediately at a word boundary, designated by \b. A word boundary is another zero-width assertion and is defined to be:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between a word character and a nonword character immediately after the word character.
  • Between a nonword character and a word character immediately after the nonword character.

Hence, the position between a digit and whitespace, a period, or a comma is a valid word boundary, as is the end of the string after a final digit.

Because of the outer loop, the regex essentially advances right to left looking for a digit followed by three digits and a word boundary. If a match is found, a comma is inserted between the two subpatterns. As long as preg_replace() finds a match, the loop must continue, which explains the condition $old != $pretty_print.
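The right-to-left progression is easiest to see by printing the string after every pass. This sketch is the same loop with a pass counter added and a fixed sample value in place of the request parameter; it also uses the strict !== comparison so the two strings are never compared as numbers.

```php
<?php
$pretty_print = '1234567890';   // sample value, already stripped of punctuation
$pass = 0;

do {
    $old = $pretty_print;
    $pretty_print = preg_replace( '/(\d)(\d\d\d\b)/', '$1,$2', $pretty_print );
    echo 'pass ' . ++$pass . ": $pretty_print\n";
} while ( $old !== $pretty_print );

// pass 1: 1234567,890
// pass 2: 1234,567,890
// pass 3: 1,234,567,890
// pass 4: 1,234,567,890
```

Each pass adds one comma, because each new comma creates the word boundary that the next pass needs. The final pass changes nothing, which is what ends the loop.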


Greediness and laziness

Regexes are quite powerful. Sometimes, a little too powerful. For example, consider what happens when the regex ".*" is applied to the string: The author of "Wicked" also wrote "Mirror, Mirror." While you might expect preg_match_all() to return two matches, you may be surprised to find just a single result: "Wicked" also wrote "Mirror, Mirror."

The cause? Unless you specify otherwise, operators such as * (zero or more) and + (one or more) are greedy. If a pattern can continue to match, it will, yielding the largest result possible. To keep matches minimal, you must force certain operators to be lazy. Lazy operations find the shortest match, then stop. To make an operator lazy, add a question-mark suffix. Listing 6 shows an example.

Listing 6. Adding a question-mark suffix
    $text = 'The author of "Wicked" also wrote "Mirror, Mirror."';
    if ( preg_match_all( '/".*?"/', $text, $matches ) ) {
        print_r( $matches[0] );
    }

The snippet above produces:

Array ( [0] => "Wicked" [1] => "Mirror, Mirror." )

The regex ".*?" translates to "match a quote, followed by just enough characters, followed by a quote."
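For a side-by-side comparison, this sketch runs the greedy and lazy forms against the same text. The third pattern, which uses a negated set instead of a lazy dot, is an alternative added here for illustration and is not taken from the listing above.

```php
<?php
$text = 'The author of "Wicked" also wrote "Mirror, Mirror."';

preg_match_all( '/".*"/',    $text, $greedy );   // greedy: runs to the last quote
preg_match_all( '/".*?"/',   $text, $lazy );     // lazy: stops at the nearest quote
preg_match_all( '/"[^"]*"/', $text, $negated );  // negated set: cannot cross a quote

echo count( $greedy[0] ), "\n";   // 1
echo count( $lazy[0] ), "\n";     // 2
echo count( $negated[0] ), "\n";  // 2
```

On this input, the lazy and negated-set patterns return the same two strings; the negated set states the intent ("anything but a quote") outright.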

Sometimes, however, the * operator can be too lazy. Take the following snippet of code, for instance. What does it produce?

Listing 7. A match that captures nothing
if (preg_match( "/([0-9]*)/", "-123", $matches  ) ) {
    print_r( $matches );
}

What did you guess? "123"? "1"? No output? In fact, the output is Array ( [0] => [1] => ), meaning a match was made, but nothing was captured. Why? Recall that the operator * can match zero or more times. Here, the expression [0-9]* matches zero times against the beginning of the string and processing halts.

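To fix the problem, add a zero-width assertion to anchor the match, which forces the regex engine to continue matching; /([0-9]*)\b/ suffices. At the start of "-123" there is no word boundary (the hyphen is not a word character), so the empty match is rejected and the engine moves on to the digits.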
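The sketch below runs the unanchored and anchored patterns against the same input. The third pattern, which swaps * for +, is another possible fix added here as an assumption; it is not the one the text proposes.

```php
<?php
preg_match( '/([0-9]*)/',   '-123', $unanchored );   // succeeds with an empty match
preg_match( '/([0-9]*)\b/', '-123', $anchored );     // must end at a word boundary
preg_match( '/([0-9]+)/',   '-123', $one_or_more );  // must consume at least one digit

var_dump( $unanchored[1] );   // string(0) ""
var_dump( $anchored[1] );     // string(3) "123"
var_dump( $one_or_more[1] );  // string(3) "123"
```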


More tips and tricks

Regexes can solve simple or difficult text processing problems. Begin with a handful of operators and expand your vocabulary as your experience grows. To jump-start your endeavor, here are a handful of tips and tricks.

Make your regexes portable with character classes

You've seen metacharacters, such as \s, that match any whitespace characters. In addition, many regex implementations support predefined classes of characters that are easier to use and portable across multiple written languages. For instance, the character class [:punct:] stands for all punctuation characters in the current locale. You can use [:digit:] in place of 0-9, [:alpha:] in place of a-zA-Z, and [:alnum:] as a more portable replacement for a-zA-Z0-9. These names are valid only inside a set, which is why they appear within a second pair of square brackets. For example, you can strip all punctuation from a string by using the statement:

$clean = preg_replace( "/[[:punct:]]/", '', $string );

The character class is more succinct than spelling out all the punctuation characters. Consult the documentation for your version of PHP for a complete list of character classes.
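As a quick check of the named classes, this sketch strips punctuation from one sample string and keeps only the digits of another. The sample strings are invented for illustration.

```php
<?php
$string = 'Hello, world! It costs $5.00 (approx.)';

// Remove every punctuation character
$clean = preg_replace( '/[[:punct:]]/', '', $string );
echo $clean, "\n";    // Hello world It costs 500 approx

// Named classes can be negated like any other set member:
// remove everything that is not a digit
$digits = preg_replace( '/[^[:digit:]]/', '', '(555) 123-4567' );
echo $digits, "\n";   // 5551234567
```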

Exclude what you're not looking for

As shown in the example that converts comma-separated value (CSV) data to tab-delimited data, it is sometimes easier and more precise to list what you don't want to match. A set that begins with a caret (^) matches any character not included in the set. For example, you can use the regex /[2-9][0-9]{2}[2-9][0-9]{2}[0-9]{4}/ to validate U.S. phone numbers. Using an exclusionary set, you could write the regex as /[^01][0-9]{2}[^01][0-9]{2}[0-9]{4}/, which states the rule (no leading 0 or 1) more plainly. Be aware that the two are interchangeable only if the input is already known to consist solely of digits: [^01] also matches letters, spaces, and punctuation.
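The two phone number patterns can be compared directly. This sketch adds ^ and $ anchors (an assumption, so that the whole string must be a ten-digit number) and tries three made-up inputs; note what the negated set does with a non-digit.

```php
<?php
$explicit  = '/^[2-9][0-9]{2}[2-9][0-9]{2}[0-9]{4}$/';
$exclusion = '/^[^01][0-9]{2}[^01][0-9]{2}[0-9]{4}$/';

foreach ( array( '2125551234', '1125551234', 'x125551234' ) as $number ) {
    printf( "%s  explicit: %d  exclusion: %d\n",
        $number,
        preg_match( $explicit,  $number ),
        preg_match( $exclusion, $number ) );
}
// 2125551234  explicit: 1  exclusion: 1
// 1125551234  explicit: 0  exclusion: 0
// x125551234  explicit: 0  exclusion: 1
```

The last line shows why a negated set suits input that has already been reduced to digits, for example after stripping punctuation.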

Skip the newline

If your input spans multiple lines, a typical regex may not behave as you expect: by default, the dot does not match a newline, ^ matches only at the start of the string, and $ matches only at the end of the string (or just before a final newline). However, if you use the s or m modifier, the regex engine treats the input differently. The former treats a string as a single line, forcing the dot to match a newline. The latter treats the string as multiple lines, where ^ and $ match the beginning and end of any line, respectively. Here's an example: If you set $string = "Hello,\nthere";, the statement preg_match( "/.*/s", $string, $matches) would set $matches[0] to Hello,\nthere. (Removing the s yields Hello, with its comma but nothing past the newline.)
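A short sketch makes the two modifiers concrete. The /^\w+/ pattern used for the m modifier is an illustrative choice that picks out the first word at each place where ^ can match.

```php
<?php
$string = "Hello,\nthere";

preg_match( '/.*/',  $string, $default );  // dot stops at the newline
preg_match( '/.*/s', $string, $dotall );   // dot crosses the newline

var_dump( $default[0] === 'Hello,' );         // bool(true)
var_dump( $dotall[0] === "Hello,\nthere" );   // bool(true)

preg_match_all( '/^\w+/',  $string, $string_start );  // ^ only at the start of the string
preg_match_all( '/^\w+/m', $string, $line_starts );   // ^ at the start of every line

print_r( $string_start[0] );  // Array ( [0] => Hello )
print_r( $line_starts[0] );   // Array ( [0] => Hello [1] => there )
```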

With regular expressions, your imagination and ingenuity are virtually the only limits to what you can achieve.
