Level: Intermediate
Martin Streicher (martin.streicher@gmail.com), Editor in Chief, McClatchy Interactive
08 Jan 2008
Here in Part 2 of this "Mastering regular expressions in PHP" series, learn how to solve a variety of difficult text
processing problems with a few advanced regular expression (regex) operators.
Although the terms data and information are used interchangeably, there's
a big difference between the two. Data is factual. Data is a list of temperatures, a
litany of recent sales, or an inventory of parts on hand. Information is insightful.
Information is a prediction of the weather, a profit-and-loss statement, and a sales
trend. Data is recorded as ones and zeros. Information is coalesced by synapses.
Between data and information lies the software application: the engine that
transforms the former to the latter and vice versa. For example, if you buy a book
online, a shopping application converts your information (the book title, your
identity, your bank account information) into data, such as an order number, a
sales price, details of the credit card transaction, and an adjustment to on-hand
inventory. Similarly, the shopping application remakes the data into a pick request for
the warehouse, a shipping label, and a tracking number: information required to complete the sale.
Indeed, the complexity of creating an application is directly proportional to the
transformations it effects. A Web-site guest book, which translates a name and address
to fields in a database, is simple. Meanwhile, an online store, which translates many
kinds of information into the data model of the business and converts the data to
information to inspire decision-making, is quite elaborate. The art of programming is
the adept manipulation of data and information — a skill akin to capturing light in chiaroscuro.
As introduced in Part 1, regexes are one of the
most powerful tools for manipulating data. Using a concise shorthand, regexes describe
the form of data and decompose it. For example, you could use the following regex to
process any temperature in Celsius or Fahrenheit: /^([+-]?[0-9]+)([CF])$/.
The regex matches the beginning of the line (represented by the caret,
^), followed by a positive sign, a negative sign, or neither
([+-]?), followed by an integer ([0-9]+), a scale qualifier — one of Celsius or Fahrenheit
([CF]) — and terminated by the end of line
(represented by the dollar sign, $).
In the temperature regex, the beginning-of-line and end-of-line operators are two
examples of zero-width assertions, or matches that are positional, not literal.
The parentheses aren't literal, either. Instead, embedding a pattern within parentheses
captures the text that matches the pattern. Hence, if text matched the entire pattern,
the first set of parentheses would yield a string representing a positive or negative
integer, such as +49. The second set of parentheses would yield either the letter C or F.
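To see the two captures in action, here is a minimal sketch that applies the temperature regex with preg_match(). The helper name parse_temperature() and the shape of its return value are assumptions made for illustration; they are not part of the original series.

```php
<?php
// Hypothetical helper (not from the article): split a reading such as "+49F"
// into its signed integer and its scale letter.
function parse_temperature( $text ) {
  if ( preg_match( '/^([+-]?[0-9]+)([CF])$/', $text, $matches ) ) {
    // $matches[0] holds the whole match; [1] and [2] hold the two captures
    return array( 'degrees' => (int) $matches[1], 'scale' => $matches[2] );
  }
  return null; // the text is not a well-formed temperature
}

$reading = parse_temperature( '+49F' );       // degrees 49, scale F
$invalid = parse_temperature( '49 degrees' ); // null
```

Because the pattern is anchored at both ends, any stray text before or after the reading causes the match to fail.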
Part 1 introduces the notion of the regex and the PHP functions available to compare
text to patterns and extract matches. Now I delve more deeply into regexes and
look at a handful of advanced operators and recipes.
Parentheses to the rescue (again)
Most of the time, you use a set of parentheses to define a subpattern and to capture
the text that matches the subpattern. However, parentheses need not capture the
subpattern. As in a complex arithmetic formula, you can use parentheses simply to group terms.
Here's an example. Can you tell what kind of data it matches?
/[-a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info)/i
This regex matches host names (albeit solely within the .com, .edu, and .info domains), as
you may have predicted. What's different is the addition of ?:. The subpattern qualifier ?: disables capture, leaving the
parentheses to clarify the precedence of operation. Here, for example, the phrase (?:\.[-a-z0-9]+)* matches zero or more instances of a period followed by a name, such
as ".ibm" in www.ibm.com. Similarly, the phrase \.(?:com|edu|info)
expresses a literal period followed by any of the strings com, edu, or info.
Disabling capture may seem pointless, until you realize that capture requires extra
processing. If your code processes a lot of data, omitting capture might be worthwhile.
Additionally, if your regex is particularly intricate, disabling capture in certain
subpatterns can make it easier to extract the subpatterns you're truly interested in.
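The difference is easiest to see in the contents of the match array. The following sketch runs three variants of the host name regex against one sample string; the sample host and the variable names are chosen purely for illustration.

```php
<?php
$host = 'www.ibm.com';

// Capturing groups: each set of parentheses adds an element to the result
preg_match( '/[-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info)/i', $host, $with_capture );

// Non-capturing groups: only the overall match is recorded
preg_match( '/[-a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info)/i', $host, $without_capture );

// Mixed: group the subdomains without capture, but capture the top-level domain
preg_match( '/[-a-z0-9]+(?:\.[-a-z0-9]+)*\.(com|edu|info)/i', $host, $tld_only );

print_r( $with_capture );    // three elements: the whole match plus two captures
print_r( $without_capture ); // one element: the whole match
print_r( $tld_only );        // two elements; $tld_only[1] is "com"
```

In the mixed form, the piece you care about is always at index 1, no matter how many grouping parentheses precede it.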
Note: The i modifier at the end of the regex makes
all matches within the pattern case-insensitive. The subset a-z, therefore, matches all letters, independent of case.
PHP offers other subpattern modifiers. Using the regex test jig provided in Part 1
(repeated here in Listing 1), match the regex ((?i)edu) against candidate strings "EDU," "edu," and "Edu." If you
begin a subpattern with the modifier (?i), matching in the
subpattern is case-insensitive. Case-sensitivity is re-enabled as soon as the subpattern
ends. (Compare this to the /.../i modifier above, which applies to the entire pattern.)
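If you'd rather try this from the command line than through the test jig, the short sketch below shows the same behavior. The candidate list and the second pattern (with the trailing -x) are illustrative assumptions, added only to show where case-sensitivity resumes.

```php
<?php
$candidates = array( 'EDU', 'edu', 'Edu', 'com' );

// (?i) applies only inside the parentheses that contain it
$found = preg_grep( '/((?i)edu)/', $candidates );

// Outside the subpattern, matching is case-sensitive again
$inside_only  = preg_match( '/^((?i)edu)-x$/', 'EDU-x' ); // 1: the subpattern ignores case
$outside_case = preg_match( '/^((?i)edu)-x$/', 'EDU-X' ); // 0: "-X" does not match "-x"

print_r( $found ); // EDU, edu, and Edu
```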
Listing 1. A simple regex test utility
<?php
//
// divide the comma-separated list into individual words
// the third parameter, -1, permits a limitless number of matches
// the fourth parameter, PREG_SPLIT_NO_EMPTY, ignores empty matches
//
$words = preg_split( '/,/', $_REQUEST[ 'words' ], -1, PREG_SPLIT_NO_EMPTY );
//
// remove the leading and trailing spaces from each element
//
foreach ( $words as $key => $value ) {
  $words[ $key ] = trim( $value );
}
//
// find the words that match the regular expression
//
$matches = preg_grep( "/{$_REQUEST[ 'regex' ]}/", $words );
print_r( $_REQUEST['regex' ] );
echo( '<br /><br />' );
print_r( $words );
echo( '<br /><br />' );
print_r( $matches );
exit;
?>
Another useful subpattern modifier is (?x). It lets you
embed whitespace in a subpattern, making the regex easier to read. Thus, the subpattern
((?x) edu | com | info) (notice the spaces between the
alternation operators, added for legibility) is the same as (edu|com|info). You can use the global modifier /.../x to embed whitespace and
comments in the entire regex, as shown below.
Listing 2. Embed whitespace and comments
$matches = preg_grep(
  '/
    [-a-z0-9]+                 # machine name
    (?: \. [-a-z0-9]+ )*       # subdomains
    \. (?: com | edu | info )  # domain
  /xi', $words );
As you can see, you can also combine modifiers as needed. Whitespace inside a character class
remains significant even in extended mode, so do not pad a set such as [-a-z0-9] with spaces. If you need to
match a literal space while using (?x), use the
metacharacter \s to match any whitespace character or \ (the backslash followed by a space) to match a single space, as in ((?x) hello \ there).
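The following sketch contrasts the three ways of writing a space in extended mode. The sample strings are assumptions chosen to make each case visible.

```php
<?php
// With (?x), unescaped whitespace in the pattern is ignored...
$ignored = preg_match( '/((?x) hello there)/', 'hellothere' );       // 1
$missed  = preg_match( '/((?x) hello there)/', 'hello there' );      // 0

// ...so a literal space must be escaped with a backslash
$escaped = preg_match( '/((?x) hello \ there)/', 'hello there' );    // 1

// \s also works, but it accepts any whitespace character, such as a tab
$any_space = preg_match( '/((?x) hello \s there)/', "hello\tthere" ); // 1
```

The patterns are written in single-quoted PHP strings so that the backslashes reach the regex engine unchanged.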
Peeking around
The vast majority of regex use validates or decomposes input into individual
tidbits stored as data in a repository or acted upon immediately by the
application. Processing fields in a form, parsing XML code, and interpreting a protocol are canonical uses.
Another use of regex is formatting: normalizing data or improving its readability.
Rather than use regex to find and extract text, formatting uses regex to find and insert text at the proper position.
Here's a useful application of formatting. Assume that a Web form submits a salary in
whole dollars to your application. Because you store salary as an integer, your
application must strip punctuation from the posted data before it's persisted. However,
when the data is retrieved from the repository, you want to reformat it to be readable,
using commas. A simple PHP call to convert the dollar amount to a number is shown below.
Listing 3. Convert dollar amounts to a number
$salary = preg_replace( "/[\$\s,]/", '', $_REQUEST[ 'salary' ] );
if ( is_numeric( $salary ) ) {
  // persist the data
}
else {
  // error
}
The call to the preg_replace() function replaces the dollar
sign, any whitespace, and every comma with the empty string, yielding what's supposed
to be an integer. If the call to is_numeric() validates the input, the data can be stored.
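One caveat: is_numeric() also accepts values such as 12.5 and 1e3, which are numeric but not whole dollars. If you want to insist on digits only, a stricter check is possible. The sketch below is one way to do it; the function name parse_salary() and the choice to return null on failure are assumptions for illustration.

```php
<?php
// Hypothetical helper: return the salary as an integer, or null if the
// cleaned-up text is not made solely of digits.
function parse_salary( $posted ) {
  // strip dollar signs, whitespace, and commas, as in Listing 3
  $digits = preg_replace( '/[$\s,]/', '', $posted );
  // the D modifier keeps $ from matching before a trailing newline
  return preg_match( '/^[0-9]+$/D', $digits ) ? (int) $digits : null;
}

$ok       = parse_salary( '$ 1,234,567' ); // 1234567
$fraction = parse_salary( '12.5' );        // null: the period is not stripped
$exponent = parse_salary( '1e3' );         // null: is_numeric() would accept this
```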
Next, let's reverse the operation to emit the number with a currency symbol and commas
to separate hundreds, thousands, and millions. You could write code to find those
units, or you can use look ahead and look behind to insert commas at the
proper position. The subpattern modifier ?<= denotes
look behind (that is, look left) of the current position. The modifier ?= stands for look ahead (look right) of the current position.
So, what's the proper position? Anyplace in the string where there's at least one
digit to the left and one or more groups of three digits to the right, excluding the
decimal point and the number of cents. Given that rule and the two look-around
modifiers, both of which are zero-width assertions, this statement does the trick:
$pretty_print = preg_replace( '/(?<=\d)(?=(?:\d\d\d)+$)/', ',', $salary );
How does this regex work? Beginning at the start of the string and proceeding
through each position, the regex asserts, "Is there at least one digit to the left and
one or more groups of three digits to the right?" If so, a comma "replaces" the zero-width assertion.
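Here is a small sketch of the look-around substitution applied to a few whole-dollar values. The repeated group of three digits is written as (?:\d\d\d)+ so that the + applies to the group, and the wrapper function add_commas() is a hypothetical name used only for this example.

```php
<?php
// Hypothetical wrapper around the look-behind/look-ahead substitution
function add_commas( $digits ) {
  // insert a comma wherever a digit sits to the left and one or more
  // complete groups of three digits run to the end of the string
  return preg_replace( '/(?<=\d)(?=(?:\d\d\d)+$)/', ',', $digits );
}

$small  = add_commas( '999' );         // 999 (no position qualifies)
$medium = add_commas( '1234' );        // 1,234
$large  = add_commas( '1234567' );     // 1,234,567
$priced = '$' . add_commas( '85000' ); // $85,000
```

Nothing is consumed by either assertion, so every digit of the original string survives; only commas are added. For everyday use, PHP's built-in number_format() does the same job.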
Many complex matches can be dispensed with easily using a strategy similar to the one
above. For instance, here's another use of look-ahead that readily solves a common
dilemma.
Listing 4. Look-ahead example
$tab_data = preg_replace( '/
, # look for a comma
(?= # then look ahead for
(?:[^"]*$) # a string with no quotes and eol
| # -or-
(?:[^"]*"[^"]*"[^"]*)*$ # a string with balanced quotes
) #
/x', "\t", $csv_data );
This preg_replace() instruction transforms a line of
comma-separated data into a line of tab-separated data. Wisely, it doesn't replace a
comma found within a quoted string.
The regex makes an assertion at every occurrence of a comma (that's the comma at the
very beginning of the regex): "Are no quotes ahead or is an even number of quotes
ahead?" If the assertion is true, the comma may be replaced with a tab (the \t).
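To watch the assertion accept some commas and reject others, try the regex on a line that contains a quoted field. In this sketch the pattern from Listing 4 is collapsed onto one line, and the sample record is an assumption made up for the demonstration.

```php
<?php
$csv_data = 'Maguire,"Mirror, Mirror",2003';

// same pattern as Listing 4, without the whitespace and comments
$tab_data = preg_replace(
  '/,(?=(?:[^"]*$)|(?:[^"]*"[^"]*"[^"]*)*$)/', "\t", $csv_data );

// the comma inside the quoted title has one quote ahead of it (an odd
// number), so the look-ahead fails and that comma is left alone
$fields = explode( "\t", $tab_data ); // Maguire, "Mirror, Mirror", 2003
```

Note that this approach assumes the quotes on each line are balanced; a line with a stray quote leaves some commas unconverted.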
If you don't like the look-around operators, or if you're working in a language that
does not provide them, you can embed commas in a number using a traditional regex,
although doing so requires many iterations to accomplish. Following is one possible solution.
Listing 5. Embedding commas
$pretty_print = preg_replace( "/[\$\s,]/", '', $_REQUEST[ 'salary' ] );
do {
  $old = $pretty_print;
  $pretty_print = preg_replace( "/(\d)(\d\d\d\b)/", "$1,$2", $pretty_print );
} while ( $old != $pretty_print );
Let's step through the code. First, a salary parameter is stripped of punctuation to
simulate a read of an integer from the database. Next, the loop repeats, finding
positions where a single digit ((\d)) is followed by three
digits ((\d\d\d)) terminated immediately at a word boundary,
designated by \b. A word boundary is another zero-width assertion and is defined to be:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between a word character and a nonword character immediately after the word character.
- Between a nonword character and a word character immediately after the nonword character.
Hence, the position between a digit and adjacent whitespace, a period, or a comma is a valid word boundary.
Because of the outer loop, the regex essentially advances right to left looking for a
digit followed by three digits and a word boundary. If a match is found, a comma is
inserted between the two subpatterns. As long as preg_replace() finds a match, the loop must continue, which explains
the condition $old != $pretty_print.
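Recording the string after each pass makes the right-to-left progress visible. This sketch starts from a fixed sample value rather than a request parameter, and the $passes array is an addition for illustration only.

```php
<?php
$pretty_print = '1234567';
$passes = array();

do {
  $old = $pretty_print;
  // each pass can only add a comma in front of a three-digit group that
  // already ends at a word boundary (the end of the string or a comma)
  $pretty_print = preg_replace( '/(\d)(\d\d\d\b)/', '$1,$2', $pretty_print );
  $passes[] = $pretty_print;
} while ( $old !== $pretty_print );

print_r( $passes ); // 1234,567 then 1,234,567 then 1,234,567 (no change, so stop)
```

The final pass changes nothing, which is what ends the loop. The strict comparison (!==) is used here so that the two strings are always compared as text.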
Greediness and laziness
Regexes are quite powerful. Sometimes, a little too powerful. For example, consider
what happens when the regex /".*"/ is applied to the string
The author of "Wicked" also wrote "Mirror, Mirror." While you might expect preg_match_all() to return two matches, you may be surprised to find
just a single result: "Wicked" also wrote "Mirror, Mirror."
The cause? Unless you specify otherwise, operators such as *
(zero or more) and + (one or more) are greedy. If a
pattern can continue to match, it will, yielding the largest result possible. To keep
matches minimal, you must force certain operators to be lazy. Lazy
operations find the shortest match, then stop. To make an operator lazy, add a
question-mark suffix. Listing 6 shows an example.
Listing 6. Adding a question-mark suffix
$text = 'The author of "Wicked" also wrote "Mirror, Mirror."';
if ( preg_match_all( '/".*?"/', $text, $matches ) ) {
print_r( $matches[0] );
}
The snippet above produces:
Array ( [0] => "Wicked" [1] => "Mirror, Mirror." )
The regex ".*?" translates to "match a quote, followed by
just enough characters, followed by a quote."
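For a side-by-side comparison, the sketch below runs the greedy and lazy forms against the same text. It also includes a third variant, a negated set, which is a common alternative to laziness and is offered here as an option rather than as the article's method.

```php
<?php
$text = 'The author of "Wicked" also wrote "Mirror, Mirror."';

preg_match_all( '/".*"/',    $text, $greedy );  // runs to the last quote
preg_match_all( '/".*?"/',   $text, $lazy );    // stops at the nearest quote
preg_match_all( '/"[^"]*"/', $text, $negated ); // cannot cross a quote at all

print_r( $greedy[0] );  // one element: "Wicked" also wrote "Mirror, Mirror."
print_r( $lazy[0] );    // two elements: "Wicked" and "Mirror, Mirror."
print_r( $negated[0] ); // the same two elements as the lazy form
```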
Sometimes, however, the * operator can match too little. Take the
following snippet of code, for instance. What does it produce?
Listing 7. A match that captures nothing
if ( preg_match( "/([0-9]*)/", "-123", $matches ) ) {
  print_r( $matches );
}
What did you guess? "123"? "1"? No output? In fact, the output is Array ( [0] => [1] => ), meaning a match was made, but nothing was
captured. Why? Recall that the operator * can match zero or
more times. Here, the expression [0-9]* matches zero times
against the beginning of the string and processing halts.
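To fix the problem, add a zero-width assertion to anchor the match, which forces the
regex engine to continue matching; /([0-9]*)\b/ suffices. No word boundary exists before the
hyphen, so the empty match at the start is rejected and the engine moves on to match 123.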
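The sketch below compares the original pattern with the anchored one. It also shows a third option, replacing * with +, which is a simpler remedy whenever at least one digit is required; that variant is a suggestion, not something from the text above.

```php
<?php
preg_match( '/([0-9]*)/',   '-123', $star );     // empty match at the hyphen
preg_match( '/([0-9]*)\b/', '-123', $anchored ); // must end at a word boundary
preg_match( '/([0-9]+)/',   '-123', $plus );     // must contain a digit

print_r( $star );     // both elements are empty strings
print_r( $anchored ); // both elements are 123
print_r( $plus );     // both elements are 123
```

Be aware that the word-boundary fix depends on the input: in a string that begins with a letter, a boundary does exist at the very start, so the anchored pattern can still match nothing there.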
More tips and tricks
Regexes can solve simple or difficult text processing problems. Begin with a
handful of operators and expand your vocabulary as your experience grows. To
jump-start your endeavor, here are a handful of tips and tricks.
Make your regexes portable with character classes
You've seen metacharacters, such as \s, that match any
whitespace characters. In addition, many regex implementations support predefined
classes of characters that are easier to use and portable across multiple written
languages. For instance, the character class [:punct:]
stands for all punctuation characters in the current locale. You can use [[:digit:]] in place of [0-9], and [[:alpha:]] is a more portable replacement for [a-zA-Z]. Note that a class name is always written inside a bracketed set. For example, you can strip all punctuation from a string by using the statement:
$clean = preg_replace( "/[[:punct:]]/", '', $string );
The character class is more succinct than spelling out all the punctuation characters.
Consult the documentation for your version of PHP for a complete list of character classes.
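A few more uses of character classes are sketched below, including a negated class. The sample sentence is an assumption, and it sticks to plain ASCII so the results don't depend on locale settings.

```php
<?php
$string = 'Hello, world! (It is 2008.)';

// remove every punctuation character
$clean = preg_replace( '/[[:punct:]]/', '', $string );  // Hello world It is 2008

// keep only the digits by removing everything that is not one
$digits = preg_replace( '/[^[:digit:]]/', '', $string ); // 2008

// test that a string consists of letters only
$letters_only = preg_match( '/^[[:alpha:]]+$/', 'Wicked' ); // 1
```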
Exclude what you're not looking for
As shown in the example that converts comma-separated values (CSV) to tab-delimited data, it is
sometimes easier and more precise to list what you don't want to match. A set
that begins with a caret (^) matches any character not
included in the set. For example, you can use the regex
/[2-9][0-9]{2}[2-9][0-9]{2}[0-9]{4}/ to validate U.S. phone numbers. Using an
exclusionary set, you could write the regex as /[^01][0-9]{2}[^01][0-9]{2}[0-9]{4}/, which states the rule (no leading 0 or 1) more directly. Be aware that the two are
equivalent only when the input is already known to contain nothing but digits, because [^01] also matches letters and punctuation.
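The following sketch tests both phone number regexes against a valid number, a number with a leading 1, and a string that starts with a letter. The ^ and $ anchors and the sample values are additions for this example, so that each pattern must account for the whole string.

```php
<?php
$explicit  = '/^[2-9][0-9]{2}[2-9][0-9]{2}[0-9]{4}$/';
$exclusion = '/^[^01][0-9]{2}[^01][0-9]{2}[0-9]{4}$/';

$valid     = '6505551212';
$leading_1 = '1505551212';
$bogus     = 'x505551212';

$results = array(
  'valid'     => array( preg_match( $explicit, $valid ),     preg_match( $exclusion, $valid ) ),     // 1, 1
  'leading_1' => array( preg_match( $explicit, $leading_1 ), preg_match( $exclusion, $leading_1 ) ), // 0, 0
  'bogus'     => array( preg_match( $explicit, $bogus ),     preg_match( $exclusion, $bogus ) ),     // 0, 1
);
```

The last row is the one to watch: an exclusionary set accepts everything you did not list, so it is safest on input that has already been reduced to digits.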
Skip the newline
If your input spans multiple lines, a typical regex may not suffice because, by default, the dot
stops at a newline, and ^ and $ anchor only at the start and end of the entire string (with $ also allowed just before a final newline). However, if you use
the s or m modifier, the regex
engine treats the input differently. The former treats a string as a single line,
forcing the dot to match a newline. The latter treats the
string as multiple lines, where ^ and $ match the beginning and end of any line, respectively. Here's an
example: If you set $string = "Hello,\nthere";, the
statement preg_match( "/.*/s", $string, $matches) would set
$matches[0] to Hello,\nthere.
(Removing the s yields Hello, with its trailing comma.)
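The sketch below shows both modifiers at work on the same two-line string. The \w+ pattern used to demonstrate the m modifier is an illustrative choice.

```php
<?php
$string = "Hello,\nthere";

// s modifier: the dot may cross the newline
preg_match( '/.*/',  $string, $default ); // $default[0] is "Hello,"
preg_match( '/.*/s', $string, $dotall );  // $dotall[0] is the whole string

// m modifier: ^ matches at the start of every line, not just the first
preg_match_all( '/^\w+/',  $string, $first_only ); // Hello
preg_match_all( '/^\w+/m', $string, $every_line ); // Hello and there
```

The two modifiers are independent, and you can combine them (/sm) when you need both behaviors.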
With regular expressions, your imagination and ingenuity are virtually the only limits to what you can achieve.
About the author
Martin Streicher is chief technology officer for McClatchy Interactive, editor-in-chief of Linux Magazine, a Web developer, and a regular contributor to developerWorks. He earned a master's degree in computer science from Purdue University and has been programming UNIX-like systems since 1986.