Level: Intermediate
Martin Streicher (martin.streicher@gmail.com), Editor in Chief, McClatchy Interactive
01 Jan 2008
Pattern matching is such a common chore for software that a special
shorthand — regular expressions — has evolved to make light work of the
task. Learn how to use this shorthand in your code here in Part 1 of this "Mastering
regular expressions in PHP" series.
All machines consume input, perform some sort of work, and yield output. A telephone,
for example, converts sound energy to an electrical signal and back again to audio to
enable conversation. An engine imbibes fuel (steam, fission, petrol, or elbow grease)
and transforms it into work. And a blender devours rum, ice, lime, and curacao, and
stirs vigorously to produce a Mai Tai. (Or, if you prefer something
more metropolitan, try some champagne and pear nectar to enjoy a Bellini. The blender
is truly a flexible and remarkable machine.)
Because software transforms data, each application is also a machine — albeit a
"virtual" one, given the absence of physical parts. A compiler, for instance, expects
source code as input and transmutes it to binary code suitable for execution. A weather
modeler yields predictions based on historical measurements. And an image editor
consumes and emits pixels, applying rules to each pixel or groups of pixels to, say, sharpen or stylize an image.
Just like any other machine, a software application expects certain raw
materials, such as a list of numbers, data encapsulated in an XML schema, or a
protocol. If a program is fed the wrong material — divergent in type or form
— the result is likely to be unpredictable, even catastrophic. As the adage
says, "Garbage in, garbage out."
Virtually all nontrivial problems require you to filter good data from bad,
reject bad data to prevent errant output, or both. This is certainly the case for a PHP
Web application. Whether input comes from a manual form or a programmatic Asynchronous
JavaScript + XML (Ajax) request, the program must vet the incoming information before
any computation can occur. A numeric value may have to lie within a certain range or be
restricted to a whole number. A value may need to match a specific format, such as a
postal delivery code. For example, a U.S. ZIP code is five digits plus an
optional "Plus 4" qualifier composed of a hyphen and four additional digits. Other
strings may have to be a certain number of characters, too, such as two letters for a
U.S. state abbreviation. Strings are particularly nefarious: A PHP application must
remain vigilant against a malicious actor embedding SQL queries, JavaScript code, or
other code capable of altering the behavior of the application or circumventing security.
But how does a program tell whether input is numeric or conforms to a convention, such
as a postal code? Fundamentally, performing a match requires a small parser —
craft a state machine, read input, process tokens, monitor state, and yield a result.
However, even a simple parser can be painful to create and maintain.
Luckily, pattern-matching analyses are so commonly required in computing that a special
shorthand and, yes, engine have evolved over time (since the dawn of UNIX® or so)
to make light work of the chore. A regular expression (regex) describes patterns in a concise,
readable notation. Given a regex and a datum, a regex engine
yields whether the datum matches a pattern and, if a match was found, what matched.
Here's a brief example of applying a regex, drawn from the UNIX
command-line utility grep, which searches for a specified
pattern among the content of one or more UNIX text files. The command grep -i -E '^Bat' searches for the beginning of a line (indicated with the caret, ^), followed
immediately by the upper- or lowercase letters b, a, and t (the -i option ignores case in pattern matches, so B and b
are equivalent, for instance). Hence, given the file heroes.txt:
Listing 1. heroes.txt
Catwoman
Batman
The Tick
Black Cat
Batgirl
Danger Girl
Wonder Woman
Luke Cage
The Punisher
Ant Man
Dead Girl
Aquaman
SCUD
Blackbolt
Martian Manhunter
The aforementioned grep command would yield two matches: Batman and Batgirl.
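For comparison, the same search can be sketched in PHP, the language used in the rest of this article. This is a minimal illustration rather than part of the original example: the $heroes array is an assumed in-memory copy of Listing 1, and preg_grep() is one of the PCRE functions described later.

```php
<?php
// Assumed in-memory copy of heroes.txt (Listing 1), one name per element.
$heroes = [
    'Catwoman', 'Batman', 'The Tick', 'Black Cat', 'Batgirl',
    'Danger Girl', 'Wonder Woman', 'Luke Cage', 'The Punisher',
    'Ant Man', 'Dead Girl', 'Aquaman', 'SCUD', 'Blackbolt',
    'Martian Manhunter',
];

// ^Bat anchors the match at the start of each string;
// the trailing i modifier ignores case, much like grep -i.
$found = preg_grep('/^Bat/i', $heroes);

// preg_grep() keeps the original array keys, so reindex before printing.
$found = array_values($found);
print_r($found);
```

Names such as Black Cat and Blackbolt begin with B but not with Bat, so they are skipped.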
Regular expressions
PHP offers two regex programming interfaces, one for Portable Operating
System Interface (POSIX) and another for Perl Compatible Regular Expressions (PCRE). By
and large, the latter interface is preferred, because PCRE is much more powerful than
the POSIX implementation, offering all the operators found in Perl. (The POSIX ereg
functions were later deprecated in PHP 5.3 and removed in PHP 7.) Read the
PHP documentation to learn more about the POSIX regex function calls.
Here, I focus on the PCRE features.
A PHP PCRE regex contains operators to match against specific characters
and other operators; against a specific location, such as the start or end of a string;
or against the beginning or end of a word. A regex can also describe alternates, which you might describe as "this" or
"that"; fixed-, variable-, or indefinite-length repetition; sets of characters (for
example, "any of the letters from a to m"); and classes, or kinds
of characters (printable characters or punctuation), among other techniques. Special
operators in regexes also permit grouping — a way to apply an operator to other operators en masse.
Table 1 shows some common regex operators. You can concatenate and
combine the primitives in Table 1 (and other operators) to
build (very) complex regexes.
Table 1. Common regex operators
| Operator | Purpose |
|---|---|
| . (period) | Match any single character (except a newline, by default) |
| ^ (caret) | Match the empty string that occurs at the beginning of a line or string |
| $ (dollar sign) | Match the empty string that occurs at the end of a line or string |
| A | Match an uppercase letter A |
| a | Match a lowercase letter a |
| \d | Match any single digit |
| \D | Match any single nondigit character |
| \w | Match any single word character (a letter, a digit, or the underscore); similar to the POSIX class [:alnum:], which excludes the underscore |
| [A-E] | Match any of uppercase A, B, C, D, or E |
| [^A-E] | Match any character except uppercase A, B, C, D, or E |
| X? | Match zero or one capital letter X |
| X* | Match zero or more capital Xes |
| X+ | Match one or more capital Xes |
| X{n} | Match exactly n capital Xes |
| X{n,m} | Match at least n and no more than m capital Xes; if you omit m (X{n,}), the expression matches at least n Xes |
| (abc|def)+ | Match a sequence of one or more occurrences of abc or def; abc, def, and abcdef would all match |
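To see how these operators combine, here is a small sketch that checks the U.S. ZIP code format described earlier (five digits plus an optional hyphen and four digits). The function name looksLikeZip() and the sample strings are assumptions made for this illustration; only preg_match() comes from PHP itself.

```php
<?php
// looksLikeZip() is a hypothetical helper, named here only for illustration.
function looksLikeZip(string $candidate): bool
{
    // ^ and $ anchor the pattern, \d{5} requires five digits,
    // and (-\d{4})? makes the "Plus 4" suffix optional.
    return preg_match('/^\d{5}(-\d{4})?$/', $candidate) === 1;
}

foreach (['27513', '27513-1234', '2751', '27513-12'] as $candidate) {
    echo $candidate, ': ', looksLikeZip($candidate) ? 'ok' : 'rejected', "\n";
}

// Alternation plus repetition: one or more occurrences of abc or def.
$repeated = preg_match('/^(abc|def)+$/', 'abcdefabc') === 1;

// Negated set: any single character other than A through E.
$outsideSet = preg_match('/^[^A-E]$/', 'F') === 1;
```

The first two candidates pass; the last two fail because they have too few digits in one part or the other.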
Here's an example of a common use of a regex. Say your Web site requires
each user to create a login. Each user name must be at least three but not more than
10 characters long, contain only letters, digits, and underscores, and begin with a letter. To enforce these
specifications, you could use the following regex to validate the user name
when it's submitted to your application: ^[A-Za-z][A-Za-z0-9_]{2,9}$.
The caret matches the beginning of the string. The first set, [A-Za-z], represents any letter. The second set, [A-Za-z0-9_]{2,9}, represents a series of at least two and up to
nine of any letter, any digit, and the underscore. And the dollar sign ($) matches the end of the string.
At first glance, the dollar sign may seem unnecessary, but it's critical. If
you omit it, your regex would match any string that begins with a letter,
continues with two to nine letters, digits, or underscores, and ends with any number of any other characters.
In other words, without the dollar sign to anchor the end of the string, a very long
string with a matching prefix, such as "martin1234-cruft," would yield a false positive.
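The effect of the trailing anchor is easy to check. The following sketch runs the user name pattern with and without the dollar sign against a few sample strings (the samples other than "martin1234-cruft" are assumptions chosen for illustration).

```php
<?php
$anchored   = '/^[A-Za-z][A-Za-z0-9_]{2,9}$/';
$unanchored = '/^[A-Za-z][A-Za-z0-9_]{2,9}/';   // same pattern, no trailing $

foreach (['martin', 'ab', '1happy', 'martin1234-cruft'] as $name) {
    printf(
        "%-18s anchored: %-9s unanchored: %s\n",
        $name,
        preg_match($anchored, $name) === 1 ? 'match' : 'no match',
        preg_match($unanchored, $name) === 1 ? 'match' : 'no match'
    );
}

// In PCRE, $ also matches just before a final newline.
// The D modifier restricts $ to the very end of the subject.
$withNewline = preg_match($anchored, "martin\n");
$strictEnd   = preg_match('/^[A-Za-z][A-Za-z0-9_]{2,9}$/D', "martin\n");
```

Only "martin1234-cruft" behaves differently between the two patterns: the unanchored version accepts it on the strength of its first 10 characters. The last two lines show a related subtlety: a subject with a trailing newline passes the plain pattern but fails once the D modifier is added.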
Programming PHP and regexes
PHP provides functions to find matches in text, to replace each match with other text
(a la search and replace), and to find matches among the elements of a list. The functions are:
- preg_match()
- preg_match_all()
- preg_replace()
- preg_replace_callback()
- preg_grep()
- preg_split()
- preg_last_error()
- preg_quote()
To demonstrate the functions, let's write a small PHP application that searches a list
of words for a specific pattern, where the words and the regex are
provided by a traditional Web form, and the results are echoed to the browser using the
simple print_r() function. Such a little program is useful
if you want to test or refine a regex.
Listing 2 shows the PHP code. All the input is provided through a
simple HTML form. (The corresponding form is not shown, and code to trap errors in the PHP code has been omitted for brevity.)
Listing 2. Compare text to a pattern
<?php
//
// divide the comma-separated list into individual words
// the third parameter, -1, permits a limitless number of matches
// the fourth parameter, PREG_SPLIT_NO_EMPTY, ignores empty matches
//
$words = preg_split( '/,/', $_REQUEST[ 'words' ], -1, PREG_SPLIT_NO_EMPTY );

//
// remove the leading and trailing spaces from each element
//
foreach ( $words as $key => $value ) {
    $words[ $key ] = trim( $value );
}

//
// find the words that match the regular expression
// (the {$...} form replaces the ${...} form, which PHP 8.2 deprecates)
//
$matches = preg_grep( "/{$_REQUEST[ 'regex' ]}/", $words );

print_r( $_REQUEST[ 'regex' ] );
echo( '<br /><br />' );
print_r( $words );
echo( '<br /><br />' );
print_r( $matches );
exit;
?>
First, the string of comma-separated words is divided into individual elements using
the preg_split() function. This function divides the string
at every point that matches the provided regex. Here, the regex is simply , (a comma, the eponymous delimiter of a
comma-separated list). The leading and trailing slash shown in the code simply indicate
the start and end of the regex.
The third and fourth arguments of preg_split() are optional,
but each is useful. Supply a positive integer, n, for the third argument to return at most
n substrings, with the last one holding the unsplit remainder of the string; or supply -1 for no limit.
If you specify a fourth argument, the flag PREG_SPLIT_NO_EMPTY, preg_split() disposes of any empty results.
Next, each element in the list of comma-separated words is trimmed (leading and
trailing whitespace is elided) through the trim() function,
then compared to the supplied regex. The function, preg_grep(), makes processing a list very easy: Simply provide the
pattern as the first argument and an array of words to match as the second argument.
The function returns an array of the matching elements, indexed by their original keys.
For example, if you type the regex ^[A-Za-z][A-Za-z0-9_]{2,9}$ as the pattern and a list of words of
varied length, you might get something like Listing 3.
Listing 3. Result of a simple regex
^[A-Za-z][A-Za-z0-9_]{2,9}$
Array ( [0] => martin [1] => 1happy [2] => hermanmunster )
Array ( [0] => martin )
By the way, you can invert the preg_grep() operation and
find elements that don't match the pattern (the same as grep
-v on the command line) with the optional flag PREG_GREP_INVERT. Replacing the preg_grep() call in Listing 2 with
$matches = preg_grep( "/{$_REQUEST[ 'regex' ]}/", $words, PREG_GREP_INVERT ) and
reusing the input of Listing 3 yields Array ( [1] => 1happy [2] =>
hermanmunster ).
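If you'd like to try these calls without building the form, the following sketch substitutes hard-coded strings for the form fields. The $input and $pattern values are assumptions standing in for the words and regex request parameters.

```php
<?php
// Assumed stand-ins for the form fields used in Listing 2.
$input   = 'martin, 1happy,,hermanmunster,';
$pattern = '/^[A-Za-z][A-Za-z0-9_]{2,9}$/';

// Split on commas; PREG_SPLIT_NO_EMPTY drops the two empty pieces.
$words = preg_split('/,/', $input, -1, PREG_SPLIT_NO_EMPTY);
$words = array_map('trim', $words);

// Elements that match, and (with PREG_GREP_INVERT) elements that do not.
// Both results keep the keys of the original array.
$accepted = preg_grep($pattern, $words);
$rejected = preg_grep($pattern, $words, PREG_GREP_INVERT);

// A positive limit caps the number of pieces; the last piece holds the rest.
$firstAndRest = preg_split('/,/', 'a,b,c,d', 2);

print_r($accepted);
print_r($rejected);
print_r($firstAndRest);
```

Here array_map() does the same job as the foreach loop in Listing 2; either form works.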
Decomposing strings
The functions preg_split() and preg_grep() are handy little functions. The former can decompose a
string into substrings if the substrings are separated by a predictable pattern. The
latter can filter a list quickly.
But what happens if a string must be decomposed using one or more complex rules? For
instance, U.S. phone numbers often appear as "(305) 555-1212," "305-555-1212," or
"305.555.1212." If you remove the punctuation and spaces, all reduce to 10 digits, which is easy
to recognize using the regex \d{10}. However,
the three-digit area code and three-digit prefix of phone numbers in the United States
cannot start with a zero or a one (because both are prefixes for nonlocal calls).
Rather than split the numeric sequence into individual digits and write complex code, a
regex can test for validity.
Listing 4 shows a snippet of code to perform the task.
Listing 4. Determine whether a phone number is a valid U.S. phone number
<?php
$punctuation = preg_quote( "().-" );
$number = preg_replace( "/[\s$punctuation]/", '', $_REQUEST[ 'number' ] );

$valid = '/^[2-9][0-9]{2}[2-9][0-9]{2}[0-9]{4}$/';
if ( preg_match( $valid, $number ) == 1 ) {
    echo( "{$_REQUEST[ 'number' ]} is valid<br />" );
}
exit;
?>
Let's step through the code:
- As shown in Table 1, regexes use a small set of operators, such as brackets ([ ]), to name a set. If you want to match such an operator in subject text, you must "escape" the operator in the regex with a preceding backslash (\). After you escape the operator, it matches like any other literal. For instance, if you want to match a literal period, say, as found in a fully qualified host name, write \.. Optionally, you can pass a string to preg_quote() to automatically escape any regex operator it finds, as in the first statement. If you echo $punctuation after that statement, you should see \(\)\.\- (versions of PHP before 5.3 leave the hyphen unescaped and print \(\)\.-).
- The second statement strips the separator characters from the phone number. The preg_replace() function replaces any occurrence of a character in the bracketed set (hence the set operators [ ] around $punctuation) with the empty string, effectively eliding the characters. The new string is returned and assigned to $number.
- The assignment to $valid defines the pattern for a valid U.S. telephone number.
- The if statement performs the match, comparing the now digits-only phone number to the pattern. The function preg_match() returns 1 if there is a match. If no match is found, preg_match() returns 0. If an error occurred during processing, the function returns false. Thus, to check for success, see if the return value is 1. Otherwise, check the result of preg_last_error() (if you use PHP V5.2.0 or later). If not zero, you may have exceeded a computing limit, such as how deeply a regex can recurse. You can find a discussion of the constants and limits used with PHP regexes on the PCRE Regular Expression Functions page of the PHP manual.
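The steps above can be gathered into a single function that also distinguishes "no match" from "the engine failed." This is a sketch under stated assumptions: isValidUsPhone() is a hypothetical name, whitespace is stripped along with the punctuation, and the pattern is anchored so that extra digits are rejected.

```php
<?php
// isValidUsPhone() is a hypothetical helper, not part of PHP or of the listings.
function isValidUsPhone(string $raw): bool
{
    // Escape the separators for use inside a set, then strip them and any whitespace.
    $separators = preg_quote('().-', '/');
    $digits = preg_replace("/[\\s$separators]/", '', $raw);
    if ($digits === null) {
        // preg_replace() returns null when the engine reports an error.
        throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
    }

    // Area code and prefix may not start with 0 or 1; exactly 10 digits overall.
    $result = preg_match('/^[2-9][0-9]{2}[2-9][0-9]{2}[0-9]{4}$/', $digits);
    if ($result === false) {
        // false signals an engine error, which is different from 0 (no match).
        throw new RuntimeException('preg_match failed, code ' . preg_last_error());
    }

    return $result === 1;
}

foreach (['(305) 555-1212', '305.555.1212', '105-555-1212', '305-555-12125'] as $sample) {
    echo $sample, ': ', isValidUsPhone($sample) ? 'valid' : 'invalid', "\n";
}
```

The first two samples are valid; the third has an area code that starts with 1, and the fourth has one digit too many.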
Captures
There are many instances when a "Does this match?" test is all that's needed —
as in data validation. More often, though, a regex is used to prove a match
and to extract information about the match.
Returning to the example of the telephone number, if a match is made, you may want to
store the area code, prefix, and line number in separate fields of a database. Regexes can remember what's been matched with capture.
The capture operator is the parentheses, and the operator
can appear anywhere in the regex. You can also nest captures to find
subsegments of larger captures. For instance, to capture the area code,
prefix, and line number of the 10-digit telephone, you can use:
/^([2-9][0-9]{2})([2-9][0-9]{2})([0-9]{4})$/
If a match is made, the first three digits are captured in the first set of
parentheses, the next three digits in the second set, and the final four digits in the
third. A variation of the preg_match() call
retrieves the captures.
Listing 5. How preg_match() retrieves the captures
$valid = '/^([2-9][0-9]{2})([2-9][0-9]{2})([0-9]{4})$/';
if ( preg_match( $valid, $number, $matches ) == 1 ) {
    echo( "{$_REQUEST[ 'number' ]} is valid<br />" );
    echo( "Entire match: {$matches[0]}<br />" );
    echo( "Area code: {$matches[1]}<br />" );
    echo( "Prefix: {$matches[2]}<br />" );
    echo( "Number: {$matches[3]}<br />" );
}
If you provide a variable as the third argument to preg_match(), such as $matches here, it
is set to a list of capture results. The zeroth element (indexed with 0) is the entire match; the first element is the match associated
with the first set of parentheses, the second element with the second set, and so on.
Nested captures capture segments and subsegments, to virtually any depth. The trick
with nested captures is predicting where each match appears in a match array, such as
$matches. Here's the rule to follow: Count the number of
left parentheses from the beginning of the regex — the count is the index to the match array.
Listing 6 provides a (somewhat contrived) example to extract pieces of a street address.
Listing 6. Code to extract a street address
$address = "123 Main, Warsaw, NC, 29876";
$valid = '/((\d+)\s+(\w+)),\s+(\w+),\s+([A-Z]{2}),\s+(\d{5})/';
if ( preg_match( $valid, $address, $matches ) == 1 ) {
    echo( "Street: {$matches[1]}<br />" );
    echo( "Street number: {$matches[2]}<br />" );
    echo( "Street name: {$matches[3]}<br />" );
    echo( "City: {$matches[4]}<br />" );
    echo( "State: {$matches[5]}<br />" );
    echo( "Zip: {$matches[6]}<br />" );
}
Again, the entire match is found at index 0. Where is the street number found? Counting
from left, the street number is matched by \d+. The
enclosing left parenthesis is second from left; hence, $matches[2] is 123. $matches[4] holds the city name, while $matches[6] captures the ZIP code.
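The counting rule is easier to trust once you've seen the whole match array. The following sketch applies it to a date and time string; the sample string and pattern are assumptions made up for this illustration.

```php
<?php
// Illustrative input and pattern, not taken from the listings.
$stamp   = '2008-01-01 09:30';
$pattern = '/((\d{4})-(\d{2})-(\d{2})) ((\d{2}):(\d{2}))/';

// Left parentheses, counted from the start of the pattern:
//   1 = whole date, 2 = year, 3 = month, 4 = day,
//   5 = whole time, 6 = hour, 7 = minute
if (preg_match($pattern, $stamp, $matches) === 1) {
    echo "Date: {$matches[1]} (year {$matches[2]})\n";
    echo "Time: {$matches[5]} (hour {$matches[6]}, minute {$matches[7]})\n";
}
```

Index 5 holds the whole time even though it is only the second top-level group, because four left parentheses precede it.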
Power techniques
Processing text is very common, and PHP provides a few features that make large
numbers of operations easier. Here are a few shortcuts to keep in mind:
- The preg_replace() function can operate on a single string or an array of strings. If you call preg_replace() with an array of strings rather than a string, all the elements in the array are processed to make replacements. In this case, preg_replace() returns an array of modified strings.
- As with other PCRE implementations, you can refer to a subpattern match from within the replacement, allowing an operation to be self-referential. To demonstrate, consider the problem of unifying a phone number format: all the punctuation is stripped, and the area code, prefix, and line number are rejoined with dots. One solution is shown in Listing 7.
Listing 7. Replacing punctuation with dots
$punctuation = preg_quote( "().-" );
$number = preg_replace( "/[\s$punctuation]/", '', $_REQUEST[ 'number' ] );
$valid = '/^([2-9][0-9]{2})([2-9][0-9]{2})([0-9]{4})$/';
$standard = preg_replace( $valid, "\\1.\\2.\\3", $number );
// preg_replace() returns the subject unchanged when the pattern does not match
if ( strcmp( $standard, $number ) != 0 ) {
    echo( "The standard number is $standard<br />" );
}
The test against the pattern and, if the pattern matches, the transformation into a
standard phone number occur in one step.
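Both shortcuts can be combined: pass an array as the subject and use subpattern references in the replacement. In this sketch the sample numbers are assumptions, and the replacement uses the $1 form of reference, which PHP accepts alongside the \\1 form.

```php
<?php
// Assumed sample data: two well-formed digit strings and one that is too short.
$numbers = ['3055551212', '9195550100', '12345'];
$valid   = '/^([2-9][0-9]{2})([2-9][0-9]{2})([0-9]{4})$/';

// With an array subject, preg_replace() returns an array of results.
// Strings that do not match the pattern are returned unchanged.
$standard = preg_replace($valid, '$1.$2.$3', $numbers);

print_r($standard);
```

Because the unmatched string passes through untouched, comparing each result with its input tells you which numbers were valid.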
Express yourself
PHP applications manage increasingly large amounts of data. Whether you need to
validate form input or decompose content, regular expressions can do the trick.
About the author
Martin Streicher is chief technology officer for McClatchy Interactive, editor-in-chief of Linux Magazine, a Web developer, and a regular contributor to developerWorks. He earned a master's degree in computer science from Purdue University and has been programming UNIX-like systems since 1986.