All machines consume input, perform some sort of work, and yield output. A telephone, for example, converts sound energy to an electrical signal and back again to audio to enable conversation. An engine imbibes fuel (steam, fission, petrol, or elbow grease) and transforms it into work. And a blender devours rum, ice, lime, and curacao, and stirs vigorously to produce a Mai Tai. (Or, if you prefer something more metropolitan, try some champagne and pear nectar to enjoy a Bellini. The blender is truly a flexible and remarkable machine.)
Because software transforms data, each application is also a machine — albeit a "virtual" one, given the absence of physical parts. A compiler, for instance, expects source code as input and transmutes it to binary code suitable for execution. A weather modeler yields predictions based on historical measurements. And an image editor consumes and emits pixels, applying rules to each pixel or groups of pixels to, say, sharpen or stylize an image.
Just like any other machine, a software application expects certain raw materials, such as a list of numbers, data encapsulated in an XML schema, or a protocol. If a program is fed the wrong material — divergent in type or form — the result is likely to be unpredictable, even catastrophic. As the adage says, "Garbage in, garbage out."
Virtually all nontrivial problems require you to filter good data from bad, reject bad data to prevent errant output, or both. This is certainly the case for a PHP Web application. Whether input comes from a manual form or a programmatic Asynchronous JavaScript + XML (Ajax) request, the program must vet the incoming information before any computation can occur. A numeric value may have to lie within a certain range or be restricted to a whole number. A value may need to match a specific format, such as a postal delivery code. For example, a U.S. ZIP code is five digits plus an optional "Plus 4" qualifier composed of a hyphen and four additional digits. Other strings may have to be a certain number of characters, too, such as two letters for a U.S. state abbreviation. Strings are particularly nefarious: A PHP application must remain vigilant against a malicious actor embedding SQL queries, JavaScript code, or other code capable of altering the behavior of the application or circumventing security.
But how does a program tell whether input is numeric or conforms to a convention, such as a postal code? Fundamentally, performing a match requires a small parser — craft a state machine, read input, process tokens, monitor state, and yield a result. However, even a simple parser can be painful to create and maintain.
Luckily, pattern-matching analyses are so commonly required in computing that a special shorthand and, yes, engine have evolved over time (since the dawn of UNIX® or so) to make light work of the chore. A regular expression (regex) describes patterns in a concise, readable notation. Given a regex and a datum, a regex engine yields whether the datum matches a pattern and, if a match was found, what matched.
Here's a brief example of applying a regex, drawn from the UNIX command-line utility grep, which searches for a specified pattern among the content of one or more UNIX text files. The command grep -i -E '^Bat' heroes.txt searches for the sequence beginning-of-line (indicated with the caret, ^), followed immediately by the letters b, a, and t in either case (the -i option ignores case in pattern matches, so B and b are equivalent, for instance). Hence, given the file heroes.txt:
Listing 1. heroes.txt
Catwoman
Batman
The Tick
Black Cat
Batgirl
Danger Girl
Wonder Woman
Luke Cage
The Punisher
Ant Man
Dead Girl
Aquaman
SCUD
Blackbolt
Martian Manhunter
The aforementioned grep command would yield two matches:
Batman
Batgirl
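The same search can be sketched in PHP with preg_grep(), one of the PCRE functions covered later in this article. The array below simply copies Listing 1, and the variable names ($heroes, $found) are illustrative only.

```php
<?php
// Names copied from Listing 1 (heroes.txt); they could also be read with file().
$heroes = [
    'Catwoman', 'Batman', 'The Tick', 'Black Cat', 'Batgirl',
    'Danger Girl', 'Wonder Woman', 'Luke Cage', 'The Punisher',
    'Ant Man', 'Dead Girl', 'Aquaman', 'SCUD', 'Blackbolt',
    'Martian Manhunter',
];

// ^Bat anchors the match to the start of each string; the trailing i
// modifier ignores case, much like the -i option of grep.
$found = preg_grep('/^Bat/i', $heroes);

// preg_grep() keeps the original array keys, so re-index before printing.
print_r(array_values($found));
```

Because preg_grep() preserves keys, the two matches keep their original positions (1 and 4) unless you re-index them with array_values().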
PHP has offered two regex programming interfaces, one for Portable Operating System Interface (POSIX) and another for Perl Compatible Regular Expressions (PCRE). The latter interface is preferred, because PCRE is much more powerful than the POSIX implementation, offering nearly all the operators found in Perl. In addition, the POSIX functions (the ereg() family) were deprecated in PHP V5.3 and removed in PHP V7. Read the PHP documentation to learn more about the POSIX regex function calls (see Resources). Here, I focus on the PCRE features.
A PHP PCRE regex contains operators to match against specific characters and other operators; against a specific location, such as the start or end of a string; or against the beginning or end of a word. A regex can also describe alternates, which you might describe as "this" or "that"; fixed-, variable-, or indefinite-length repetition; sets of characters (for example, "any of the letters from a to m"); and classes, or kinds of characters (printable characters or punctuation), among other techniques. Special operators in regexes also permit grouping — a way to apply an operator to other operators en masse.
Table 1 shows some common regex operators. You can concatenate and combine the primitives in Table 1 (and other operators) to build (very) complex regexes.
Table 1. Common regex operators
| Operator | Purpose |
|---|---|
| . (period) | Match any single character |
| ^ (caret) | Match the empty string that occurs at the beginning of a line or string |
| $ (dollar sign) | Match the empty string that occurs at the end of a line |
| A | Match an uppercase letter A |
| a | Match a lowercase letter a |
| \d | Match any single digit |
| \D | Match any single nondigit character |
| \w | Match any single "word" character: a letter, a digit, or the underscore; the POSIX class [[:alnum:]] is similar but excludes the underscore |
| [A-E] | Match any of uppercase A, B, C, D, or E |
| [^A-E] | Match any character except uppercase A, B, C, D, or E |
| X? | Match none or one capital letter X |
| X* | Match zero or more capital Xes |
| X+ | Match one or more capital Xes |
| X{n} | Match exactly n capital Xes |
| X{n,m} | Match at least n and no more than m capital Xes; if you omit m but keep the comma (X{n,}), the expression matches n or more Xes |
| (abc\|def)+ | Match a sequence of one or more occurrences of abc or def; abc, def, and abcdef would all match |
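To see a few of these operators at work, here is a minimal sketch that tests several patterns with preg_match(). The patterns and sample strings are illustrative choices (the ZIP code and state abbreviation formats come from the description earlier in the article), and the array destructuring in the foreach loop assumes PHP V7.1 or later.

```php
<?php
// Each entry pairs a pattern built from Table 1 operators with a subject string.
$checks = [
    ['/^\d{5}(-\d{4})?$/', '27513-1234'],  // ZIP code with the optional "Plus 4"
    ['/^[A-Z]{2}$/',       'NC'],          // two-letter state abbreviation
    ['/^(abc|def)+$/',     'abcdefabc'],   // one or more of abc or def
    ['/^X{2,3}$/',         'XXXX'],        // four Xes fall outside {2,3}
    ['/^\D+$/',            'route66'],     // the subject contains digits
];

$results = [];
foreach ($checks as [$pattern, $subject]) {
    // preg_match() returns 1 on a match and 0 when nothing matches.
    $results[$subject] = preg_match($pattern, $subject);
    printf(
        "%-22s %-12s %s\n",
        $pattern,
        $subject,
        $results[$subject] === 1 ? 'match' : 'no match'
    );
}
```

The first three subjects match; the last two do not, because the anchors ^ and $ force each pattern to account for the entire string.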
Here's an example of a common use of a regex. Say your Web site requires each user to create a login. Each user name must be at least three, but not more than 10, characters long; must contain only letters, digits, and underscores; and must begin with a letter. To enforce these specifications, you could use the following regex to validate the user name when it's submitted to your application: ^[A-Za-z][A-Za-z0-9_]{2,9}$.
The caret matches the beginning of the string. The first set, [A-Za-z], represents any letter. The second set, [A-Za-z0-9_]{2,9}, represents a series of at least two and up to nine of any letter, any digit, and the underscore. And the dollar sign ($) matches the end of the string.
At first glance, the dollar sign may seem unnecessary, but it's critical. If you omit it, your regex would match any string that begins with a letter, continues with two to nine letters, digits, or underscores, and ends with any number of any other characters. In other words, without the dollar sign to anchor the end of the string, a very long string with a matching prefix, such as "martin1234-cruft", would yield a false positive.
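The effect of the trailing dollar sign can be checked with a short sketch. The two helper functions below, isValidUserName() and hasValidPrefix(), are hypothetical names introduced only for this illustration; the type declarations assume PHP V7 or later.

```php
<?php
// Hypothetical helper: true when $name satisfies the user-name rule.
function isValidUserName(string $name): bool
{
    return preg_match('/^[A-Za-z][A-Za-z0-9_]{2,9}$/', $name) === 1;
}

// The same pattern without the trailing $ accepts any string that
// merely starts with a valid user name.
function hasValidPrefix(string $name): bool
{
    return preg_match('/^[A-Za-z][A-Za-z0-9_]{2,9}/', $name) === 1;
}

var_dump(isValidUserName('martin'));            // bool(true)
var_dump(isValidUserName('martin1234-cruft'));  // bool(false)
var_dump(hasValidPrefix('martin1234-cruft'));   // bool(true)
```

One caveat: in PCRE, $ also matches just before a newline that ends the string, so a stricter check could end the pattern with \z instead.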
PHP provides functions to find matches in text, to replace each match with other text (a la search and replace), and to find matches among the elements of a list. The functions are:
- preg_match()
- preg_match_all()
- preg_replace()
- preg_replace_callback()
- preg_grep()
- preg_split()
- preg_last_error()
- preg_quote()
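Two of these functions, preg_match_all() and preg_replace_callback(), are not used in the listings that follow, so here is a brief sketch of each. The sample sentence and the masking rule are made up for the illustration.

```php
<?php
$text = 'Call 305-555-1212 or 919-555-0100 before 5 pm.';

// preg_match_all() finds every match, not just the first one.
// It returns the number of matches; $all[0] holds the matched strings.
$count = preg_match_all('/\d{3}-\d{3}-\d{4}/', $text, $all);

// preg_replace_callback() hands each match array to a function and
// substitutes whatever string the function returns.
$masked = preg_replace_callback(
    '/\d{3}-\d{3}-(\d{4})/',
    function (array $m): string {
        // $m[0] is the whole match; $m[1] is the captured last four digits.
        return 'XXX-XXX-' . $m[1];
    },
    $text
);

echo $count, "\n";
print_r($all[0]);
echo $masked, "\n";
```

A callback is useful when the replacement text has to be computed from the match rather than written as a fixed string.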
To demonstrate the functions, let's write a small PHP application that searches a list
of words for a specific pattern, where the words and the regex are
provided by a traditional Web form, and the results are echoed to the browser using the
simple print_r() function. Such a little program is useful
if you want to test or refine a regex.
Listing 2 shows the PHP code. All the input is provided through a simple HTML form. (The corresponding form is not shown, and code to trap errors in the PHP code has been omitted for brevity.)
Listing 2. Compare text to a pattern
<?php
//
// divide the comma-separated list into individual words
// the third parameter, -1, places no limit on the number of pieces
// the fourth parameter, PREG_SPLIT_NO_EMPTY, drops empty pieces
//
$words = preg_split( '/,/', $_REQUEST[ 'words' ], -1, PREG_SPLIT_NO_EMPTY );
//
// remove the leading and trailing spaces from each element
//
foreach ( $words as $key => $value ) {
    $words[ $key ] = trim( $value );
}
//
// find the words that match the regular expression
// (the pattern is used as typed, so it must not contain an unescaped slash)
//
$matches = preg_grep( "/{$_REQUEST[ 'regex' ]}/", $words );
print_r( $_REQUEST[ 'regex' ] );
echo( '<br /><br />' );
print_r( $words );
echo( '<br /><br />' );
print_r( $matches );
exit;
?>
First, the string of comma-separated words is divided into individual elements using
the preg_split() function. This function divides the string
at every point that matches the provided regex. Here, the regex is simply , (a comma, the eponymous delimiter of a
comma-separated list). The leading and trailing slash shown in the code simply indicate
the start and end of the regex.
The third and fourth arguments of preg_split() are optional, but each is useful. Supply a positive integer, n, for the third argument to return at most n pieces (the last piece holds the remainder of the string, unsplit); or supply -1 for no limit.
If you specify a fourth argument, the flag PREG_SPLIT_NO_EMPTY, preg_split() disposes of any empty results.
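The effect of the two optional arguments is easiest to see side by side. The following sketch uses a made-up input string with a doubled comma and a trailing comma; the last call is a variation (not part of Listing 2) that folds the trimming step into the delimiter pattern.

```php
<?php
$input = 'martin,,1happy, hermanmunster,';

// No limit, empty pieces kept: doubled and trailing commas yield empty strings.
$all = preg_split('/,/', $input);

// No limit, empty pieces dropped.
$nonEmpty = preg_split('/,/', $input, -1, PREG_SPLIT_NO_EMPTY);

// At most two pieces: the second piece is the unsplit remainder.
$firstTwo = preg_split('/,/', $input, 2);

// Variation: let the delimiter absorb surrounding whitespace so that
// no separate trim() pass is needed.
$trimmed = preg_split('/\s*,\s*/', $input, -1, PREG_SPLIT_NO_EMPTY);

print_r($all);
print_r($nonEmpty);
print_r($firstTwo);
print_r($trimmed);
```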
Next, each element in the list of comma-separated words is trimmed (leading and
trailing whitespace is elided) through the trim() function,
then compared to the supplied regex. The function, preg_grep(), makes processing a list very easy: Simply provide the
pattern as the first argument and an array of words to match as the second argument.
The function returns an array of matches.
For example, if you type the regex ^[A-Za-z][A-Za-z0-9_]{2,9}$ as the pattern and a list of words of
varied length, you might get something like Listing 3.
Listing 3. Result of a simple regex
^[A-Za-z][A-Za-z0-9_]{2,9}$
Array ( [0] => martin [1] => 1happy [2] => hermanmunster )
Array ( [0] => martin )
By the way, you can invert the preg_grep() operation and find elements that don't match the pattern (the same as grep -v on the command line) with the optional flag PREG_GREP_INVERT. Replacing the preg_grep() call in Listing 2 with $matches = preg_grep( "/{$_REQUEST[ 'regex' ]}/", $words, PREG_GREP_INVERT ) and reusing the input of Listing 3 yields Array ( [1] => 1happy [2] => hermanmunster ).
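A self-contained version of that comparison, with the words from Listing 3 hard-coded instead of read from a form, might look like this sketch:

```php
<?php
$words   = ['martin', '1happy', 'hermanmunster'];
$pattern = '/^[A-Za-z][A-Za-z0-9_]{2,9}$/';

// Elements that match the pattern.
$valid = preg_grep($pattern, $words);

// Elements that do NOT match the pattern.
$invalid = preg_grep($pattern, $words, PREG_GREP_INVERT);

print_r($valid);
print_r($invalid);
```

Both calls preserve the keys of the input array, which is why the inverted result starts at index 1.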
The functions preg_split() and preg_grep() are great little functions. The former can decompose a
string into substrings if the substrings are separated by a predictable pattern. The
function preg_grep() can also filter a list quickly.
But what happens if a string must be decomposed using one or more complex rules? For
instance, U.S. phone numbers often appear as "(305) 555-1212," "305-555-1212," or
"305.555.1212." If you remove the punctuation and spaces, all reduce to 10 digits, which are easy
to recognize using the regex \d{10}. However,
the three-digit area code and three-digit prefix of phone numbers in the United States
cannot start with a zero or a one (because both are prefixes for nonlocal calls).
Rather than split the numeric sequence into individual digits and write complex code, a
regex can test for validity.
Listing 4 shows a snippet of code to perform the task.
Listing 4. Determine whether a phone number is a valid U.S. phone number
<?php
$punctuation = preg_quote( "().-" );
$number = preg_replace( "/[{$punctuation}\\s]/", '', $_REQUEST[ 'number' ] );  // strip punctuation and spaces
$valid = '/^[2-9][0-9]{2}[2-9][0-9]{2}[0-9]{4}$/';  // anchored: exactly 10 digits
if ( preg_match( $valid, $number ) == 1 ) {
    echo( "{$_REQUEST[ 'number' ]} is valid<br />" );
}
exit;
?>
Let's step through the code:
- As shown in Table 1, regexes use a small set of operators, such as brackets ([ ]), to name a set. If you want to match such an operator in subject text, you must "escape" the operator in the regex with a preceding backslash (\). After you escape the operator, it matches like any other literal. For instance, if you want to match a literal period, say, as found in a fully qualified host name, write \. (a backslash followed by a period). Optionally, you can pass a string to preg_quote() to automatically escape any regex operator it finds, as in the first statement. If you echo $punctuation after that statement, you should see \(\)\.\- (PHP V5.3 and later escape the hyphen as well).
- The second statement removes all punctuation from the phone number. The preg_replace() function replaces any occurrence of a character in the set (hence, the set operators [ ]) with the empty string, effectively eliding the characters. The new string is returned and assigned to $number.
- The assignment to $valid defines the pattern for a valid U.S. telephone number.
- The if statement performs the match, comparing the now digits-only phone number to the pattern. The function preg_match() returns 1 if there is a match. If no match is found, preg_match() returns a zero. If an error occurred during processing, the function returns False. Thus, to check for success, see if the return value is 1. Otherwise, check the result of preg_last_error() (if you use PHP V5.2.0 or later). If not zero, you may have exceeded a computing limit, such as how deeply a regex can recurse. You can find a discussion of the constants and limits used with PHP regexes on the PCRE Regular Expression Functions page (see Resources).
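The three possible outcomes of preg_match() can be told apart as in the following sketch. The helper name matchStatus() is hypothetical, and the error case is provoked deliberately with an invalid UTF-8 subject and the u modifier, which is only one of several conditions that preg_last_error() can report.

```php
<?php
// Hypothetical helper that distinguishes "no match" from "engine error".
function matchStatus(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject);

    if ($result === 1) {
        return 'match';
    }
    if ($result === 0) {
        return 'no match';
    }

    // $result is false here: ask the engine what went wrong.
    $code = preg_last_error();
    if ($code === PREG_BAD_UTF8_ERROR) {
        return 'error: subject is not valid UTF-8';
    }
    return 'error: code ' . $code;
}

echo matchStatus('/^[2-9]\d{9}$/', '3055551212'), "\n";
echo matchStatus('/^[2-9]\d{9}$/', '1055551212'), "\n";
// The u modifier treats the subject as UTF-8; "\xff" is not valid UTF-8.
echo matchStatus('/./u', "\xff"), "\n";
```

Comparing with === matters here, because a loose comparison treats 0 (no match) and False (error) as equal.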
There are many instances when a "Does this match?" test is all that's needed — as in data validation. More often, though, a regex is used to prove a match and to extract information about the match.
Returning to the example of the telephone number, if a match is made, you may want to
store the area code, prefix, and line number in separate fields of a database. Regexes can remember what's been matched with capture.
The capture operator is the parentheses, and the operator
can appear anywhere in the regex. You can also nest captures to find
subsegments of larger captures. For instance, to capture the area code,
prefix, and line number of the 10-digit telephone number, you can use:
/^([2-9][0-9]{2})([2-9][0-9]{2})([0-9]{4})$/
If a match is made, the first three digits are captured in the first set of
parentheses, the next three digits in the second set, and the final four digits in the
third set. A variation of the preg_match() call
retrieves the captures.
Listing 5. How preg_match() retrieves the captures
$valid = '/^([2-9][0-9]{2})([2-9][0-9]{2})([0-9]{4})$/';
if ( preg_match( $valid, $number, $matches ) == 1 ) {
    echo( "{$_REQUEST[ 'number' ]} is valid<br />" );
    echo( "Entire match: {$matches[0]}<br />" );
    echo( "Area code: {$matches[1]}<br />" );
    echo( "Prefix: {$matches[2]}<br />" );
    echo( "Number: {$matches[3]}<br />" );
}
If you provide a variable as the third argument to preg_match(), such as $matches here, it
is set to a list of capture results. The zeroth element (indexed with 0) is the entire match; the first element is the match associated
with the first set of parentheses, the second element with the second set, and so on.
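Putting the pieces together, the capture results can be wrapped in a small function that returns the three parts ready for storage. This is only a sketch: splitPhone() is a hypothetical name, and for brevity it strips every non-digit character with \D rather than quoting a punctuation list with preg_quote(). The nullable return type assumes PHP V7.1 or later.

```php
<?php
// Hypothetical helper: returns the three parts of a U.S. number, or null.
function splitPhone(string $raw): ?array
{
    // Drop everything that is not a digit.
    $digits = preg_replace('/\D/', '', $raw);

    // Three sets of parentheses produce three captures.
    $valid = '/^([2-9]\d{2})([2-9]\d{2})(\d{4})$/';
    if (preg_match($valid, $digits, $m) !== 1) {
        return null;
    }

    // $m[0] is the whole match; $m[1] through $m[3] are the captures.
    return ['area' => $m[1], 'prefix' => $m[2], 'line' => $m[3]];
}

print_r(splitPhone('(305) 555-1212'));
var_dump(splitPhone('105-555-1212'));  // NULL: area code starts with 1
```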
Nested captures capture segments and subsegments, to virtually any depth. The trick
with nested captures is predicting where each match appears in a match array, such as
$matches. Here's the rule to follow: Count the number of
left parentheses from the beginning of the regex — the count is the index to the match array.
Listing 6 provides a (somewhat contrived) example to extract pieces of a street address.
Listing 6. Code to extract a street address
$address = "123 Main, Warsaw, NC, 29876";
$valid = '/((\d+)\s+(\w+)),\s+(\w+),\s+([A-Z]{2}),\s+(\d{5})/';
if ( preg_match( $valid, $address, $matches ) == 1 ) {
    echo( "Street: {$matches[1]}<br />" );
    echo( "Street number: {$matches[2]}<br />" );
    echo( "Street name: {$matches[3]}<br />" );
    echo( "City: {$matches[4]}<br />" );
    echo( "State: {$matches[5]}<br />" );
    echo( "Zip: {$matches[6]}<br />" );
}
Again, the entire match is found at index 0. Where is the street number found? Counting
from left, the street number is matched by \d+. The
enclosing left parenthesis is second from left; hence, $matches[2] is 123. $matches[4] holds the city name, while $matches[6] captures the ZIP code.
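If counting parentheses becomes error-prone, PCRE also lets you name a group with the (?<name>...) syntax. The article does not cover named groups, so treat the following as an optional sketch: it reuses the address and structure of Listing 6, and the group names are arbitrary choices.

```php
<?php
$address = '123 Main, Warsaw, NC, 29876';

// Same structure as Listing 6, with each group given a name.
$valid = '/(?<street>(?<number>\d+)\s+(?<name>\w+)),\s+(?<city>\w+),'
       . '\s+(?<state>[A-Z]{2}),\s+(?<zip>\d{5})/';

$parts = [];
if (preg_match($valid, $address, $m) === 1) {
    // Each named group appears under its name AND its numeric index,
    // so $m['number'] and $m[2] hold the same text.
    foreach (['street', 'number', 'name', 'city', 'state', 'zip'] as $field) {
        $parts[$field] = $m[$field];
    }
}

print_r($parts);
```

The numeric indexes still follow the left-parenthesis rule, so the two styles can be mixed.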
Processing text is very common, and PHP provides a few features that make large numbers of operations easier. Here are a few shortcuts to keep in mind:
- The preg_replace() function can operate on a single string or an array of strings. If you call preg_replace() with an array of strings rather than a string, all the elements in the array are processed to make replacements. In this case, preg_replace() returns an array of modified strings.
- As with other PCRE implementations, you can refer to a subpattern match from within the replacement, allowing an operation to be self-referential. To demonstrate, consider the problem of unifying a phone number format: all the punctuation is stripped, and the three groups of digits are rejoined with dots. One solution is shown in Listing 7.
Listing 7. Replacing punctuation with dots
$punctuation = preg_quote( "().-" );
$number = preg_replace( "/[{$punctuation}\\s]/", '', $_REQUEST[ 'number' ] );
$valid = '/^([2-9][0-9]{2})([2-9][0-9]{2})([0-9]{4})$/';
$standard = preg_replace( $valid, "\\1.\\2.\\3", $number );
// the two strings differ only if the pattern matched and the dots were inserted
if ( $standard !== $number ) {
    echo( "The standard number is $standard<br />" );
}
The test against the pattern and the transformation into a standard phone number (if the pattern matches) occur in one step.
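Both shortcuts can be combined, as in the following sketch, which standardizes a whole array of numbers in two calls. The sample numbers are made up, and the replacement string uses the $1 form of the references, which preg_replace() accepts as an alternative to \\1.

```php
<?php
$numbers = ['(305) 555-1212', '919-555-0100', '305.555.1212'];

// Step 1: strip punctuation and whitespace from every element at once.
// Passing an array as the subject returns an array of results.
$digits = preg_replace('/[().\-\s]/', '', $numbers);

// Step 2: rewrite each 10-digit string from its three captures.
$standard = preg_replace(
    '/^([2-9]\d{2})([2-9]\d{2})(\d{4})$/',
    '$1.$2.$3',
    $digits
);

print_r($standard);
```

An element that does not match the pattern in step 2 is returned unchanged, so comparing each input with its output shows which numbers were valid.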
PHP applications manage increasingly large amounts of data. Whether you need to validate form input or decompose content, regular expressions can do the trick.
Resources
Learn
- Read "How to use regular expressions in PHP."
- Check out the PCRE Regular Expression Functions at PHP.net.
- PHP.net is the central resource for PHP developers.
- Check out the "Recommended PHP reading list."
- Browse all the PHP content on developerWorks.
- Expand your PHP skills by checking out IBM developerWorks' PHP project resources.
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- Using a database with PHP? Check out the Zend Core for IBM, a seamless, out-of-the-box, easy-to-install PHP development and production environment that supports IBM DB2 V9.
- Stay current with developerWorks' Technical events and webcasts.
- Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
- Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.
Get products and technologies
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
- Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- Participate in developerWorks blogs and get involved in the developerWorks community.
- Participate in the developerWorks PHP Forum: Developing PHP applications with IBM Information Management products (DB2, IDS).