Mastering regular expressions in PHP, Part 1: Perl may be regex king, but PHP can slice and dice input quickly, too

Pattern matching is such a common chore for software that a special shorthand — regular expressions — has evolved to make light work of the task. Learn how to use this shorthand in your code here in Part 1 of this "Mastering regular expressions in PHP" series.

Martin Streicher (martin.streicher@gmail.com), Editor in Chief, McClatchy Interactive

Martin Streicher is chief technology officer for McClatchy Interactive, editor-in-chief of Linux Magazine, a Web developer, and a regular contributor to developerWorks. He earned a master's degree in computer science from Purdue University and has been programming UNIX-like systems since 1986.



01 January 2008


All machines consume input, perform some sort of work, and yield output. A telephone, for example, converts sound energy to an electrical signal and back again to audio to enable conversation. An engine imbibes fuel (steam, fission, petrol, or elbow grease) and transforms it into work. And a blender devours rum, ice, lime, and curacao, and stirs vigorously to produce a Mai Tai. (Or, if you prefer something more metropolitan, try some champagne and pear nectar to enjoy a Bellini. The blender is truly a flexible and remarkable machine.)

Because software transforms data, each application is also a machine — albeit a "virtual" one, given the absence of physical parts. A compiler, for instance, expects source code as input and transmutes it to binary code suitable for execution. A weather modeler yields predictions based on historical measurements. And an image editor consumes and emits pixels, applying rules to each pixel or groups of pixels to, say, sharpen or stylize an image.

Just like any other machine, a software application expects certain raw materials, such as a list of numbers, data encapsulated in an XML schema, or a protocol. If a program is fed the wrong material — divergent in type or form — the result is likely to be unpredictable, even catastrophic. As the adage says, "Garbage in, garbage out."

Virtually all nontrivial problems require you to filter good data from bad, reject bad data to prevent errant output, or both. This is certainly the case for a PHP Web application. Whether input comes from a manual form or a programmatic Asynchronous JavaScript + XML (Ajax) request, the program must vet the incoming information before any computation can occur. A numeric value may have to lie within a certain range or be restricted to a whole number. A value may need to match a specific format, such as a postal delivery code. For example, a U.S. ZIP code is five digits plus an optional "Plus 4" qualifier composed of a hyphen and four additional digits. Other strings may have to be a certain number of characters, too, such as two letters for a U.S. state abbreviation. Strings are particularly nefarious: A PHP application must remain vigilant against a malicious actor embedding SQL queries, JavaScript code, or other code capable of altering the behavior of the application or circumventing security.

But how does a program tell whether input is numeric or conforms to a convention, such as a postal code? Fundamentally, performing a match requires a small parser — craft a state machine, read input, process tokens, monitor state, and yield a result. However, even a simple parser can be painful to create and maintain.

Luckily, pattern-matching analyses are so commonly required in computing that a special shorthand and, yes, engine have evolved over time (since the dawn of UNIX® or so) to make light work of the chore. A regular expression (regex) describes patterns in a concise, readable notation. Given a regex and a datum, a regex engine yields whether the datum matches a pattern and, if a match was found, what matched.

Here's a brief example of applying a regex, drawn from the UNIX command-line utility grep, which searches for a specified pattern in the content of one or more text files. The command grep -i -E '^Bat' searches for the sequence beginning-of-line (indicated with the caret, ^), followed immediately by the letters b, a, and t in upper- or lowercase (the -i option ignores case in pattern matches, so B and b are equivalent, for instance). Hence, given the file heroes.txt:

Listing 1. heroes.txt
Catwoman
Batman
The Tick
Black Cat
Batgirl
Danger Girl
Wonder Woman
Luke Cage
The Punisher
Ant Man
Dead Girl
Aquaman
SCUD
Blackbolt
Martian Manhunter

The aforementioned grep command would yield two matches:

Batman
Batgirl
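
The same search can be sketched in PHP, the language used in the rest of this article. This is only an illustration: it assumes the names from Listing 1 are already held in an array (the variable names are invented for the example) rather than read from heroes.txt.

```php
<?php
// Hypothetical in-memory copy of heroes.txt (Listing 1).
$heroes = [
    'Catwoman', 'Batman', 'The Tick', 'Black Cat', 'Batgirl',
    'Danger Girl', 'Wonder Woman', 'Luke Cage', 'The Punisher',
    'Ant Man', 'Dead Girl', 'Aquaman', 'SCUD', 'Blackbolt',
    'Martian Manhunter',
];

// ^Bat anchors the match at the start of each string; the trailing i
// modifier ignores case, much like grep's -i option.
$found = preg_grep('/^Bat/i', $heroes);

// preg_grep() keeps the original array keys, so reindex for display.
$found = array_values($found);

echo implode("\n", $found), "\n";
// prints:
// Batman
// Batgirl
```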

Regular expressions

PHP offers two regex programming interfaces, one for Portable Operating System Interface (POSIX) and another for Perl Compatible Regular Expressions (PCRE). By and large, the latter interface is preferred, because PCRE is much more powerful than the POSIX implementation, offering all the operators found in Perl. Read the PHP documentation to learn more about the POSIX regex function calls (see Resources). Here, I focus on the PCRE features.

A PHP PCRE regex contains operators to match against specific characters and other operators; against a specific location, such as the start or end of a string; or against the beginning or end of a word. A regex can also describe alternates, which you might describe as "this" or "that"; fixed-, variable-, or indefinite-length repetition; sets of characters (for example, "any of the letters from a to m"); and classes, or kinds of characters (printable characters or punctuation), among other techniques. Special operators in regexes also permit grouping — a way to apply an operator to other operators en masse.

Table 1 shows some common regex operators. You can concatenate and combine these primitives (and other operators) to build (very) complex regexes.

Table 1. Common regex operators
Operator         Purpose
. (period)       Match any single character (by default, any character except a newline)
^ (caret)        Match the empty string that occurs at the beginning of a line or string
$ (dollar sign)  Match the empty string that occurs at the end of a line or string
A                Match an uppercase letter A
a                Match a lowercase letter a
\d               Match any single digit
\D               Match any single nondigit character
\w               Match any single "word" character: a letter, a digit, or the underscore; the POSIX class [[:alnum:]] is similar but excludes the underscore
[A-E]            Match any of uppercase A, B, C, D, or E
[^A-E]           Match any character except uppercase A, B, C, D, or E
X?               Match none or one capital letter X
X*               Match zero or more capital Xes
X+               Match one or more capital Xes
X{n}             Match exactly n capital Xes
X{n,m}           Match at least n and no more than m capital Xes; if you omit m (X{n,}), the expression matches at least n Xes
(abc|def)+       Match a sequence of one or more occurrences of abc or def; abc, def, and abcdef would all match
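
To make a few of the operators in Table 1 concrete, here is a small sketch that runs several patterns through preg_match(). The sample subjects and the $checks array are invented for illustration; each pattern is anchored with ^ and $ so that the whole subject must match.

```php
<?php
// Each entry: [pattern, subject, expected outcome of the match attempt].
$checks = [
    ['/^\d{5}$/',      '27513',  true],   // exactly five digits
    ['/^\d{5}$/',      '2751',   false],  // only four digits
    ['/^[A-E]+$/',     'CAB',    true],   // only letters A through E
    ['/^[^A-E]+$/',    'CAB',    false],  // the negated set rejects them
    ['/^X{2,3}$/',     'XXXX',   false],  // four Xes exceed the maximum
    ['/^(abc|def)+$/', 'abcdef', true],   // abc followed by def
    ['/^\w+$/',        'a_1',    true],   // \w includes the underscore
];

$results = [];
foreach ($checks as [$pattern, $subject, $expected]) {
    // preg_match() returns 1 when the pattern matches the subject.
    $matched   = preg_match($pattern, $subject) === 1;
    $results[] = $matched;
    printf("%-16s %-8s %s\n", $pattern, $subject, $matched ? 'match' : 'no match');
}
```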

Here's an example of a common use of a regex. Say your Web site requires each user to create a login. Each user name must be at least three, but not more than 10, characters long; must begin with a letter; and may otherwise contain only letters, digits, and underscores. To enforce these specifications, you could use the following regex to validate the user name when it's submitted to your application: ^[A-Za-z][A-Za-z0-9_]{2,9}$.

The caret matches the beginning of the string. The first set, [A-Za-z], represents any letter. The second set, [A-Za-z0-9_]{2,9}, represents a series of at least two and up to nine of any letter, any digit, and the underscore. And the dollar sign ($) matches the end of the string.

At first glance, the dollar sign may seem unnecessary, but it's critical. If you omit it, your regex would match any string that begins with a letter, continues with two to nine letters, digits, or underscores, and then trails off into any number of any other characters. In other words, without the dollar sign to anchor the end of the string, a very long string with a matching prefix, such as "martin1234-cruft," would yield a false positive.
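
The effect of the end anchor can be checked with a short sketch (the variable names here are invented for illustration). It also shows one subtlety worth knowing: by default, PCRE lets $ match just before a final newline, so a stricter check can use \z, which matches only at the very end of the subject.

```php
<?php
// The pattern from the text, anchored at both ends.
$anchored   = '/^[A-Za-z][A-Za-z0-9_]{2,9}$/';
// The same pattern without the trailing dollar sign.
$unanchored = '/^[A-Za-z][A-Za-z0-9_]{2,9}/';
// A stricter variant: \z matches only at the absolute end of the string.
$strict     = '/^[A-Za-z][A-Za-z0-9_]{2,9}\z/';

$name = 'martin1234-cruft';

var_dump(preg_match($anchored, $name));    // int(0): the hyphen breaks the match
var_dump(preg_match($unanchored, $name));  // int(1): the prefix alone matches

// A trailing newline slips past $ but not past \z.
var_dump(preg_match($anchored, "martin\n"));  // int(1)
var_dump(preg_match($strict, "martin\n"));    // int(0)
```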


Programming PHP and regexes

PHP provides functions to find matches in text, to replace each match with other text (a la search and replace), and to find matches among the elements of a list. The functions are:

  • preg_match()
  • preg_match_all()
  • preg_replace()
  • preg_replace_callback()
  • preg_grep()
  • preg_split()
  • preg_last_error()
  • preg_quote()

To demonstrate the functions, let's write a small PHP application that searches a list of words for a specific pattern, where the words and the regex are provided by a traditional Web form, and the results are echoed to the browser using the simple print_r() function. Such a little program is useful if you want to test or refine a regex.

Listing 2 shows the PHP code. All the input is provided through a simple HTML form. (The corresponding form is not shown, and code to trap errors in the PHP code has been omitted for brevity.)

Listing 2. Compare text to a pattern
<?php
	//
	// divide the comma-separated list into individual words
	//   the third parameter, -1, places no limit on the number of pieces
	//   the fourth parameter, PREG_SPLIT_NO_EMPTY, discards empty pieces
	//
	$words = preg_split( '/,/',  $_REQUEST[ 'words' ], -1, PREG_SPLIT_NO_EMPTY );

	//
	// remove the leading and trailing spaces from each element
	//
	foreach ( $words as $key => $value ) { 
		$words[ $key ] = trim( $value ); 
	}

	//
	// find the words that match the regular expression
	//   (the pattern must not contain an unescaped slash, the delimiter)
	//
	$matches = preg_grep( "/{$_REQUEST[ 'regex' ]}/", $words );

	//
	// escape everything echoed to the browser so that the input
	// cannot inject HTML or JavaScript into the page
	//
	echo( htmlspecialchars( $_REQUEST[ 'regex' ] ) ); 
	echo( '<br /><br />' );
	
	echo( htmlspecialchars( print_r( $words, true ) ) ); 
	echo( '<br /><br />' );
	
	echo( htmlspecialchars( print_r( $matches, true ) ) );
	
	exit;
?>

First, the string of comma-separated words is divided into individual elements using the preg_split() function. This function divides the string at every point that matches the provided regex. Here, the regex is simply , (a comma, the eponymous delimiter of a comma-separated list). The leading and trailing slashes shown in the code are delimiters: they mark the start and end of the regex and are not part of the pattern itself.

The third and fourth arguments of preg_split() are optional, but each is useful. Supply a positive integer, n, for the third argument to return at most n substrings, with the last substring holding the unsplit remainder of the string; or supply -1 for no limit. If you specify a fourth argument, the flag PREG_SPLIT_NO_EMPTY, preg_split() disposes of any empty results.
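
A short sketch of how the two optional arguments change the result. The input strings are made up for the example.

```php
<?php
$input = 'apple,,banana,cherry,';

// No flags: empty strings appear wherever delimiters are adjacent or trailing.
$all = preg_split('/,/', $input);
// ['apple', '', 'banana', 'cherry', '']

// PREG_SPLIT_NO_EMPTY drops the empty pieces; -1 means "no limit".
$nonEmpty = preg_split('/,/', $input, -1, PREG_SPLIT_NO_EMPTY);
// ['apple', 'banana', 'cherry']

// A limit of 2 returns at most two pieces; the second holds the remainder.
$limited = preg_split('/,/', 'a,b,c,d', 2);
// ['a', 'b,c,d']

print_r($all);
print_r($nonEmpty);
print_r($limited);
```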

Next, each element in the list of comma-separated words is trimmed (leading and trailing whitespace is elided) through the trim() function, then compared to the supplied regex. The function, preg_grep(), makes processing a list very easy: Simply provide the pattern as the first argument and an array of words to match as the second argument. The function returns an array of matches.

For example, if you type the regex ^[A-Za-z][A-Za-z0-9_]{2,9}$ as the pattern and a list of words of varied length, you might get something like Listing 3.

Listing 3. Result of a simple regex
^[A-Za-z][A-Za-z0-9_]{2,9}$

Array ( [0] => martin [1] => 1happy [2] => hermanmunster ) 

Array ( [0] => martin )

By the way, you can invert the preg_grep() operation and find elements that don't match the pattern (the same as grep -v on the command line) with the optional flag PREG_GREP_INVERT. Replacing the preg_grep() call in Listing 2 with $matches = preg_grep( "/{$_REQUEST[ 'regex' ]}/", $words, PREG_GREP_INVERT ) and reusing the input of Listing 3 yields Array ( [1] => 1happy [2] => hermanmunster ). Note that preg_grep() preserves the keys of the input array, which is why the result starts at index 1.
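
The two modes can be compared side by side in a standalone sketch that hard-codes the words from Listing 3 instead of reading them from a form:

```php
<?php
$words   = ['martin', '1happy', 'hermanmunster'];
$pattern = '/^[A-Za-z][A-Za-z0-9_]{2,9}$/';

// Elements that match the pattern.
$matching = preg_grep($pattern, $words);

// Elements that do not match; the original keys are preserved.
$nonMatching = preg_grep($pattern, $words, PREG_GREP_INVERT);

print_r($matching);     // [0 => 'martin']
print_r($nonMatching);  // [1 => '1happy', 2 => 'hermanmunster']
```

If consecutive keys are needed afterward, array_values() reindexes the result.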


Decomposing strings

The functions preg_split() and preg_grep() are great little functions. The former can decompose a string into substrings if the substrings are separated by a predictable pattern; the latter can filter a list quickly.

But what happens if a string must be decomposed using one or more complex rules? For instance, U.S. phone numbers often appear as "(305) 555-1212," "305-555-1212," or "305.555.1212." If you remove the punctuation and spaces, all reduce to 10 digits, which are easy to recognize using the regex \d{10}. However, the three-digit area code and three-digit prefix of phone numbers in the United States cannot start with a zero or a one (because both are prefixes for nonlocal calls). Rather than split the numeric sequence into individual digits and write complex code, a regex can test for validity.

Listing 4 shows a snippet of code to perform the task.

Listing 4. Determine whether a phone number is a valid U.S. phone number
<?php   
	$punctuation = preg_quote( "().-" );
	$number = preg_replace( "/[\\s$punctuation]/", '', $_REQUEST[ 'number' ] );

	$valid = '/^[2-9][0-9]{2}[2-9][0-9]{2}[0-9]{4}$/';	
	if ( preg_match( $valid, $number ) === 1 ) {
		echo(  htmlspecialchars( $_REQUEST[ 'number' ] ) . " is valid<br />" );
	}
		
	exit;
?>

Let's step through the code:

  • As shown in Table 1, regexes use a small set of operators, such as brackets ([ ]), to name a set. If you want to match such an operator in subject text, you must "escape" the operator in the regex with a preceding backslash (\). After you escape the operator, it matches like any other literal. For instance, if you want to match a literal period, say, as found in a fully qualified host name, write \.. Optionally, you can pass a string to preg_quote() to automatically escape any regex operator it finds, as in the first statement. If you echo $punctuation after that statement, you should see \(\)\.\- (PHP V5.3 and later also escape the hyphen).
  • The preg_replace() call removes all punctuation and whitespace from the phone number. It replaces any occurrence of a whitespace character (\s) or of a character in $punctuation (hence the set operators [ ]) with the empty string, effectively eliding the characters. Without \s, the space in "(305) 555-1212" would survive and the number would fail the test. The new string is returned and assigned to $number.
  • The $valid assignment defines the pattern for a valid U.S. telephone number. The caret and dollar sign anchor the pattern so that exactly 10 digits, and nothing more, can match.
  • The preg_match() call performs the match, comparing the now digits-only phone number to the pattern. The function preg_match() returns 1 if there is a match. If no match is found, preg_match() returns 0. If an error occurred during processing, the function returns false. Thus, to check for success, see if the return value is 1. Otherwise, check the result of preg_last_error() (if you use PHP V5.2.0 or later). If it is not zero (PREG_NO_ERROR), you may have exceeded a computing limit, such as how deeply a regex can recurse. The request value is passed through htmlspecialchars() before it is echoed so that it cannot inject markup into the page. You can find a discussion of the constants and limits used with PHP regexes on the PCRE Regular Expression Functions page (see Resources).
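
The three possible return values of preg_match() can be wrapped in a small helper so that a failure is never mistaken for "no match." The function below is a hypothetical convenience for illustration, not part of PHP; it assumes you would rather get an exception than a silent false.

```php
<?php
/**
 * Hypothetical helper: returns true on a match, false on no match,
 * and throws if preg_match() itself reports a failure.
 */
function regex_matches(string $pattern, string $subject): bool
{
    $result = preg_match($pattern, $subject);

    // Use a strict comparison: false (error) and 0 (no match) are
    // different outcomes and should not be conflated.
    if ($result === false) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        throw new RuntimeException(
            'Regex failed with error code ' . preg_last_error()
        );
    }

    return $result === 1;
}

var_dump(regex_matches('/^\d{10}$/', '3055551212'));  // bool(true)
var_dump(regex_matches('/^\d{10}$/', '305555'));      // bool(false)
```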

Captures

There are many instances when a "Does this match?" test is all that's needed — as in data validation. More often, though, a regex is used to prove a match and to extract information about the match.

Returning to the example of the telephone number, if a match is made, you may want to store the area code, prefix, and line number in separate fields of a database. Regexes can remember what's been matched with capture. The capture operator is the parentheses, and the operator can appear anywhere in the regex. You can also nest captures to find subsegments of larger captures. For instance, to capture the area code, prefix, and line number of the 10-digit telephone number, you can use:

/([2-9][0-9]{2})([2-9][0-9]{2})([0-9]{4})/

If a match is made, the first three digits are captured in the first set of parentheses, the next three digits in the second set, and the final four digits in the third set. A variation of the preg_match() call retrieves the captures.

Listing 5. How preg_match() retrieves the captures
$valid = '/([2-9][0-9]{2})([2-9][0-9]{2})([0-9]{4})/';	
if ( preg_match( $valid, $number, $matches ) === 1 ) {
	echo(  htmlspecialchars( $_REQUEST[ 'number' ] ) . " is valid<br />" );
	echo(  "Entire match: {$matches[0]}<br />" );
	echo(  "Area code: {$matches[1]}<br />" );
	echo(  "Prefix: {$matches[2]}<br />" );
	echo(  "Number: {$matches[3]}<br />" );
}

If you provide a variable as the third argument to preg_match(), such as $matches here, it is set to a list of capture results. The zeroth element (indexed with 0) is the entire match; the first element is the match associated with the first set of parentheses, the second element with the second set, and so on.

Nested captures capture segments and subsegments, to virtually any depth. The trick with nested captures is predicting where each match appears in a match array, such as $matches. Here's the rule to follow: Count the number of left parentheses from the beginning of the regex — the count is the index to the match array.

Listing 6 provides a (somewhat contrived) example to extract pieces of a street address.

Listing 6. Code to extract a street address
$address = "123 Main, Warsaw, NC, 29876";

$valid = '/((\d+)\s+(\w+)),\s+(\w+),\s+([A-Z]{2}),\s+(\d{5})/';

if ( preg_match( $valid, $address, $matches ) === 1 ) {
	echo(  "Street: {$matches[1]}<br />" );
	echo(  "Street number: {$matches[2]}<br />" );
	echo(  "Street name: {$matches[3]}<br />" );
	echo(  "City: {$matches[4]}<br />" );
	echo(  "State: {$matches[5]}<br />" );
	echo(  "Zip: {$matches[6]}<br />" );
}

Again, the entire match is found at index 0. Where is the street number found? Counting from left, the street number is matched by \d+. The enclosing left parenthesis is second from left; hence, $matches[2] is 123. $matches[4] holds the city name, while $matches[6] captures the ZIP code.
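
Counting parentheses becomes error-prone as a pattern grows. One alternative, not covered in the listing above, is PCRE's named-group syntax (?<name>...), which adds a string key to the match array alongside the numeric index. The sketch below reworks the address example under that assumption; the group names are invented for illustration.

```php
<?php
$address = '123 Main, Warsaw, NC, 29876';

// Same structure as the numbered version, but each group has a name.
$pattern = '/(?<street>(?<number>\d+)\s+(?<name>\w+)),\s+'
         . '(?<city>\w+),\s+(?<state>[A-Z]{2}),\s+(?<zip>\d{5})/';

if (preg_match($pattern, $address, $m) === 1) {
    echo "Street: {$m['street']}\n";         // 123 Main
    echo "Street number: {$m['number']}\n";  // 123
    echo "City: {$m['city']}\n";             // Warsaw
    echo "Zip: {$m['zip']}\n";               // 29876

    // The numeric indexes are still present: the second left
    // parenthesis is the street number, exactly as before.
    echo "Index 2: {$m[2]}\n";               // 123
}
```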


Power techniques

Processing text is very common, and PHP provides a few features that make bulk operations on text easier. Here are a few shortcuts to keep in mind:

  • The preg_replace() function can operate on a single string or an array of strings. If you call preg_replace() with an array of strings rather than a string, all the elements in the array are processed to make replacements. In this case, preg_replace() returns an array of modified strings.
  • As with other PCRE implementations, you can refer to a subpattern match from within the replacement, allowing an operation to be self-referential. To demonstrate, consider the problem of unifying a phone number format: all the punctuation is stripped, and the three groups of digits are rejoined with dots. One solution is shown in Listing 7.
Listing 7. Replacing punctuation with dots
$punctuation = preg_quote( "().-" );
$number = preg_replace( "/[\\s$punctuation]/", '', $_REQUEST[ 'number' ] );
$valid = '/^([2-9][0-9]{2})([2-9][0-9]{2})([0-9]{4})$/';	

// if $valid does not match, preg_replace() returns $number unchanged
$standard = preg_replace( $valid, "\\1.\\2.\\3", $number ); 
if ( $standard !== $number ) {
	echo(  "The standard number is $standard<br />" );
}

The test against the pattern and the transformation into a standard phone number (if the pattern matches) occur in one step.
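
Both shortcuts can be combined: preg_replace() accepts an array of subjects and applies the same back-referencing replacement to each one. The sample numbers below are made up, and the sketch uses the $1 form of back-reference, which PHP accepts as an equivalent of \\1.

```php
<?php
// Hypothetical digits-only inputs; the last one is too short to match.
$numbers = ['3055551212', '9195550100', '12345'];

$pattern = '/^([2-9][0-9]{2})([2-9][0-9]{2})([0-9]{4})$/';

// With an array subject, preg_replace() returns an array of results.
// Elements that do not match the pattern come back unchanged.
$formatted = preg_replace($pattern, '$1.$2.$3', $numbers);

print_r($formatted);
// ['305.555.1212', '919.555.0100', '12345']
```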


Express yourself

PHP applications manage increasingly large amounts of data. Whether you need to validate form input or decompose content, regular expressions can do the trick.

Resources

Learn

Get products and technologies

  • Innovate your next open source development project with IBM trial software, available for download or on DVD.
  • Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
