Although the terms data and information are often used interchangeably, there's a big difference between the two. Data is factual. Data is a list of temperatures, a litany of recent sales, or an inventory of parts on hand. Information is insightful. Information is a prediction of the weather, a profit-and-loss statement, or a sales trend. Data is recorded as ones and zeros. Information is coalesced by synapses.
Between data and information lies the software application: the engine that transforms the former to the latter and vice versa. For example, if you buy a book online, a shopping application permutes your information — the book title, your identity, your bank account information — into data, such as an order number, a sales price, details of the credit card transaction, and an adjustment to on-hand inventory. Similarly, the shopping application remakes the data into a pick request for the warehouse and a shipping label and tracking number — information required to effectuate the sale.
Indeed, the complexity of creating an application is directly proportional to the transformations it effects. A Web-site guest book, which translates a name and address to fields in a database, is simple. Meanwhile, an online store, which translates many kinds of information into the data model of the business and converts the data to information to inspire decision-making, is quite elaborate. The art of programming is the adept manipulation of data and information, a skill akin to capturing light in chiaroscuro.
As introduced in Part 1, regexes are one of the
most powerful tools for manipulating data. Using a concise shorthand, regexes describe
the form of data and decompose it. For example, you could use the following regex to
process any temperature in Celsius or Fahrenheit: /^([+-]?[0-9]+)([CF])$/.
The regex matches the beginning of the line (represented by the caret,
^), followed by a positive sign, a negative sign, or neither
([+-]?), followed by an integer ([0-9]+), a scale qualifier — one of Celsius or Fahrenheit
([CF]) — and terminated by the end of line
(represented by the dollar sign, $).
In the temperature regex, the beginning-of-line and end-of-line operators are two examples of zero-width assertions, or matches that are positional, not literal. The parentheses aren't literal, either. Instead, embedding a pattern within parentheses captures the text that matches the pattern. Hence, if text matched the entire pattern, the first set of parentheses would yield a string representing a positive or negative integer, such as +49. The second set of parentheses would yield either the letter C or F.
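As a minimal sketch of how the two captures surface in PHP, the snippet below wraps the temperature regex in a helper. The function name parse_temperature() and the shape of the returned array are assumptions made for illustration, not part of the original article.

```php
<?php
// Hypothetical helper: split a reading such as "+49F" into its parts.
function parse_temperature(string $reading): ?array
{
    // $matches[1] receives the signed integer, $matches[2] the scale letter.
    if (preg_match('/^([+-]?[0-9]+)([CF])$/', $reading, $matches) !== 1) {
        return null; // the text did not match the entire pattern
    }

    return ['degrees' => (int) $matches[1], 'scale' => $matches[2]];
}

print_r(parse_temperature('+49F'));
var_dump(parse_temperature('49 degrees'));
```

The first call yields 49 and F; the second returns null because the anchors require the whole string to fit the pattern.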
Part 1 introduces the notion of the regex and the PHP functions available to compare text to patterns and extract matches. Now I delve more deeply into regexes and look at a handful of advanced operators and recipes.
Parentheses to the rescue (again)
Most of the time, you use a set of parentheses to define a subpattern and to capture the text that matches the subpattern. However, parentheses need not capture the subpattern. As in a complex arithmetic formula, you can use parentheses simply to group terms.
Here's an example. Can you tell what kind of data it matches?
/[-a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info)/i
This regex matches host names (albeit solely within the .com, .edu, and .info domains), as
you may have predicted. What's different is the addition of ?:. The subpattern qualifier ?: disables capture, leaving the
parentheses to clarify the precedence of operation. Here, for example, the phrase (?:\.[-a-z0-9]+)* matches zero or more instances of a
period followed by a label, such as ".ibm". Similarly, the phrase \.(?:com|edu|info)
expresses a literal period followed by any of the strings com, edu, or info.
Disabling capture may seem pointless, until you realize that capture requires extra processing. If your code processes a lot of data, omitting capture might be worthwhile. Additionally, if your regex is particularly intricate, disabling capture in certain subpatterns can make it easier to extract the subpatterns you're truly interested in.
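To make the difference concrete, here is a small sketch that runs the host name regex twice, once with ordinary parentheses and once with ?:. The sample host name and the variable names are assumptions chosen for illustration.

```php
<?php
$host = 'www.ibm.com';

// Ordinary parentheses: every subpattern adds an element to the result.
preg_match('/[-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info)/i', $host, $capturing);

// ?: groups without capturing: only the overall match is returned.
preg_match('/[-a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info)/i', $host, $grouping_only);

print_r($capturing);
print_r($grouping_only);
```

With capture enabled, the result holds the whole match plus one entry per set of parentheses; with capture disabled, it holds the whole match alone.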
Note: The i modifier at the end of the regex makes
all matches within the pattern case-insensitive. The subset a-z, therefore, matches all letters, independent of case.
PHP offers other subpattern modifiers. Using the regex test jig provided in Part 1
(repeated here in Listing 1), match the regex ((?i)edu) against candidate strings "EDU," "edu," and "Edu." If you
begin a subpattern with the modifier (?i), matching in the
subpattern is case-insensitive. Case-sensitivity is re-enabled as soon as the subpattern
ends. (Compare this to the /.../i modifier above, which applies to the entire pattern.)
Listing 1. A simple regex test utility
<?php
//
// divide the comma-separated list into individual words
// the third parameter, -1, permits a limitless number of matches
// the fourth parameter, PREG_SPLIT_NO_EMPTY, ignores empty matches
//
$words = preg_split( '/,/', $_REQUEST[ 'words' ], -1, PREG_SPLIT_NO_EMPTY );
//
// remove the leading and trailing spaces from each element
//
foreach ( $words as $key => $value ) {
  $words[ $key ] = trim( $value );
}
//
// find the words that match the regular expression
// (the pattern comes straight from the request, so use this only as a local test tool)
//
$matches = preg_grep( "/{$_REQUEST[ 'regex' ]}/", $words );
print_r( $_REQUEST['regex' ] );
echo( '<br /><br />' );
print_r( $words );
echo( '<br /><br />' );
print_r( $matches );
exit;
?>
Another useful subpattern modifier is (?x). It lets you
embed whitespace in a subpattern, making the regex easier to read. Thus, the subpattern
((?x) edu | com | info) (notice the spaces around the
alternation operators, added for legibility) is the same as (edu|com|info). You can use the global modifier /.../x to embed whitespace and
comments in the entire regex, as shown below.
Listing 2. Embed whitespace and comments
$matches = preg_grep(
  "/
    [-a-z0-9]+                 # machine name
    (?: \. [-a-z0-9]+ )*       # subdomains
    \. (?: com | edu | info )  # domain
  /xi", $words );
As you can see, you can also combine modifiers as needed. Whitespace inside a
character class is still taken literally, which is why the sets above are written without spaces. If you need to
match a literal space while using (?x), use the
metacharacter \s to match any whitespace character or \ (the backslash followed by a space) to match a single space, as in ((?x) hello \ there).
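The following sketch exercises both subpattern modifiers outside the test jig. The word list is made up, and the ^ and $ anchors are an addition (not in the patterns above) so that a longer word such as "education" is not reported as a match.

```php
<?php
$words = ['EDU', 'edu', 'Edu', 'education'];

// (?i) makes only the enclosing subpattern case-insensitive.
$any_case = preg_grep('/^((?i)edu)$/', $words);

// Without the modifier, only the exact lowercase spelling matches.
$lower_only = preg_grep('/^(edu)$/', $words);

// Under (?x), unescaped spaces are ignored; "\ " stands for one literal space.
$spaced   = preg_match('/((?x) hello \ there)/', 'hello there');
$unspaced = preg_match('/((?x) hello \ there)/', 'hellothere');

print_r($any_case);
print_r($lower_only);
var_dump($spaced, $unspaced);
```

Note that preg_grep() preserves the keys of the input array, so the matches keep their original positions.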
The vast majority of regex use validates or decomposes input into individual tidbits stored as data in a repository or acted upon immediately by the application. Processing fields in a form, parsing XML code, and interpreting a protocol are canonical uses.
Another use of regex is formatting: normalizing data or improving its readability. Rather than use regex to find and extract text, formatting uses regex to find a position and insert text there.
Here's a useful application of formatting. Assume that a Web form submits a salary in whole dollars to your application. Because you store salary as an integer, your application must strip punctuation from the posted data before it's persisted. However, when the data is retrieved from the repository, you want to reformat it to be readable, using commas. A simple PHP call to convert the dollar amount to a number is shown below.
Listing 3. Convert dollar amounts to a number
$salary = preg_replace( "/[\$\s,]/", '', $_REQUEST[ 'salary' ] );
if ( is_numeric( $salary ) ) {
// persist the data
}
else {
// error
}
The call to the preg_replace() function replaces the dollar
sign, any whitespace, and every comma with the empty string, yielding what's supposed
to be an integer. If the call to is_numeric() validates the input, the data can be stored.
Next, let's reverse the operation to emit the number with a currency symbol and commas
to separate hundreds, thousands, and millions. You could write code to find those
units, or you can use look ahead and look behind to insert commas at the
proper position. The subpattern modifier ?<= denotes
look behind (that is, look left) of the current position. The modifier ?= stands for look ahead (look right) of the current position.
So, what's the proper position? Anyplace in the string where there's at least one digit to the left and one or more groups of three digits to the right, excluding the decimal point and the number of cents. Given that rule and the two look-around modifiers, both of which are zero-width assertions, this statement does the trick:
$pretty_print = preg_replace( '/(?<=\d)(?=(?:\d\d\d)+$)/', ',', $salary );
How does the regex work? Beginning at the start of the string and proceeding through each position, the regex asserts, "Is there at least one digit to the left and one or more groups of three digits, running to the end of the string, to the right?" If so, a comma "replaces" the zero-width assertion.
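Here is a minimal, self-contained sketch of that look-around replacement, assuming the input is already a string of digits with no decimal part. The helper name add_commas() is hypothetical, and the repeated triplet is written as a non-capturing group.

```php
<?php
// Hypothetical helper: add thousands separators to a string of digits.
function add_commas(string $digits): string
{
    // Insert a comma at every position that has a digit on its left and
    // complete groups of three digits, up to the end of the string, on its right.
    return preg_replace('/(?<=\d)(?=(?:\d\d\d)+$)/', ',', $digits);
}

echo '$' . add_commas('1234567'), "\n";
echo '$' . add_commas('999'), "\n";
```

Because both assertions are zero-width, nothing in the original string is consumed; the commas are simply inserted at the qualifying positions.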
Many complex matches can be dispensed with easily using a strategy similar to the one above. For instance, here's another use of look-ahead that readily solves a common dilemma.
Listing 4. Look-ahead example
$tab_data = preg_replace( '/
, # look for a comma
(?= # then look ahead for
(?:[^"]*$) # a string with no quotes and eol
| # -or-
(?:[^"]*"[^"]*"[^"]*)*$ # a string with balanced quotes
) #
/x', "\t", $csv_data );
This preg_replace() instruction transforms a line of
comma-separated data into a line of tab-separated data. Wisely, it doesn't replace a
comma found within a quoted string.
The regex makes an assertion at every occurrence of a comma (that's the comma at the
very beginning of the regex): "Are no quotes ahead or is an even number of quotes
ahead?" If the assertion is true, the comma may be replaced with a tab (the \t).
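To see the assertion at work, the sketch below applies the same pattern to one sample line. The sample record is invented for illustration, and the approach assumes that quotes within a field are never escaped or doubled.

```php
<?php
// Assumed sample line: the second field contains a comma inside quotes.
$csv_data = 'Maguire,"Mirror, Mirror",2003';

$tab_data = preg_replace('/
    ,                             # look for a comma
    (?=                           # then look ahead for
      (?:[^"]*$)                  # a string with no quotes and eol
      |                           # -or-
      (?:[^"]*"[^"]*"[^"]*)*$     # a string with balanced quotes
    )
    /x', "\t", $csv_data);

// Show the tabs explicitly so the result is easy to read.
echo str_replace("\t", '<TAB>', $tab_data), "\n";
```

The comma inside the quoted title has exactly one quote ahead of it, so the look-ahead fails there and the comma is left alone.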
If you don't like the look-around operators, or if you're working in a language that does not provide them, you can embed commas in a number using a traditional regex, although doing so requires many iterations to accomplish. Following is one possible solution.
Listing 5. Embedding commas
$pretty_print = preg_replace( "/[\$\s,]/", '', $_REQUEST[ 'salary' ] );
do {
$old = $pretty_print;
$pretty_print = preg_replace( "/(\d)(\d\d\d\b)/", "$1,$2", $pretty_print );
} while ( $old != $pretty_print );
Let's step through the code. First, a salary parameter is stripped of punctuation to
simulate a read of an integer from the database. Next, the loop repeats, finding
positions where a single digit, (\d), is followed by three
digits, (\d\d\d), terminated immediately at a word boundary,
designated by \b. A word boundary is another zero-width assertion and is defined to be:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between a word character and a nonword character immediately after the word character.
- Between a nonword character and a word character immediately after the nonword character.
Hence, the position between a digit and whitespace, a period, or a comma is a valid word boundary, as is the end of a string that ends in a digit.
Because of the outer loop, the regex essentially advances right to left looking for a
digit followed by three digits and a word boundary. If a match is found, a comma is
inserted between the two subpatterns. As long as preg_replace() finds a match, the loop must continue, which explains
the condition $old != $pretty_print.
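The right-to-left progression is easier to see if each pass is recorded. This sketch starts from an assumed value of 1234567 and keeps the result of every iteration in a hypothetical $passes array; it compares with !== rather than != simply to avoid PHP's loose comparison of numeric strings.

```php
<?php
$pretty_print = '1234567';   // assumed value, as if read from the database
$passes = [];

do {
    $old = $pretty_print;
    // One comma is added per pass, at the rightmost eligible position.
    $pretty_print = preg_replace('/(\d)(\d\d\d\b)/', '$1,$2', $pretty_print);
    $passes[] = $pretty_print;
} while ($old !== $pretty_print);

print_r($passes);
```

The final pass makes no change, which is what ends the loop.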
Regexes are quite powerful. Sometimes, a little too powerful. For example, consider
what happens when the regex ".*" is applied to the string
The author of "Wicked" also wrote "Mirror, Mirror." While you might expect preg_match_all() to return two matches, you may be surprised to find
just a single result: "Wicked" also wrote "Mirror, Mirror."
The cause? Unless you specify otherwise, operators such as *
(zero or more) and + (one or more) are greedy. If a
pattern can continue to match, it will, yielding the largest result possible. To keep
matches minimal, you must force certain operators to be lazy. Lazy
operations find the shortest match, then stop. To make an operator lazy, add a
question-mark suffix. Listing 6 shows an example.
Listing 6. Adding a question-mark suffix
$text = 'The author of "Wicked" also wrote "Mirror, Mirror."';
if ( preg_match_all( '/".*?"/', $text, $matches ) ) {
print_r( $matches[0] );
}
The snippet above produces:
Array ( [0] => "Wicked" [1] => "Mirror, Mirror." )
The regex ".*?" translates to "match a quote, followed by
just enough characters, followed by a quote."
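For a side-by-side view, this sketch runs the greedy and lazy forms against the same text. The third pattern, which uses a negated set instead of a lazy quantifier, is an alternative not discussed above and is included only for comparison.

```php
<?php
$text = 'The author of "Wicked" also wrote "Mirror, Mirror."';

preg_match_all('/".*"/', $text, $greedy);      // greedy: runs to the last quote
preg_match_all('/".*?"/', $text, $lazy);       // lazy: stops at the next quote
preg_match_all('/"[^"]*"/', $text, $negated);  // alternative: exclude quotes outright

print_r($greedy[0]);
print_r($lazy[0]);
print_r($negated[0]);
```

The greedy pattern returns one long match, while the other two each return the two quoted titles.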
Sometimes, however, the * operator can be too lazy. Take the
following snippet of code, for instance. What does it produce?
Listing 7. A pattern that matches zero characters
if (preg_match( "/([0-9]*)/", "-123", $matches ) ) {
print_r( $matches );
}
What did you guess? "123"? "1"? No output? In fact, the output is Array ( [0] => [1] => ), meaning a match was made, but nothing was
captured. Why? Recall that the operator * can match zero or
more times. Here, the expression [0-9]* matches zero times
against the beginning of the string and processing halts.
To fix the problem, add a zero-width assertion to anchor the match, which forces the
regex engine to continue matching; /([0-9]*)\b/ suffices.
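The sketch below compares the unanchored pattern with the word-boundary version on the same input. The third variant, which swaps * for +, is an additional option not mentioned above; it avoids the empty match by requiring at least one digit.

```php
<?php
// [0-9]* succeeds immediately by matching nothing at offset 0.
preg_match('/([0-9]*)/', '-123', $unanchored);

// \b cannot hold before the "-", so the engine moves on and matches the digits.
preg_match('/([0-9]*)\b/', '-123', $anchored);

// Alternative (an assumption, not from the text): demand one or more digits.
preg_match('/([0-9]+)/', '-123', $one_or_more);

print_r($unanchored);
print_r($anchored);
print_r($one_or_more);
```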
Regexes can solve simple or difficult text processing problems. Begin with a handful of operators and expand your vocabulary as your experience grows. To jump-start your endeavor, here are a handful of tips and tricks.
Make your regexes portable with character classes
You've seen metacharacters, such as \s, that match any
whitespace characters. In addition, many regex implementations support predefined
classes of characters that are easier to use and portable across multiple written
languages. For instance, the character class [:punct:]
stands for all punctuation characters in the current locale. You can use [[:digit:]] in place of [0-9], and [[:alpha:]] is a more portable replacement for [a-zA-Z] (use [[:alnum:]] when digits should be included as well). These classes are valid only inside a bracketed set, hence the doubled brackets. For example, you can strip all punctuation from a string by using the statement:
$clean = preg_replace( "/[[:punct:]]/", '', $string );
The character class is more succinct than spelling out all the punctuation characters. Consult the documentation for your version of PHP for a complete list of character classes.
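As a short illustration, the following sketch applies two of the predefined classes to made-up input strings. The second statement negates a class inside the set to keep only digits.

```php
<?php
$string = 'Hello, world! (Part 2)';

// Strip every punctuation character.
$clean = preg_replace('/[[:punct:]]/', '', $string);

// Keep only digits by removing anything that is not in [:digit:].
$digits = preg_replace('/[^[:digit:]]/', '', '(555) 867-5309');

echo $clean, "\n";
echo $digits, "\n";
```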
Exclude what you're not looking for
As shown in the comma-separated value (CSV) to tab-delimited data example, it is
sometimes easier and more precise to list what you don't want to match. A set
that begins with a caret (^) matches any character not
included in the set. For example, you can use the regex
/[2-9][0-9]{2}[2-9][0-9]{2}[0-9]{4}/ to validate U.S. phone numbers. Using an
exclusionary set, you could write the regex as /[^01][0-9]{2}[^01][0-9]{2}[0-9]{4}/, which states the rule (no leading 0 or 1) more plainly. Be aware that the
two are equivalent only when the input is already known to consist of digits, because [^01] also matches letters and punctuation.
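The sketch below compares the two phone number patterns on a few made-up inputs. The ^ and $ anchors are an addition so that each pattern must account for the entire string; the variable names are hypothetical.

```php
<?php
$inclusive = '/^[2-9][0-9]{2}[2-9][0-9]{2}[0-9]{4}$/';
$exclusive = '/^[^01][0-9]{2}[^01][0-9]{2}[0-9]{4}$/';

$candidates = ['2125551234', '1125551234', 'x12y551234'];

foreach ($candidates as $candidate) {
    printf(
        "%s inclusive=%d exclusive=%d\n",
        $candidate,
        preg_match($inclusive, $candidate),
        preg_match($exclusive, $candidate)
    );
}
```

The two patterns agree on the all-digit inputs, but only the exclusionary form accepts the last candidate, because a set such as [^01] is satisfied by any character other than 0 or 1.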
If your input spans multiple lines, a typical regex may not suffice because, by default,
the dot does not match a newline and ^ and $ anchor only the start and end of the entire string. However, if you use
the s or m modifier, the regex
engine treats the input differently. The former treats a string as a single line,
forcing the dot to match a newline (it typically does not). The latter treats the
string as multiple lines, where ^ and $ match the beginning and end of any line, respectively. Here's an
example: If you set $string = "Hello,\nthere";, the
statement preg_match( "/.*/s", $string, $matches) would set
$matches[0] to Hello,\nthere.
(Removing the s yields Hello, with the trailing comma and nothing after it.)
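Here is a brief sketch of both modifiers on the same two-line string. The /^\w+/ pattern used to demonstrate m is an assumption chosen for illustration; it picks out the first word of each line.

```php
<?php
$string = "Hello,\nthere";

preg_match('/.*/', $string, $default);   // the dot stops at the newline
preg_match('/.*/s', $string, $dotall);   // s lets the dot cross the newline

preg_match_all('/^\w+/', $string, $string_start);  // ^ matches at offset 0 only
preg_match_all('/^\w+/m', $string, $line_starts);  // m lets ^ match after each newline

var_dump($default[0], $dotall[0]);
print_r($string_start[0]);
print_r($line_starts[0]);
```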
With regular expressions, your imagination and ingenuity are virtually the only limits to what you can achieve.