Parse text strings for patterns
Regular expressions are ways to match patterns against text -- similar to how a compiler works to generate class files. A compiler looks for various patterns in the source to convert the source code expressions into bytecodes. By recognizing these source patterns, the compiler is able to translate only valid representations of source into compiled class files.
In the context of regular expressions, patterns are text representations of sequences of characters. For instance, if you wanted to know if the word car existed within a character sequence, you would use the pattern car because that is how you represent the exact string. For a more complicated pattern, you can use special characters as placeholders. If instead of searching for car, you wanted to search for any string of text that began with the letter c and ended with the letter r, you would use the c.*r pattern, where the period (.) stands for any character and the asterisk (*) means the preceding item can occur zero or more times. The c.*r pattern would match any string of characters that begins with c and ends with r, as in cougar, cavalier, or chrysler. (Written without the period, c*r means something different: zero or more c characters followed by a single r.)
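As a minimal sketch of the two cases above, the following class checks for the literal car and for "starts with c, ends with r". In java.util.regex syntax, "any number of characters" is written as .* (a period followed by an asterisk), so the sketch uses c.*r. The class and method names are hypothetical and chosen only for illustration.

```java
import java.util.regex.Pattern;

class LiteralAndWildcard {
    // True when the whole input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True when the regular expression matches somewhere inside the input.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // A literal pattern matches itself anywhere in the text.
        System.out.println(containsMatch("car", "a red car door")); // true
        // c, then any characters, then r.
        System.out.println(matchesWhole("c.*r", "cougar"));   // true
        System.out.println(matchesWhole("c.*r", "chrysler")); // true
        System.out.println(matchesWhole("c.*r", "cart"));     // false
        // Without the period, * applies to the letter c only.
        System.out.println(matchesWhole("c*r", "cougar"));    // false
    }
}
```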
How to specify pattern expressions
The main part of pattern matching is coming up with the expression to use. This expression is compiled into a Pattern object, which then creates a Matcher object to check for matches in the context of a character sequence. For instance, if you want to validate an e-mail address, you might check whether the user input matches the pattern of a sequence of alphanumeric characters, followed by the @ symbol, then followed by two sets of characters separated by a period. This could be represented by the expression \p{Alnum}+@\w+\.\p{Alpha}{2,3}. (Yes, this does oversimplify an e-mail address structure and probably would reject certain valid e-mail addresses, but as an example it's sufficient.)
Before we look at the specifics of the pattern language, let's look at \p{Alnum}+@\w+\.\p{Alpha}{2,3} in detail. The \p{Alnum} sequence means a single alphanumeric character (A through Z, a through z, or 0 through 9). The plus sign (+) after \p{Alnum} is called a quantifier. It is applied to the prior part of the expression and means that \p{Alnum} must be present one or more times. Use an asterisk (*) for zero or more times. The @ is just that: a literal @, which must appear after at least one alphanumeric character for the whole pattern to succeed. The \w+ is similar to the \p{Alnum}+, but also accepts an underscore ( _ ). As these two show, some sets of characters can be written in more than one way. The backslash-period sequence ( \. ) means a literal period. Without the preceding backslash, the period alone means any character. The final \p{Alpha}{2,3} means two or three alphabetic characters.
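To see the pieces of this expression at work, here is a small sketch that applies it to a few sample strings. The class name, method name, and sample addresses are assumptions made for illustration; the expression itself is the one discussed above. Note how an address with a period before the @ is rejected, which is the kind of oversimplification mentioned earlier.

```java
import java.util.regex.Pattern;

class EmailCheck {
    // The simplified e-mail expression from the text, escaped for a Java string.
    static final Pattern EMAIL =
        Pattern.compile("\\p{Alnum}+@\\w+\\.\\p{Alpha}{2,3}");

    // True only when the entire input fits the expression.
    static boolean looksLikeEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("user@example.com"));       // true
        System.out.println(looksLikeEmail("user@example.info"));      // false: four letters at the end
        System.out.println(looksLikeEmail("first.last@example.com")); // false: period before the @
        System.out.println(looksLikeEmail("@example.com"));           // false: nothing before the @
    }
}
```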
The whole trick of working with patterns is to learn the specification language. Let's look at some of the classes of more commonly used expressions:
- Literals: Any character that doesn't have special meaning within an expression is considered a literal and matches itself.
- Quantifiers: Certain characters or expressions are used to count the number of times a literal or grouping can be present in a character sequence for the sequence to match an expression. Groupings are specified by a group of characters within parentheses.
- ? means once or not at all
- * means zero or more times
- + means one or more times
- Character classes: A character class is a set of characters within square brackets where a match would be any one character within the brackets. You can combine character classes with quantifiers, for example, [acegikmoqsuwy]* would be any sequence of characters that include only the odd letters of the alphabet. Certain character classes are predefined:
- \d -- A digit (from 0 to 9)
- \D -- A non-digit
- \s -- A white-space character, like tab or new line
- \S -- A non white-space character
- \w -- A word character (a through z, A through Z, 0 through 9, and underscore)
- \W -- A non-word character (everything else)
- POSIX character classes: Certain character classes are valid for only US-ASCII comparison purposes. For instance:
- \p{Lower} -- Lowercase characters
- \p{Upper} -- Uppercase characters
- \p{ASCII} -- All ASCII characters
- \p{Alpha} -- An alphabetic character (combining \p{Lower} with \p{Upper})
- \p{Digit} -- A number from 0 to 9
- \p{Alnum} -- Alphanumeric characters
- Range: Use a dash to specify a character class for an inclusive range. For instance, [A-J] means the uppercase letters from A through J.
- Negation: The caret symbol ( ^ ) negates the contents of a character class. For instance, [^A-J] means any character but A through J.
See the Pattern API documentation (available from Resources) for additional details on the sequences.
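The categories listed above can be tried out with a few one-line checks. The following is a rough sketch, not an exhaustive tour; the class name, the helper method m(), and the sample inputs are all made up for illustration. Each call asks whether the whole input matches the expression.

```java
import java.util.regex.Pattern;

class PatternSamples {
    // Shorthand: does the whole input match the expression?
    static boolean m(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Quantifiers
        System.out.println(m("colou?r", "color"));   // true: u is optional
        System.out.println(m("ab*", "a"));           // true: zero b characters
        System.out.println(m("ab+", "a"));           // false: at least one b required
        System.out.println(m("(ab)+", "ababab"));    // true: quantifier on a grouping

        // Character classes, custom and predefined
        System.out.println(m("[acegikmoqsuwy]*", "ace")); // true
        System.out.println(m("[acegikmoqsuwy]*", "bad")); // false
        System.out.println(m("\\d+", "2002"));            // true
        System.out.println(m("\\w+", "snake_case"));      // true

        // POSIX classes, ranges, and negation
        System.out.println(m("\\p{Upper}\\p{Lower}+", "Merlin")); // true
        System.out.println(m("[A-J]", "C"));                      // true
        System.out.println(m("[^A-J]", "C"));                     // false
        System.out.println(m("[^A-J]", "K"));                     // true
    }
}
```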
How to use patterns effectively
Now that you've learned how to specify patterns, let's use them. You need to ask the Pattern class to compile them, as shown below. Notice that the backslash character ( \ ) needs to be escaped in the String constant.
Pattern pattern = Pattern.compile(
"\\p{Alnum}+@\\w+\\.\\p{Alpha}{2,3}");
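The escaping rule can be confusing at first, so here is a small sketch of what the regular expression engine actually receives. Each doubled backslash in the Java source becomes a single backslash in the compiled expression. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Returns the expression text as the regex engine sees it.
    static String compiledText() {
        Pattern pattern = Pattern.compile(
            "\\p{Alnum}+@\\w+\\.\\p{Alpha}{2,3}");
        return pattern.pattern();
    }

    public static void main(String[] args) {
        // Prints the expression with single backslashes.
        System.out.println(compiledText());
        // An unescaped period matches any character; an escaped one does not.
        System.out.println(Pattern.matches("a.c", "abc"));   // true
        System.out.println(Pattern.matches("a\\.c", "abc")); // false
        System.out.println(Pattern.matches("a\\.c", "a.c")); // true
    }
}
```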
After you have a compiled pattern, you can use the Pattern class to split an input line into a series of words based upon the pattern, or use the Matcher class to do some more complicated tasks. Here's how to split a character sequence of input, where the pattern used specifies the separators, not the words:
String words[] = pattern.split(input);
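Because the pattern describes the separators, a splitting pattern usually looks quite different from a validating one. The sketch below assumes, purely for illustration, that words are separated by runs of commas and white space; the class name, method name, and sample input are not from the original.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // One or more commas or white-space characters act as a separator.
    static final Pattern SEPARATOR = Pattern.compile("[,\\s]+");

    static String[] words(String input) {
        return SEPARATOR.split(input);
    }

    public static void main(String[] args) {
        String[] words = words("alpha, beta  gamma,delta");
        System.out.println(Arrays.toString(words));
        // prints [alpha, beta, gamma, delta]
    }
}
```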
If you want to match a pattern multiple times within a character sequence, the above code snippets are a good place to start. But if you want to fetch specific input, you'll need the matcher() method of Pattern. When given some input, this method returns an appropriate Matcher instance. You then use the Matcher instance to look through the results to find the different matches for the pattern in the input sequence, or better yet, use the Matcher instance as a search-and-replace tool:
Matcher matcher = pattern.matcher(input);
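As one possible illustration of the search-and-replace use, the Matcher class offers replaceAll() and replaceFirst(). The sketch below assumes a simple pattern of one or more digits and a made-up input string; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Replace every run of digits with the given text.
    static String maskAll(String input, String replacement) {
        Matcher matcher = DIGITS.matcher(input);
        return matcher.replaceAll(replacement);
    }

    // Replace only the first run of digits.
    static String maskFirst(String input, String replacement) {
        Matcher matcher = DIGITS.matcher(input);
        return matcher.replaceFirst(replacement);
    }

    public static void main(String[] args) {
        System.out.println(maskAll("Room 101, floor 3", "#"));   // Room #, floor #
        System.out.println(maskFirst("Room 101, floor 3", "#")); // Room #, floor 3
    }
}
```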
To match the pattern against the whole sequence, use matches(). To see if just a part of the sequence matches, use find():
if (matcher.find()) {
  // Found some string within input sequence
  // That matched the compiled pattern
  String match = matcher.group();
  // Process matching pattern
}
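Calling find() repeatedly walks through every match in the input, while matches() succeeds only if the entire input fits the pattern. The following sketch collects all e-mail-like strings from a sentence using the simplified expression from earlier; the class name, method names, and sample text are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    static final Pattern EMAIL =
        Pattern.compile("\\p{Alnum}+@\\w+\\.\\p{Alpha}{2,3}");

    // Collect every substring of the input that matches the pattern.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    // True only when the whole input is a single match.
    static boolean matchesWhole(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        String text = "Contact alice@example.com or bob@example.org today";
        System.out.println(findAll(text));
        // prints [alice@example.com, bob@example.org]
        System.out.println(matchesWhole(text));                // false
        System.out.println(matchesWhole("alice@example.com")); // true
    }
}
```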
These two classes -- Pattern and Matcher -- are the whole pattern-matching library. Coming up with the right regular expression and then working with the results of the Matcher class is really all there is to the library. Until a dedicated book on regular expressions comes out for the Java language, find a good book on Perl to learn more about the specific patterns. Listing 1 provides a complete example by looking for the longest word in a particular file passed in from the command line as input.
Listing 1. Longest word example
import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;
import java.util.*;
import java.util.regex.*;

public class Longest {
  public static void main(String args[]) {
    if (args.length != 1) {
      System.err.println("Provide a filename");
      return;
    }
    try {
      // Map File from filename to byte buffer
      FileInputStream input =
        new FileInputStream(args[0]);
      FileChannel channel = input.getChannel();
      int fileLength = (int)channel.size();
      MappedByteBuffer buffer = channel.map(
        FileChannel.MapMode.READ_ONLY, 0, fileLength);
      // Convert to character buffer
      Charset charset = Charset.forName("ISO-8859-1");
      CharsetDecoder decoder = charset.newDecoder();
      CharBuffer charBuffer = decoder.decode(buffer);
      // Create line pattern
      Pattern linePattern =
        Pattern.compile(".*$", Pattern.MULTILINE);
      // Create word pattern: punctuation or white space separates words
      Pattern wordBreakPattern =
        Pattern.compile("[\\p{Punct}\\s]");
      // Match line pattern to buffer
      Matcher lineMatcher =
        linePattern.matcher(charBuffer);
      // Holder for longest word
      String longest = "";
      // For each line
      while (lineMatcher.find()) {
        // Get line
        String line = lineMatcher.group();
        // Get array of words on line
        String words[] = wordBreakPattern.split(line);
        // Look for longest word
        for (int i=0, n=words.length; i<n; i++) {
          if (words[i].length() > longest.length()) {
            longest = words[i];
          }
        }
      }
      // Report
      System.out.println("Longest word: " + longest);
      // Close
      input.close();
    } catch (IOException e) {
      System.err.println("Error processing");
    }
  }
}
Resources
- Read the API documentation for the java.util.regex package.
- Read the developerWorks Linux zone column, Cultured Perl, which may provide you with insight into regular expressions using the Java language.
- Read the complete collection of Merlin tips by John Zukowski.
- Find more Java technology resources on the developerWorks Java technology zone.

John Zukowski conducts strategic Java consulting with JZ Ventures, Inc. and serves as the resident guru for a number of jGuru's community-driven Java FAQs. His latest books are Learn Java with JBuilder 6 from Apress and Mastering Java 2: J2SE 1.4 from Sybex. Reach him at jaz@zukowski.net.