While you might think writing Java applications that need to analyze text is a simple task, like many things, it can quickly become complicated. That was certainly my experience when I was writing code to parse HTML pages. I began by using Perl5 regular expressions (regexps) occasionally. However, for reasons I'll explain, I ended up using them frequently.
In my experience, most Java developers need to parse some kind of text.
Usually, this means they initially spend some time working with Java
string-related functions or methods like indexOf or
substring, hoping that the input format never changes.
However, if the input format changes, the code for reading the new
format becomes more sophisticated and difficult to maintain. Eventually
the code may need to support word wrapping, case sensitivity, and so on.
As the logic becomes more complex, maintenance becomes the main drawback. Because any change may produce side effects and cause other parts of the text parser to stop working, the developer spends more and more time fixing small bugs.
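As a minimal sketch of the kind of String-method parsing described above (the class and method names are hypothetical, chosen only for illustration), consider code that assumes words are always separated by a single space character:

```java
final class NaiveNameParser {
    // Returns the text before the first space character, or the whole
    // input when there is no space. Tabs and line breaks are not handled.
    static String firstName(String input) {
        int space = input.indexOf(' ');
        return space < 0 ? input : input.substring(0, space);
    }
}
```

For "John A. Smith" this returns "John", but as soon as a tab separates the first two words, the method returns "John", the tab, and "A." together. Each new input variation means another special case in the code.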
A developer with some Perl experience probably already has experience working with regular expressions, too.
If lucky (or good) enough, that developer manages to convince the rest of
the team (or at least the team leader) to use this technology.
Instead of many lines of code calling String methods, the new approach means
delegating the core of the parser logic to a regexp library.
Having accepted the suggestion of the Perl5-experienced
developer, the team must choose which regexp
implementation best suits their project. Then they need to learn how to use it.
After a short study of many alternatives found on the Internet, suppose
the team decides to use one of the better-known libraries, such as Oro,
belonging to the Jakarta project.
Next, the parser is heavily refactored or almost rewritten and ends up using Oro's classes
like Perl5Compiler, Perl5Matcher, and so on.
The consequences of this decision are clear:
- The code is heavily coupled to Jakarta Oro's classes.
- The team takes a risk by not knowing whether the non-functional requirements (like
the performance or threading models) will be satisfied.
- The team has spent time and money to learn and rewrite the code so it uses the regexp library.
If their decision turns out to be wrong and a new library is chosen, this effort won't
reduce the cost significantly, because the code will need to be rewritten again.
- Even if the library works fine, they may later decide to migrate to a brand new one (for example, the one included with JDK 1.4).
Is there any way for the team to know which implementation best suits their needs, not only in the present but also in the future? Let's try to find an answer.
Avoid being dependent on any specific implementation
The previous story is quite common in software engineering. In some cases, such situations can produce high investment and long delays. This usually happens when a decision is made without knowing all the consequences and the decision maker is unlucky or lacks the required experience.
The situation can be summarized as follows:
- You need a provider of some kind
- You don't have objective criteria to choose the best provider
- You'd like to be able to evaluate all candidates with the minimum of cost
- The decision shouldn't tie you to the chosen provider
The solution to this problem is to make your code more independent of the provider. This introduces a new layer -- one that decouples the client from the provider.
In server-side development, it's easy to find patterns or architectures that use this approach. To cite some examples:
- With J2EE, you focus on building your application rather than application server details.
- The Data Access Object (DAO) pattern hides the details and complexity of how you access the database (or the LDAP server, the XML file, and so on). It provides access to an abstract persistence layer, so your client code doesn't need to deal with where or how the data is actually stored. It's not a Gang of Four (GoF) pattern, but part of Sun's J2EE best practices.
In the fictitious development team example, they're looking for a layer that:
- Abstracts the concepts behind all regular expression
implementations. The team could then focus on learning and understanding
such concepts. What they learn would be applicable to any implementation or version.
- Supports new libraries without side effects. Based on a plug-in architecture, the actual library
that executes the regexp patterns is chosen dynamically and the adapters wouldn't be coupled.
A new library would only introduce the need for a new adapter.
- Provides a way to compare the different alternatives. A simple benchmark utility could show interesting performance measures. If they execute such a utility for each implementation, the team would get valuable information and could choose the best alternative.
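The layer described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: the interface and class names (RegexEngine, JdkRegexEngine, NameExtractor) are hypothetical and are not the API introduced later in this article. The only real library used is the JDK's java.util.regex package.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical provider-neutral contract that client code depends on.
interface RegexEngine {
    // Returns the requested group of the first match, or null if nothing matches.
    String firstGroup(String regex, String input, int group);
}

// Adapter that maps the contract onto the JDK's java.util.regex classes.
final class JdkRegexEngine implements RegexEngine {
    @Override
    public String firstGroup(String regex, String input, int group) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(group) : null;
    }
}

// Client code: it never mentions a concrete regexp library.
final class NameExtractor {
    private final RegexEngine engine;

    NameExtractor(RegexEngine engine) {
        this.engine = engine;
    }

    // Returns the first run of non-whitespace characters, or null if there is none.
    String firstWord(String text) {
        return engine.firstGroup("(\\S+)", text, 1);
    }
}
```

Supporting another library would then mean writing one more adapter class, while NameExtractor stays unchanged. The same client code could also be run against each adapter in turn to compare them.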
Any decoupling approach has at least one disadvantage: What if the client code needs specific functionality provided by only one of the implementations? You cannot use any other implementation so you end up coupling your code to it. Maybe in the future that enhancement will be included, but there's little you can do for now.
Examples of this are not as rare as you might think. In the regexp world, there are some compiler options supported by only certain implementations. If your client code needs this kind of specific functionality, then this generic layer is not enough -- at least as it has been described so far.
Should the additional layer support all non-common functionalities of each implementation and throw exceptions if one that does not support it is chosen? That could be a solution, but it doesn't support the original goal of just defining the common abstract concepts.
One GoF pattern could fit perfectly in this situation: Chain of Responsibility. It introduces another indirection in the design. With this approach, the client code sends a message or a command to a list of entities able to process such a message. The list items are organized in a chain so the message is processed in order and can be consumed before reaching the end of the chain.
In this case, the specific functionalities that are only supported by certain implementations could be modeled by special types of messages. It's up to each item in the chain either to pass the message to the next one or not, depending on whether the items know about such functionalities.
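A minimal sketch of that idea, assuming engine-specific features are requested by name, could look like the following. All types here (OptionRequest, OptionHandler, EngineOptionHandler, OptionChain) are hypothetical names introduced for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// A message asking for a feature that only some engines support.
final class OptionRequest {
    final String optionName;

    OptionRequest(String optionName) {
        this.optionName = optionName;
    }
}

// One link of the chain: returns true when it consumed the request.
interface OptionHandler {
    boolean handle(OptionRequest request);
}

// A handler that understands only a fixed set of option names.
final class EngineOptionHandler implements OptionHandler {
    final String engineName;
    private final List<String> supported;
    String lastApplied; // remembers the last option this handler accepted

    EngineOptionHandler(String engineName, String... supported) {
        this.engineName = engineName;
        this.supported = Arrays.asList(supported);
    }

    @Override
    public boolean handle(OptionRequest request) {
        if (!supported.contains(request.optionName)) {
            return false; // unknown option: let the next link try
        }
        lastApplied = request.optionName;
        return true;
    }
}

// Offers each request to the handlers in order and stops at the first taker.
final class OptionChain {
    private final List<OptionHandler> handlers;

    OptionChain(OptionHandler... handlers) {
        this.handlers = Arrays.asList(handlers);
    }

    boolean dispatch(OptionRequest request) {
        for (OptionHandler handler : handlers) {
            if (handler.handle(request)) {
                return true;
            }
        }
        return false; // no engine in the chain supports this option
    }
}
```

The client sends the request to the chain rather than to a particular engine, so it learns only whether some engine accepted the option, not which concrete library did.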
The API described here is called RegexpPlugin.
It's been designed following the approach just discussed and
supports decoupling between the regexp library and the code that uses it.
In the following examples, I'll summarize the differences between using a concrete
implementation (Jakarta Oro) and the RegexpPlugin API.
I start with a very simple regexp. Imagine that the text you have to parse is just the
name of a person. The format you receive is something like John A. Smith, and you only want to get the
first name (John). But you don't know whether the words are separated by spaces, line breaks, tabs, or some combination of these. The regexp able to process such an input format is just .*\s*(.*?)\s+.* and I'll
explain step by step how to extract the information using this regexp.
The first part is the period asterisk characters, .*, which here means anything before any number of spaces (\s*) and the group (.*?). The group is the part of interest (which is why parentheses surround it). The question mark makes the preceding asterisk non-greedy: take the shortest text that satisfies the condition.
What follows indicates any number of spaces, new lines, or tabs (\s), but at least one
(+). The final period asterisk, .*, just represents the rest of the text (which isn't of interest).
So, this regexp is equivalent to: Take the first text that precedes a space. Let's write the Java code.
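One way to check how the pieces of a pattern interact is simply to run it. The sketch below does so with the JDK's java.util.regex package rather than Jakarta Oro, and the class and constant names are hypothetical. One caveat worth knowing: in java.util.regex the leading .* is greedy, so for the sample input it consumes as much text as it can and group 1 ends up empty. The second constant drops that leading part; treating it as the intended behavior is an assumption on my side, not something the pattern above states.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstNameRegexDemo {
    // The pattern exactly as written in the text.
    static final Pattern AS_WRITTEN = Pattern.compile(".*\\s*(.*?)\\s+.*");

    // Variant without the greedy leading ".*" (an assumption about the intent).
    static final Pattern WITHOUT_LEADING_WILDCARD = Pattern.compile("\\s*(.*?)\\s+.*");

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

With java.util.regex, firstGroup(WITHOUT_LEADING_WILDCARD, "John A. Smith") yields "John", while firstGroup(AS_WRITTEN, "John A. Smith") yields an empty string. Other engines may need a similar check before the pattern is relied upon.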
To use regular expressions inside your Java code, you usually need to complete the following seven steps:
Step 1: Create a compiler instance. Using Jakarta Oro, you have to
instantiate a Perl5Compiler:
    org.apache.oro.text.regex.Perl5Compiler compiler =
        new org.apache.oro.text.regex.Perl5Compiler();
The equivalent code using the RegexpPlugin is similar:
    org.acmsl.regexpplugin.Compiler compiler =
        org.acmsl.regexpplugin.RegexpManager.createCompiler();
There's a difference, though. As previously mentioned, this API
hides which concrete implementation is actually used.
You can choose a concrete one or keep the default, Jakarta Oro.
At runtime, the RegexpPlugin API tries to create a compiler for the chosen library using
its class name. If that operation fails because the library is not available, it passes the exception back to the client of the API.
Suppose that you always use JDK 1.4's built-in regexp classes.
If so, there's no point including additional jar files that will never be used.
That's why just invoking the createCompiler()
method is not enough. You need to handle the exception
that is thrown whenever the chosen library is not present. The example, then, has to be updated:
    try
    {
        org.acmsl.regexpplugin.Compiler compiler =
            org.acmsl.regexpplugin.RegexpManager.createCompiler();
    }
    catch (org.acmsl.regexpplugin.RegexpEngineNotFoundException exception)
    {
        [..]
    }
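How a manager class might create an engine from its class name and report a missing library can be sketched as follows. This is not the RegexpPlugin implementation: EngineLoader and EngineNotFoundException are hypothetical names, and the only real API used is the JDK's reflection support.

```java
// Hypothetical checked exception signaling that the requested engine is unavailable.
final class EngineNotFoundException extends Exception {
    EngineNotFoundException(String className, Throwable cause) {
        super("Regexp engine not available: " + className, cause);
    }
}

final class EngineLoader {
    // Instantiates the named class through its no-argument constructor.
    // Any reflection or linkage failure is wrapped in EngineNotFoundException.
    static Object create(String className) throws EngineNotFoundException {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError e) {
            throw new EngineNotFoundException(className, e);
        }
    }
}
```

Because the class is looked up by name at runtime, the calling code compiles without the library's jar file on the classpath. The price is that the failure can only be detected when the call is made, which is why the caller has to be prepared for the exception.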
Step 2: Compile the regexp pattern. The regular expression itself is compiled into a
Pattern object.
    org.apache.oro.text.regex.Pattern pattern =
        compiler.compile(".*\\s*(.*?)\\s+.*", Perl5Compiler.MULTILINE_MASK);
Note that you have to escape the backslash (\) characters in the Java string literal.
This pattern object represents the compiled form of the regular expression that was defined in text format. Reuse pattern instances as much as possible. In particular, if the regexp is fixed (lacking any variable part such as "(.*?)Tom.*"), the pattern should be a static member in the class.
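The advice about reusing patterns can be illustrated with the JDK's java.util.regex package, whose compiled Pattern instances are immutable and safe to share. The class and method names below are hypothetical, and the regexps are simplified examples rather than the one used in this article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NameParser {
    // Fixed regexp: compiled once when the class is loaded, then shared.
    private static final Pattern FIRST_WORD = Pattern.compile("(\\S+)");

    // Returns the first run of non-whitespace characters, or null if there is none.
    static String firstWord(String input) {
        Matcher matcher = FIRST_WORD.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Regexp with a variable part: it has to be compiled for each distinct name.
    static boolean mentions(String input, String name) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(name) + "\\b");
        return pattern.matcher(input).find();
    }
}
```

Only the Matcher is created per call in firstWord, which is cheap compared with compiling the expression again.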
The compile method accepts flags, such as
EXTENDED_MASK. (See Resources
for a more detailed regexp tutorial.) However, RegexpPlugin doesn't allow arbitrary flags.
The only supported ones are
case sensitivity and
multiline, because all supported
libraries can handle them.
The compiler instance has specific properties to define such flags:
    compiler.setMultiline(true);
    org.acmsl.regexpplugin.Pattern pattern =
        compiler.compile(".*\\s*(.*?)\\s+.*");
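To make the idea of portable flags more concrete, here is a sketch of how an adapter for the JDK's java.util.regex package might translate two boolean properties into engine-specific flags. The class JdkCompilerSketch and the setCaseSensitive method are assumptions made for this illustration; only setMultiline appears in the code above, and the real adapter may work differently.

```java
import java.util.regex.Pattern;

// Hypothetical adapter-side compiler exposing only the two portable options.
final class JdkCompilerSketch {
    private boolean multiline;
    private boolean caseSensitive = true;

    void setMultiline(boolean multiline) {
        this.multiline = multiline;
    }

    void setCaseSensitive(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    // Translates the portable options into java.util.regex flags.
    Pattern compile(String regex) {
        int flags = 0;
        if (multiline) {
            flags |= Pattern.MULTILINE;
        }
        if (!caseSensitive) {
            flags |= Pattern.CASE_INSENSITIVE;
        }
        return Pattern.compile(regex, flags);
    }
}
```

Each adapter would perform its own translation, so the client code never sees constants such as MULTILINE_MASK or Pattern.MULTILINE.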
Step 3: Create a Matcher object. In Jakarta Oro, this step is very straightforward:
    org.apache.oro.text.regex.Perl5Matcher matcher =
        new org.apache.oro.text.regex.Perl5Matcher();
It's that simple because the matcher doesn't need any information to be constructed. It becomes specific to your regexp
later. In RegexpPlugin, the step is similar. Instead of creating the
matcher yourself, you delegate its creation to the RegexpManager class:
    org.acmsl.regexpplugin.Matcher matcher =
        org.acmsl.regexpplugin.RegexpManager.createMatcher();
The difference is that, as before, you need to deal with RegexpEngineNotFoundException.
Actually, RegexpManager needs to create a matcher adapter for
your chosen library or for the default one. If such classes are not available at runtime, it'll throw that exception.
Step 4: Evaluate the regular expression. The matcher object needs to
interpret the regular expression and extract the information needed. This is done in a single line:
    if (matcher.contains("John A. Smith", pattern))
    {
If the input text matches the regular expression, this method returns true.
The implicit side effect is that after this line the matcher object contains the first
match found in the input text. The next step shows how to actually get the information you're interested in.
Using the RegexpPlugin API, there's no difference at all at this point.
Step 5: Retrieve the first match found. This simple step is done in only one line:
    org.apache.oro.text.regex.MatchResult matchResult = matcher.getMatch();
You declare a local variable to store the object that has the piece of text that matches the regexp. In both cases, the step is the same, except for the variable declaration (since one is an adapter of the other):
    org.acmsl.regexpplugin.MatchResult matchResult =
        matcher.getMatch();
Step 6: Get the group you're interested in. You can use
either of two approaches:
- A concrete library
- The RegexpPlugin API
Since your regexp is .*\s*(.*?)\s+.*, you have only one group: (.*?)
The MatchResult object contains
all groups in an ordered list. You only need to know the position of the group that you want to get. Since this example only has one group, there's no doubt:
        String name = matchResult.group(1);
        [..]
    }
The variable name now contains
the text John, which is exactly what you wanted.
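For comparison, the same steps can be sketched with the JDK 1.4 java.util.regex package, where several roles collapse into fewer classes. The class and method names are hypothetical. The sketch also uses a variant of the pattern without the leading .* because, in java.util.regex, that greedy prefix would leave group 1 empty for this input; choosing the variant is an assumption about the intended result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JdkSteps {
    // Steps 1 and 2: Pattern.compile plays the compiler role and returns the
    // compiled pattern, kept in a static member because the regexp is fixed.
    private static final Pattern FIRST_NAME = Pattern.compile("\\s*(.*?)\\s+.*");

    // Returns the first name found in the text, or null when nothing matches.
    static String extractFirstName(String text) {
        // Step 3: the matcher is created by the pattern and bound to the input.
        Matcher matcher = FIRST_NAME.matcher(text);
        // Step 4: find() searches for the first match.
        if (matcher.find()) {
            // Steps 5 and 6: the Matcher itself holds the match and its groups.
            return matcher.group(1);
        }
        return null;
    }
}
```

Calling JdkSteps.extractFirstName("John A. Smith") returns "John". Input without any whitespace returns null, because the pattern requires at least one whitespace character after the group.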
Step 7: Repeat the process if needed. If the information you need can appear more than once, and you want to analyze all occurrences instead of just the first one, then you only have to include steps 5 and 6 in a loop that runs until the condition described in step 4 is no longer satisfied:
    while (matcher.contains("John A. Smith", pattern))
    {
        [..]
    }
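A loop over all occurrences can be sketched with java.util.regex, where the Matcher remembers its position between calls to find(). The class name and the simplified regexp are assumptions for illustration. With Jakarta Oro, to my knowledge, iteration of this kind is normally done by wrapping the text in a PatternMatcherInput and passing that object to contains, so that the matcher can continue after the previous match instead of starting from the beginning of the string each time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AllWords {
    // One group capturing a run of non-whitespace characters.
    private static final Pattern WORD = Pattern.compile("(\\S+)");

    // Collects group 1 of every match, in order of appearance.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {          // step 4, repeated
            result.add(matcher.group(1)); // steps 5 and 6 for each match
        }
        return result;
    }
}
```

The loop ends when find() returns false, that is, when no further match exists after the previous one.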
Besides writing a common abstract API, the main effort is actually to implement the adapters to some of the already existing regexp engines in the Java environment.
The following tables provide insights into how to migrate from one library to another. In some cases, the concepts are cleanly separated. In others, it's not so clear.
| Regexp concept | GNU Regexp 1.2 |
| Compiler | gnu.regexp.RE |
| Pattern | gnu.regexp.RE |
| Matcher | gnu.regexp.REMatchEnumeration, gnu.regexp.RE |
| Match result | gnu.regexp.REMatch |
| Malformed pattern exception | gnu.regexp.REException |
| Regexp concept | Jakarta Oro 2.0.6 |
| Compiler | org.apache.oro.text.regex.Perl5Compiler |
| Pattern | org.apache.oro.text.regex.Pattern |
| Matcher | org.apache.oro.text.regex.Perl5Matcher |
| Match result | org.apache.oro.text.regex.MatchResult |
| Malformed pattern exception | org.[..].regex.MalformedPatternException |
| Regexp concept | Jakarta Regexp 1.3 |
| Compiler | org.apache.regexp.RE, org.apache.regexp.RECompiler, org.apache.regexp.REProgram |
| Pattern | org.apache.regexp.REProgram, org.apache.regexp.RE |
| Matcher | org.apache.regexp.RE, org.apache.regexp.REProgram |
| Match result | org.apache.regexp.RE |
| Malformed pattern exception | org.apache.regexp.RESyntaxException |
| Regexp concept | JDK 1.4 regex package |
| Compiler | java.util.regex.Pattern |
| Pattern | java.util.regex.Pattern |
| Matcher | java.util.regex.Matcher |
| Match result | java.util.regex.Matcher |
| Malformed pattern exception | java.util.regex.PatternSyntaxException |
One of the clear uses of this API is to compare implementations by measuring performance, compatibility with Perl5 syntax, or other criteria.
The benchmarking utility developed for these tests uses an HTML parser
to process Web content, updating information about links, forms,
tables, and so on. However, what is important is that the parsing logic
is delegated to regular expressions and therefore goes through the
RegexpPlugin API.
The benchmark consists of parsing a very simple HTML page 10,000 times. The results are shown in the following table.
| Regexp library | Benchmark result (seconds) |
| Jakarta Oro 2.0.6 | 130.71 |
| Jakarta Regexp 1.2 | 23.261 |
| GNU Regexp 1.1.4 | 1,966.939 |
| JDK1.4 | 33.222 |
You can improve the performance in a real application in several ways. Most important is that, when you work with regexp libraries, you don't need to compile patterns every time. Instead, you can compile them and reuse the respective instances. However, if the regexp itself is not fixed, then the compilation process cannot be skipped.
Since the benchmark needs to switch between implementations to compare performance, compiled patterns must always be discarded to avoid interaction between libraries. However, as you can see, most of the evaluated libraries have similar response times, although a more detailed benchmark should give better insights into how each library behaves under different circumstances.
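The effect of recompiling patterns can be estimated with a very small timing harness. The sketch below is not the benchmark utility used for the table above. It is a hypothetical helper based on java.util.regex and System.nanoTime, and it ignores JIT warm-up and other measurement pitfalls, so its numbers are only a rough indication.

```java
import java.util.regex.Pattern;

final class CompileCostBenchmark {
    // Runs the task the given number of times and returns the elapsed nanoseconds.
    static long time(int iterations, Runnable task) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return System.nanoTime() - start;
    }

    // Compiles the regexp on every iteration before matching.
    static long recompileEachTime(String regex, String input, int iterations) {
        return time(iterations, () -> Pattern.compile(regex).matcher(input).find());
    }

    // Compiles the regexp once and reuses the compiled pattern.
    static long reuseCompiled(String regex, String input, int iterations) {
        Pattern pattern = Pattern.compile(regex);
        return time(iterations, () -> pattern.matcher(input).find());
    }
}
```

Comparing recompileEachTime("(\\S+)", text, 10000) with reuseCompiled("(\\S+)", text, 10000) on the same input gives a first impression of how much of the total time is spent compiling.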
Regular expression parsers are powerful. Once a team becomes comfortable, the parsing logic improves, which helps reduce maintenance. However, developers need to know about regexp syntax to understand how such code works. This article has explained how to use one of the libraries in a very simple example. Beyond that, it also described the advantages of using an additional layer to decouple client code from the regexp engine itself.
Resources
- See both the Java Regexp Plugin API and the Java HTML Parser at SourceForge.
- Read about the Jakarta ORO project and the Jakarta Regexp project at the Apache Web site.
- Explore the GNU Regexp Web site for useful information.
- Check out the Java 2 Platform, Standard Edition (SE) v1.4 Web site.
- Practice using regular expressions to parse sequences of characters in John Zukowski's article, "Magic with Merlin: Parse sequences of characters with the new regex library" (developerWorks August 2002).
- Take this Perl Regular Expression tutorial.
- Read Jan Borsodi's regexp article.
- Take a look at the original GoF pattern book.
- Learn more about the Gang of Four and Java design patterns in these tutorials: "Java design patterns 101" (developerWorks January 2002) and "Java design patterns 201: Beyond the Gang of Four" (developerWorks April 2002).
- Additionally, look at the GoF pattern catalog.
- See how Extreme Programming, or XP, speeds your software development in these articles:
- "XP Distilled: How to achieve greater success on your Java projects" (developerWorks March 2001).
- "Demystifying Extreme Programming: 'XP distilled' revisited, part 1" (developerWorks August 2002).
- "Demystifying Extreme Programming: 'XP distilled' revisited, part 2" (developerWorks September 2002).
- Visit the dW Web architecture zone for more Web architecture resources.
Jose San Leandro Armendariz is an experienced software engineer who has worked on a number of J2EE projects over the past few years. You can contact Jose at jsanleandro@yahoo.es.