Build an abstract Java API for regular expressions

Motivations for building a generic API to use Perl5 regexp libraries

Jose San Leandro Armendariz (jsanleandro@yahoo.es), Independent Software Engineer
Jose San Leandro Armendariz is an experienced software engineer who has worked on a number of J2EE projects over the past few years. You can contact Jose at jsanleandro@yahoo.es.

Summary:  When you work with regular expressions in Java, depending on a concrete regexp library is generally not a good idea. If you use an abstract layer, you can switch between different regexp libraries, reduce the coupling between your code and a particular library, and choose which one best fits your needs. If you are thinking about using a Java regexp library in your next project, software developer Jose San Leandro Armendariz shows you how to keep your code independent of your chosen concrete library. He'll give you a close look at regexps and how they work, then provide you with a little practice.

Date:  01 Dec 2002
Level:  Intermediate


Introduction

While you might think writing Java applications that need to analyze text is a simple task, like many things, it can quickly become complicated. That was certainly my experience when I was writing code to parse HTML pages. I began by using Perl5 regular expressions (regexps) occasionally. However, for reasons I'll explain, I ended up using them frequently.

Background

In my experience, most Java developers need to parse some kind of text. Usually, this means they initially spend some time working with Java string-related functions or methods like indexOf or substring, hoping that the input format never changes. However, if the input format changes, the code for reading the new format becomes more sophisticated and difficult to maintain. Eventually the code may need to support word wrapping, case sensitivity, and so on.

As the logic becomes more complex, maintenance becomes the drawback. Because any change may produce side effects and cause other parts of the text parser to stop working, the developer needs time to fix small bugs.

A developer with some Perl experience has probably worked with regular expressions, too. If lucky (or good) enough, the developer is able to convince the rest of the team (or at least the team leader) to use this technology. Instead of lines of code full of calls to String methods, the new approach delegates the core of the parser logic to a regexp library.

By accepting the suggestion of the Perl5-experienced developer, the team must choose which regex implementation best suits their project. Then they need to learn how to use it.

After a short study of many alternatives found on the Internet, suppose the team decides to use one of the better-known libraries, such as Oro, belonging to the Jakarta project. Next, the parser is heavily refactored or almost rewritten and ends up using Oro's classes like Perl5Compiler, Perl5Matcher, and so on.

The consequences of this decision are clear:

  • The code is heavily coupled to Jakarta Oro's classes.

  • The team takes a risk by not knowing whether the non-functional requirements (like the performance or threading models) will be satisfied.

  • The team has spent time and money to learn the regexp library and rewrite the code so it uses it. If their decision turns out to be wrong and a new library is chosen, that effort won't reduce the cost significantly, because the code will need to be rewritten again.

  • Even if the library works fine, they may later decide to migrate to a brand new one (for example, the one included with JDK 1.4).

Benefits of decoupling

Is there any way for the team to know which implementation best suits their needs, not only in the present but also in the future? Let's try to find an answer.

Avoid being dependent on any specific implementation

The previous story is quite common in software engineering. In some cases, such situations lead to high costs and long delays. This usually happens when a decision is made without knowing all the consequences and the decision maker is unlucky or lacks the required experience.

The situation can be summarized as follows:

  • You need a provider of some kind
  • You don't have objective criteria to choose the best provider
  • You'd like to be able to evaluate all candidates with the minimum of cost
  • The decision shouldn't tie you to the chosen provider

The solution to this problem is to make your code more independent of the provider. This introduces a new layer -- one that decouples the client from the provider.

In server-side developments, it's easy to find patterns or architectures that use this approach. To cite some examples:

  • With J2EE, you focus on building your application rather than application server details.
  • The Data Access Object (DAO) pattern hides the details and complexity of how you access the database (or the LDAP server, the XML file, and so on). It provides access to an abstract persistence layer, so your client code doesn't have to deal with the specifics of where the data is actually stored. It's not a Gang of Four (GoF) pattern, but part of Sun's J2EE best practices.

In the fictitious development team example, they're looking for a layer that:

  • Abstracts the concepts behind all regular expression implementations. The team could then focus on learning and understanding such concepts. What they learn would be applicable to any implementation or version.

  • Supports new libraries without side effects. Based on a plug-in architecture, the actual library that executes the regexp patterns is chosen dynamically and the adapters wouldn't be coupled. A new library would only introduce the need for a new adapter.

  • Provides a way to compare the different alternatives. A simple benchmark utility could show interesting performance measures. If they execute such a utility for each implementation, the team would get valuable information and could choose the best alternative.
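A layer like the one just described could be sketched as follows. This is only an illustration written in modern Java: the names CompiledRegex, RegexEngine, and JdkRegexEngine are hypothetical and are not the RegexpPlugin API, and the only adapter shown wraps the JDK's own java.util.regex package.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical engine-neutral view of a compiled regexp. */
interface CompiledRegex {
    /** Returns the given group of the first match, or null if the text does not match. */
    String firstGroup(String text, int group);
}

/** Hypothetical engine-neutral compiler; each library would get its own adapter. */
interface RegexEngine {
    CompiledRegex compile(String regex);
}

/** Adapter for the JDK's built-in java.util.regex package. */
class JdkRegexEngine implements RegexEngine {
    @Override
    public CompiledRegex compile(String regex) {
        final Pattern pattern = Pattern.compile(regex);
        return (text, group) -> {
            Matcher matcher = pattern.matcher(text);
            return matcher.find() ? matcher.group(group) : null;
        };
    }
}

class EngineDemo {
    /** Client code sees only the interfaces, never java.util.regex. */
    static String firstWord(RegexEngine engine, String text) {
        return engine.compile("(\\w+)").firstGroup(text, 1);
    }

    public static void main(String[] args) {
        System.out.println(firstWord(new JdkRegexEngine(), "John A. Smith"));
    }
}
```

Supporting another library would then mean writing one more RegexEngine implementation, while firstWord stays untouched.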

Sounds good, but ...

Any decoupling approach has at least one disadvantage: What if the client code needs specific functionality provided by only one of the implementations? You cannot use any other implementation so you end up coupling your code to it. Maybe in the future that enhancement will be included, but there's little you can do for now.

Examples of this are not as rare as you might think. In the regexp world, there are some compiler options supported by only certain implementations. If your client code needs this kind of specific functionality, then this generic layer is not enough -- at least as it has been described so far.

Should the additional layer support all the non-common functionality of each implementation, and throw an exception when the chosen implementation does not support a requested feature? That could be a solution, but it doesn't support the original goal of just defining the common abstract concepts.

One GoF pattern could fit perfectly in this situation: Chain of Responsibility. It introduces another indirection in the design. With this approach, the client code sends a message or a command to a list of entities able to process such a message. The list items are organized in a chain so the message is processed in order and can be consumed before reaching the end of the chain.

In this case, the specific functionalities that are only supported by certain implementations could be modeled by special types of messages. It's up to each item in the chain either to pass the message to the next one or not, depending on whether the items know about such functionalities.
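To make the idea concrete, here is a minimal Chain of Responsibility sketch. Everything in it is assumed for illustration: the class names, the engine names (engineA, engineB), and the option names are made up, and the article does not say how RegexpPlugin would actually model such messages.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** One link of the chain: consumes a request it understands, otherwise passes it on. */
abstract class OptionHandler {
    private OptionHandler next;

    /** Appends the next link and returns it, so calls can be chained. */
    OptionHandler setNext(OptionHandler next) {
        this.next = next;
        return next;
    }

    /** Returns the name of the handler that consumed the option, or null if none did. */
    String handle(String option) {
        if (supports(option)) {
            return name();
        }
        return next == null ? null : next.handle(option);
    }

    abstract boolean supports(String option);

    abstract String name();
}

/** Hypothetical adapter that knows a fixed set of engine-specific options. */
class EngineOptionHandler extends OptionHandler {
    private final String name;
    private final Set<String> options;

    EngineOptionHandler(String name, String... options) {
        this.name = name;
        this.options = new HashSet<>(Arrays.asList(options));
    }

    @Override
    boolean supports(String option) {
        return options.contains(option);
    }

    @Override
    String name() {
        return name;
    }
}

class ChainDemo {
    static OptionHandler buildChain() {
        OptionHandler first = new EngineOptionHandler("engineA", "multiline");
        first.setNext(new EngineOptionHandler("engineB", "multiline", "extended"));
        return first;
    }

    public static void main(String[] args) {
        OptionHandler chain = buildChain();
        System.out.println(chain.handle("multiline")); // consumed by the first link
        System.out.println(chain.handle("extended"));  // passed on to the second link
        System.out.println(chain.handle("unknown"));   // falls off the end of the chain
    }
}
```

The client only talks to the head of the chain, so it never needs to know which adapter understands a given option.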


Defining a common API

The API described here is called RegexpPlugin. It's been designed following the approach just discussed and supports decoupling between the regexp library and the code that uses it.

RegexpPlugin

In the following examples, I'll summarize the differences between using a concrete implementation (Jakarta Oro) and the RegexpPlugin API.

I start with a very simple regexp: Imagine that the text you have to parse is just the name of a person. The format you receive is something like John A. Smith and you only want to get the first name (John). But you don't know whether the words are separated by spaces, line breaks, tabs, or some combination of these. The regexp able to process such an input format is just .*\s*(.*?)\s+.* I'll explain step by step how to extract the information using this regexp.

The first part is the period asterisk characters, .*, which here stands for any text that precedes the optional whitespace (\s*) and the group (.*?). The group is the part of interest, which is why parentheses surround it. The question mark makes the quantifier inside the group non-greedy: it takes as few characters as possible.

What follows indicates any number of spaces, new lines, or tabs (\s), but at least one (+). The final period asterisk, .*, just represents the rest of the text (which isn't of interest).

So, this regexp is equivalent to: Take the first text that precedes a space. Let's write the Java code.
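Before writing the library code, it can help to try the pattern in isolation. The sketch below uses the JDK's java.util.regex package rather than Jakarta Oro, which is an assumption made purely for convenience; FirstNameDemo and firstGroup are made-up names. Under java.util.regex, the regexp exactly as printed leaves group 1 empty for John A. Smith, because the leading greedy .* backtracks only as far as the last run of whitespace. A variant without the leading .*\s* captures John. Whether Jakarta Oro behaves identically was not verified here, so it is worth checking against the library you actually use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstNameDemo {
    /** Returns group 1 of the first match of regex in text, or null if there is no match. */
    static String firstGroup(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "John A. Smith";
        // The regexp exactly as printed in the article.
        System.out.println("[" + firstGroup(".*\\s*(.*?)\\s+.*", input) + "]");
        // A variant (not from the article) without the leading greedy part.
        System.out.println("[" + firstGroup("\\s*(.*?)\\s+.*", input) + "]");
    }
}
```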

Hands on

To use regular expressions inside your Java code, you usually need to complete the following seven steps:

Step 1: Create a compiler instance. Using Jakarta Oro, you have to instantiate a Perl5Compiler:

org.apache.oro.text.regex.Perl5Compiler compiler =
    new org.apache.oro.text.regex.Perl5Compiler();

The equivalent code using the RegexpPlugin is similar:

org.acmsl.regexpplugin.Compiler compiler =
    org.acmsl.regexpplugin.RegexpManager.createCompiler();

There's a difference, though. As previously mentioned, this API hides which concrete implementation is actually used. You can choose a concrete one or keep the default, Jakarta Oro. At runtime, the RegexpPlugin API tries to create the compiler for the chosen library using its class name. If that operation fails, for example because the library is not available, it passes the exception back to the client of the API.

Suppose that you always use JDK 1.4's built-in regexp classes. If so, there's no point including additional jar files that will never be used. That's why just invoking the createCompiler() method is not enough. You need to manage the exception that is thrown whenever the chosen library is not present. The example, then, has to be updated:

try
{
    org.acmsl.regexpplugin.Compiler compiler =
        org.acmsl.regexpplugin.RegexpManager.createCompiler();
}
catch (org.acmsl.regexpplugin.RegexpEngineNotFoundException exception)
{
    [..]
}
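How a manager class might create an engine from its class name, and report a missing library, can be sketched with plain reflection. This is not RegexpManager's actual code: EngineLoader and EngineNotFoundException are hypothetical names, java.lang.StringBuilder merely stands in for an adapter class, and com.example.MissingEngine is a made-up class assumed to be absent from the classpath.

```java
/** Hypothetical checked exception, similar in spirit to the one the article mentions. */
class EngineNotFoundException extends Exception {
    EngineNotFoundException(String className, Throwable cause) {
        super("Regexp engine not available: " + className, cause);
    }
}

class EngineLoader {
    /** Instantiates the named class reflectively, so the library is needed only at runtime. */
    static Object create(String className) throws EngineNotFoundException {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError e) {
            // Covers a missing class, a missing no-argument constructor, and failed linking.
            throw new EngineNotFoundException(className, e);
        }
    }

    public static void main(String[] args) {
        try {
            Object engine = create("java.lang.StringBuilder"); // stand-in for an adapter class
            System.out.println("Loaded " + engine.getClass().getName());
            create("com.example.MissingEngine"); // hypothetical class, not on the classpath
        } catch (EngineNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the class is looked up by name, client code compiles without the library's jar file and only fails, with a well-defined exception, when the engine is really requested.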

Step 2: Compile the regexp pattern. The regular expression itself is compiled into a Pattern object.

org.apache.oro.text.regex.Pattern pattern =
    compiler.compile(".*\\s*(.*?)\\s+.*", Perl5Compiler.MULTILINE_MASK);

Note that you have to escape the backslash (\) characters.

This pattern object represents the regular expression that is defined in text format. Reuse pattern instances as much as possible. In particular, if the regexp is fixed (lacking any variable part such as "(.*?)Tom.*"), the pattern should be a static member of the class.
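As an illustration of the static-member advice, here is how it looks with the JDK's java.util.regex package, used here instead of Oro as an assumption. NameParser and its regexp are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameParser {
    /** Compiled once when the class is loaded; Pattern instances are immutable and thread-safe. */
    private static final Pattern FIRST_WORD = Pattern.compile("^\\s*(\\S+)");

    /** Returns the first run of non-whitespace characters, or null if there is none. */
    static String firstWord(String text) {
        // Matchers are cheap to create but not thread-safe, so each call gets its own.
        Matcher matcher = FIRST_WORD.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstWord("  John A. Smith"));
    }
}
```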

The compile method is suitable for being configured with flags, like EXTENDED_MASK. (See Resources for a more detailed regexp tutorial.) However, RegexpPlugin doesn't allow arbitrary flags. The only supported ones are case sensitivity and multiline, because all supported libraries can handle them.

The compiler instance has specific properties to define such flags:

compiler.setMultiline(true);

org.acmsl.regexpplugin.Pattern pattern =
    compiler.compile(".*\\s*(.*?)\\s+.*");
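An adapter can translate such properties into whatever its engine expects. The sketch below does this for java.util.regex; PortableCompiler is a hypothetical class, and setCaseSensitive is an assumed method name (the article only shows setMultiline).

```java
import java.util.regex.Pattern;

/** Hypothetical compiler wrapper exposing only the flags that every engine supports. */
class PortableCompiler {
    private boolean multiline;
    private boolean caseSensitive = true;

    void setMultiline(boolean multiline) {
        this.multiline = multiline;
    }

    void setCaseSensitive(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    /** Translates the portable properties into java.util.regex flag bits. */
    Pattern compile(String regex) {
        int flags = 0;
        if (multiline) {
            flags |= Pattern.MULTILINE;
        }
        if (!caseSensitive) {
            flags |= Pattern.CASE_INSENSITIVE;
        }
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        PortableCompiler compiler = new PortableCompiler();
        compiler.setMultiline(true);
        compiler.setCaseSensitive(false);
        // With both flags set, ^ and $ match at line boundaries and case is ignored.
        System.out.println(compiler.compile("^smith$").matcher("John\nSmith").find());
    }
}
```

Keeping the flags as properties means client code never mentions an engine-specific constant such as a bit mask.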

Step 3: Create a Matcher object. In Jakarta Oro, this step is very straightforward:

org.apache.oro.text.regex.Perl5Matcher matcher =
    new org.apache.oro.text.regex.Perl5Matcher();

It's so simple because the matcher doesn't need any information to be constructed. It becomes specific to your regexp later. In RegexpPlugin, the step is similar. Instead of creating the matcher yourself, you delegate its creation to the RegexpManager class:

org.acmsl.regexpplugin.Matcher matcher =
    org.acmsl.regexpplugin.RegexpManager.createMatcher();

The difference is that, as before, you need to deal with RegexpEngineNotFoundException. This is because RegexpManager needs to create a matcher adapter for your chosen library or for the default one. If such classes are not available at runtime, it throws that exception.

Step 4: Evaluate the regular expression. The matcher object needs to interpret the regular expression and extract the information needed. This is done in a single line:

if (matcher.contains("John A. Smith", pattern))
{

If the input text matches the regular expression, this method returns true. The implicit side effect is that, after this line, the matcher object contains the first match found in the input text. The next step shows how to actually get the information you're interested in.

Using the RegexpPlugin API, there's no difference at all at this point.

Step 5: Retrieve the first match found. This simple step is done in only one line:

    org.apache.oro.text.regex.MatchResult matchResult = matcher.getMatch();

You declare a local variable to store the object that has the piece of text that matches the regexp. In both cases, the step is the same, except for the variable declaration (since one is an adapter of the other):

    org.acmsl.regexpplugin.MatchResult matchResult =
        matcher.getMatch();

Step 6: Get the group you're interested in. You can use two approaches:

  • A concrete library
  • The RegexpPlugin API

Since your regexp is .*\s*(.*?)\s+.* you have only one group: (.*?)

The MatchResult object contains all groups in an ordered list. You only need to know the position of the group that you want to get. Since this example only has one group, there's no doubt:

    String name = matchResult.group(1);

    [..]
}

The variable name now contains the text John, which is exactly what you wanted.

Step 7: Repeat the process if needed. If the information you need can appear more than once, and you want to analyze all occurrences instead of just the first one, then you only have to put steps 5 and 6 in a loop that runs until the condition described in step 4 is no longer satisfied:

while (matcher.contains("John A. Smith", pattern))
{
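For comparison, this is how a loop over all occurrences looks with the JDK's java.util.regex package, where the Matcher itself remembers the position reached, so each call to find() continues after the previous match. The class name and the regexp are made up for the example. The article does not state how the Oro or RegexpPlugin matcher advances between calls when given a plain String, so check that before relying on a loop like the one above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllMatchesDemo {
    /** Collects every run of non-whitespace characters found in the text. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\S+").matcher(text);
        // Each find() resumes where the previous match ended.
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("John A. Smith"));
    }
}
```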


Mappings

Besides writing a common abstract API, the main effort is actually to implement the adapters to some of the already existing regexp engines in the Java environment.

The following tables provide insight into how to migrate from one library to another. In some cases, the concepts are cleanly separated. In others, it's not so clear.

Regexp concept               GNU Regexp 1.2
Compiler                     gnu.regexp.RE
Pattern                      gnu.regexp.RE
Matcher                      gnu.regexp.REMatchEnumeration, gnu.regexp.RE
Match result                 gnu.regexp.REMatch
Malformed pattern exception  gnu.regexp.REException

Regexp concept               Jakarta Oro 2.0.6
Compiler                     org.apache.oro.text.regex.Perl5Compiler
Pattern                      org.apache.oro.text.regex.Pattern
Matcher                      org.apache.oro.text.regex.Perl5Matcher
Match result                 org.apache.oro.text.regex.MatchResult
Malformed pattern exception  org.[..].regex.MalformedPatternException

Regexp concept               Jakarta Regexp 1.3
Compiler                     org.apache.regexp.RE, org.apache.regexp.RECompiler, org.apache.regexp.REProgram
Pattern                      org.apache.regexp.REProgram, org.apache.regexp.RE
Matcher                      org.apache.regexp.RE, org.apache.regexp.REProgram
Match result                 org.apache.regexp.RE
Malformed pattern exception  org.apache.regexp.RESyntaxException

Regexp concept               JDK 1.4 regex package
Compiler                     java.util.regex.Pattern
Pattern                      java.util.regex.Pattern
Matcher                      java.util.regex.Matcher
Match result                 java.util.regex.Matcher
Malformed pattern exception  java.util.regex.PatternSyntaxException
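The last table shows a case where one class plays several roles. A small sketch makes the mapping visible for the JDK package; JdkMappingDemo and describe are made-up names, and the comments mark which regexp concept each call corresponds to.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class JdkMappingDemo {
    /** Runs the regexp against the text and reports what happened. */
    static String describe(String regex, String text) {
        try {
            Pattern pattern = Pattern.compile(regex);  // compiler and pattern roles
            Matcher matcher = pattern.matcher(text);   // matcher role
            if (matcher.find()) {
                return "match: " + matcher.group();    // match result role
            }
            return "no match";
        } catch (PatternSyntaxException e) {           // malformed pattern exception role
            return "malformed pattern";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("\\d+", "room 42"));
        System.out.println(describe("(", "room 42"));
    }
}
```

An adapter for this package therefore wraps two classes, whereas an adapter for Jakarta Oro would wrap a separate class per concept.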

Benchmarks

One of the clear uses of this API is to compare implementations by measuring performance, compatibility with Perl5 syntax, or other criteria.

The benchmarking utility developed for these tests uses an HTML parser to process Web content, updating information about links, forms, tables, and so on. However, what is important is that the parsing logic is delegated to regular expressions and therefore goes through the RegexpPlugin API.

The benchmark consists of parsing a very simple HTML page 10,000 times. The results are shown in the following table.

Regexp library       Benchmark result (seconds)
Jakarta Oro 2.0.6    130,71
Jakarta Regexp 1.2   23,261
GNU Regexp 1.1.4     1,966.939
JDK 1.4              33,222

You can improve the performance in a real application in several ways. Most important is that, when you work with regexp libraries, you don't need to compile patterns every time. Instead, you can compile them and reuse the respective instances. However, if the regexp itself is not fixed, then the compilation process cannot be skipped.

Since the benchmark needs to switch between implementations to compare performance, compiled patterns must always be discarded to avoid interaction between libraries. However, as you can see, most of the evaluated libraries have similar response times, although a more detailed benchmark should give better insights into how each library behaves under different circumstances.
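The effect of recompiling versus reusing a pattern is easy to measure yourself. The following is a rough sketch, not the benchmarking utility used for the table: it times only java.util.regex, the page and the regexp are made up, and it does no warm-up, so the numbers it prints are indicative at best.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBenchmark {
    /** Counts how many times the pattern occurs in the text. */
    static int count(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    /** Runs the task the given number of times and returns the elapsed nanoseconds. */
    static long time(Runnable task, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final String html = "<a href='x'>x</a> <a href='y'>y</a>"; // made-up page
        final String regex = "<a\\s";                               // made-up pattern
        final Pattern reused = Pattern.compile(regex);

        long recompiling = time(() -> count(Pattern.compile(regex), html), 10_000);
        long reusing = time(() -> count(reused, html), 10_000);

        System.out.println("compile every time: " + recompiling / 1_000_000.0 + " ms");
        System.out.println("reuse pattern:      " + reusing / 1_000_000.0 + " ms");
    }
}
```

Passing the work as a Runnable keeps the timing loop independent of the engine, so the same harness could time an adapter for any other library.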


Summary

Regular expression parsers are powerful. Once a team becomes comfortable, the parsing logic improves, which helps reduce maintenance. However, developers need to know about regexp syntax to understand how such code works. This article has explained how to use one of the libraries in a very simple example. Beyond that, it also described the advantages of using an additional layer to decouple client code from the regexp engine itself.

