Diagnosing Java Code: The Saboteur Data bug pattern

Hidden data bombs may be the key to odd crashes

Eric Allen (eallen@cyc.com), Lead Java Developer, Cycorp, Inc.
Eric Allen has an A.B. in computer science and mathematics from Cornell University. He is currently the lead Java software developer at Cycorp, Inc., and a part-time graduate student in the programming languages team at Rice University. His research concerns the development of formal semantic models of Java, and extensions of Java, both at the source and bytecode levels. Currently, he is implementing a source-to-bytecode compiler for the NextGen programming language, an extension of Java with generic run-time types. Contact Eric at eallen@cyc.com.

Summary:  When a program crashes due to corrupted data, the saboteur can be elusive. Often a program can crash dead in its tracks while manipulating its own internal data, even after working flawlessly for long periods. This article discusses a bug pattern that can be the culprit behind this sort of crash, why it exists, and several methods to eliminate it -- before and after it occurs.

Date:  01 May 2001
Level:  Introductory

One in a million

Being the industrious developer that you are, you've deployed one of your well-written and well-tested applications for several of your clients who needed better access to their complex, massive stores of data.

For each client, the on-site testing period went off without a hitch. You're on your way to the bank, only barely thinking about the six-month software checkup, when your pager goes off. Using your software, one of your clients ran a report and bombed the system.

You rush to the site and run a random test. It works fine. You run another. No problems. You run hundreds more. Still no problem. You check on other clients who have been running this application full-tilt for six months. You get no complaints.

You repeat the report that caused the problem. Crash! What's going on?


The Saboteur Data bug pattern

Many programs need to intensively access and manipulate internally stored data to perform various complex tasks. This data might be retrieved from a large structure in memory, a database, or over a network.

This type of program is highly susceptible to a crash caused by corrupt internal data. I call this bug pattern the Saboteur Data pattern because such data can stay in the system indefinitely, much like Cold War sleeper spies, causing no trouble until the particular bit of data is accessed. The corrupt data then explodes like a bomb.


A syntactic cause

Suppose we have a JDBC application that stores a database table called Mapping that maps String names to serializations of sets of elements. (See Resources for more information on the JDBC API.) Each element of each set refers to a key stored in another table, Properties, containing various known properties of these elements.

Let's say that both Mapping and Properties are initially read from a text file developed by an outside source (outside meaning any data source not generated internally), where each line starts with a name and is followed by a representation of the corresponding set, as follows:


Listing 1. A sample, outside-source text file
In the Mapping file:

apples {macintosh, gala, golden-delicious}
trees  {elm, beech, maple, pine, birch}
rocks  {quartz, limestone, marble, diamond}
...

In the Properties file:

macintosh {color: red, taste: sour}
gala      {color: red, taste: sweet}
diamond   {color: clear, rigidity: hard, value: high}
...

The Mapping and Properties table entries could be parsed and passed into a method that inserts them into a database. But there are potential pitfalls in this approach. For example, let's suppose that we have written a class that handles a JDBC-compliant database. Following the JDBC API, we could define a PreparedStatement object and use it to pass information into the database, as follows:


Listing 2. Using a PreparedStatement to insert domain and range strings
  
  ...

  // Parameterized INSERT for the Mapping table.
  PreparedStatement insertionStmt = 
    con.prepareStatement("INSERT INTO MAPPING VALUES(?,?)");

  ...

  // Stores one Mapping row. Neither String is validated before it is stored.
  public void insertEntry(String domain, String range) 
      throws SQLException {
    insertionStmt.setString(1, domain);
    insertionStmt.setString(2, range);
    insertionStmt.executeUpdate();
  }

Inserting two Strings this way may or may not be all right, depending on how the Strings are obtained from the text file. Suppose, for example, that a simple regular expression-matching tool were used to split each line into two Strings:

  • One String contains all the characters before the first space.
  • One String contains all the characters after the first space.

Such a rudimentary parse of the text file would not catch minor corruption in the data. For example, if one of the lines were in the following form:


Listing 3. A data saboteur
trees  {elm, beech, maple, pine birch}

The comma between "pine" and "birch" is missing. An error such as this can easily result from a bug in the tool that generates the file or from manual editing of the file.

At any rate, the data would enter the database in its corrupt form, waiting silently to be accessed. If the method used to access data expects entries to be separated by commas and spaces, it will crash when reading this entry.

If the program simply distinguishes the elements of the set by commas alone, an even more serious error can occur. The system could interpret "pine birch" as a single type of tree (a single entry of data) and propagate the bug further into the computation.
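
To make the second failure mode concrete, here is a minimal sketch. The class NaiveSplitDemo and its two helper methods are hypothetical names invented for illustration, not part of the application described above. The sketch assumes the line is split at the first run of whitespace and that the set body is split on commas alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: a rudimentary split accepts the corrupt line without complaint.
class NaiveSplitDemo {

    // Splits a line into a name and the remainder at the first run of whitespace.
    static String[] splitLine(String line) {
        return line.trim().split("\\s+", 2);
    }

    // Splits a set body such as "{a, b, c}" on commas alone, with no validation.
    static List<String> elementsByComma(String range) {
        String body = range.substring(1, range.length() - 1); // drop the braces
        List<String> elements = new ArrayList<>();
        for (String piece : body.split(",")) {
            elements.add(piece.trim());
        }
        return elements;
    }

    public static void main(String[] args) {
        String[] parts = splitLine("trees  {elm, beech, maple, pine birch}");
        // The saboteur survives: "pine birch" comes back as a single element.
        System.out.println(parts[0] + " -> " + elementsByComma(parts[1]));
    }
}
```

Nothing in this code fails. The list simply contains four trees instead of five, and every later computation inherits the mistake.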


A semantic cause

Our example is one in which a simple, syntactic feature of the data was violated. Of course, that's not the only way in which the data might be corrupted.

Semantic-level constraints can be violated as well. In our example, one expectation of the data in the Mapping table is that every element in each set is a domain entry in the Properties table. If this invariant were violated, we might end up trying to read an element in the Properties table that wasn't there, causing an exception to be thrown.
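
One way to check this invariant when the data is first loaded is sketched below. The class ReferenceCheck and the method missingKeys are hypothetical, and the sketch assumes both tables have already been read into in-memory collections rather than queried through JDBC.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical check: every element named in Mapping must be a key in Properties.
class ReferenceCheck {

    // Returns the elements that appear in some Mapping set but have no Properties entry.
    static Set<String> missingKeys(Map<String, List<String>> mapping,
                                   Set<String> propertyKeys) {
        Set<String> missing = new LinkedHashSet<>();
        for (List<String> elements : mapping.values()) {
            for (String element : elements) {
                if (!propertyKeys.contains(element)) {
                    missing.add(element);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("apples", Arrays.asList("macintosh", "gala", "golden-delicious"));
        Set<String> propertyKeys =
            new HashSet<>(Arrays.asList("macintosh", "gala", "diamond"));
        // Any element reported here would fail later, when its properties are looked up.
        System.out.println("Missing from Properties: " + missingKeys(mapping, propertyKeys));
    }
}
```

An empty result means the invariant holds for the data checked. Anything else is a saboteur caught before it was stored.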

In this article I use database entries as examples, but a Saboteur Data bug can come at you in a variety of ways -- as many ways as there are data-input avenues. When data is read by a program, whether it is from a file, a keyboard, a microphone, a network, or a power glove, the potential for a saboteur exists.


Cures and preventions

The best defense against the Saboteur Data bug is one that is universally employed by compiler and interpreter developers. Because the input data to these programs is so complex, developers have no choice but to perform as thorough an integrity check as possible when first reading the input, rather than upon later access.

Parsing as an elimination method
In fact, the very practice of parsing input is a way of eliminating many of these bugs. Unfortunately, many programmers -- who would never think of writing a compiler without a parser -- fail to write adequate parsing methods for simpler data. The parsing of simpler data is, of course, easier, but that's no excuse for not parsing it at all.

Any program that reads data -- no matter how simple -- should parse it. After all, such a program can be viewed as a compiler (or an interpreter) over the "language" defined by its set of valid inputs.
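
For the Mapping file used earlier, even a very small parser is enough to stop the missing-comma saboteur at the door. The sketch below is one possible approach, not the only one. The class MappingLineParser is hypothetical, and the grammar it enforces (names and elements made of letters and hyphens, elements separated by commas, at least one element per set) is an assumption inferred from the sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical strict parser for one line of the Mapping file.
class MappingLineParser {

    // Assumed grammar: an identifier, whitespace, then a brace-delimited set body.
    private static final Pattern LINE =
        Pattern.compile("([A-Za-z][A-Za-z-]*)\\s+\\{(.*)\\}");
    // Assumed element syntax: letters and hyphens only, so "pine birch" is rejected.
    private static final Pattern ELEMENT = Pattern.compile("[A-Za-z][A-Za-z-]*");

    // Returns the set elements, or throws if the line violates the assumed grammar.
    static List<String> parseElements(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        List<String> elements = new ArrayList<>();
        // A limit of -1 keeps trailing empty strings, so "{a, b,}" is rejected too.
        for (String piece : m.group(2).split(",", -1)) {
            String element = piece.trim();
            if (!ELEMENT.matcher(element).matches()) {
                throw new IllegalArgumentException(
                    "Malformed element '" + element + "' in: " + line);
            }
            elements.add(element);
        }
        return elements;
    }

    public static void main(String[] args) {
        System.out.println(parseElements("trees  {elm, beech, maple, pine, birch}"));
        try {
            parseElements("trees  {elm, beech, maple, pine birch}");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected at read time: " + e.getMessage());
        }
    }
}
```

The point of the design is timing: the corrupt line is rejected while the file is being read, when the offending line is still at hand, instead of months later when a report happens to touch it.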

Take it from someone who has been there. In my young and reckless days, I was guilty of manipulating data without proper parsing, and I suffered the consequences -- rampant saboteurs. I don't recommend the experience.

Type checking as an elimination method
Another common form of checking done by compilers for many languages (including, of course, the Java language) is type checking. Type checking is an example of a semantic-level check on the integrity of a program.

Provided that the type system is sound (as is the Java type system), this integrity check literally guarantees that a huge class of errors can never occur at run time. Like parsing, this example from compiler writers can be applied to other programs, which often stipulate semantic-level invariants over their input data (as in our example). These invariants are often not explicit, but they can be made explicit by putting in the corresponding checks.

Iteration as an elimination method
Of course, if you suspect an occurrence of this bug pattern with data that has been read in and stored already, it would be prudent to iterate over the data, accessing each datum as it would be in the deployed application, and ensuring that everything works as expected. In the process, you might be able to correct simple errors as well.
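
Such an audit might look like the following sketch. The class StoredDataAudit is hypothetical. To stay self-contained it walks an in-memory Map standing in for the rows of the Mapping table, and readSet is only a stand-in for whatever access path the deployed application really uses (here assumed to expect single-word elements separated by a comma and a space).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical audit: exercise every stored entry the way the application would.
class StoredDataAudit {

    // Runs the accessor over each stored value and records every failure by key.
    static Map<String, String> audit(Map<String, String> storedRows,
                                     Consumer<String> accessor) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (Map.Entry<String, String> row : storedRows.entrySet()) {
            try {
                accessor.accept(row.getValue());
            } catch (RuntimeException e) {
                failures.put(row.getKey(), String.valueOf(e.getMessage()));
            }
        }
        return failures;
    }

    // Stand-in for the application's access path (an assumption of this sketch).
    static void readSet(String serialized) {
        if (serialized.length() < 2
                || !serialized.startsWith("{") || !serialized.endsWith("}")) {
            throw new IllegalArgumentException("missing braces");
        }
        String body = serialized.substring(1, serialized.length() - 1);
        for (String element : body.split(", ")) {
            if (element.isEmpty() || element.contains(" ") || element.contains(",")) {
                throw new IllegalArgumentException("bad element '" + element + "'");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("apples", "{macintosh, gala, golden-delicious}");
        rows.put("trees", "{elm, beech, maple, pine birch}");
        // Only the corrupt row is reported; healthy rows pass silently.
        System.out.println(audit(rows, StoredDataAudit::readSet));
    }
}
```

Because the audit collects failures instead of stopping at the first one, a single pass yields the full list of rows that need repair.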

A caveat on elimination methods
I certainly don't mean to imply that it is always possible to perform enough checks to eliminate all saboteur data from a program. If that were the case, it wouldn't be a potentially problem-causing bug pattern.

There are many reasons why a saboteur might be undetectable before it starts wreaking havoc:

  • The data necessary to perform all the checks is not available until after the saboteurs are stored away.
  • The complete set of constraints is not even computable (as is the case for compilers and interpreters).
  • The constraints are computable, but the resources required to check them are beyond the access of the program.

In such cases, the best we can do is eliminate as many forms of saboteurs as we can.


Wrapup

Here is the summary of the Saboteur Data bug pattern:

  • Pattern: Saboteur Data
  • Symptoms: A program that stores and manipulates complex input data crashes unexpectedly while performing a task similar to other tasks that cause no problem.
  • Cause: Some of the internal data is corrupted, either syntactically or semantically.
  • Cures and preventions: Perform as many integrity checks on input data as possible, as early as possible. For persistent data that is already corrupt, walk over it and check for integrity.

The Golden Rule for eliminating data saboteurs: Any program that reads data should parse the data. Good luck in stamping them out!


Resources

  • For more information on JDBC, check out the tutorial, "Managing database connections with JDBC" (developerWorks, November 2001).
  • Take the Java debugging tutorial (developerWorks, February 2001) for help with general debugging techniques.
  • Visit the Enhydra home page to download Instant DB, a free JDBC driver. Although it takes some tricks to make it scale to large databases, it handles moderately sized tables quite well.
  • The Java language has many analogues to Lex and Yacc, the standard parser-generator tools. Check out ANTLR for one such example.
  • Read Eric's complete series on bug patterns.
