Being the industrious developer that you are, you've deployed one of your well-written and well-tested applications for several of your clients who needed better access to their complex, massive stores of data.
For each client, the on-site testing period went off without a hitch. You're on your way to the bank, only barely thinking about the six-month software checkup, when your pager goes off. Using your software, one of your clients ran a report and bombed the system.
You rush to the site and run a random test. It works fine. You run another. No problems. You run hundreds more. Still no problem. You check on other clients who have been running this application full-tilt for six months. You get no complaints.
You repeat the report that caused the problem. Crash! What's going on?
Many programs need to intensively access and manipulate internally stored data to perform various complex tasks. This data might be retrieved from a large structure in memory, a database, or over a network.
This type of program is highly susceptible to a crash caused by corrupt internal data. I call this bug pattern the Saboteur Data pattern because such data can stay in the system indefinitely, much like Cold War sleeper spies, causing no trouble until the particular bit of data is accessed. The corrupt data then explodes like a bomb.
Suppose we have a JDBC application that stores a database table called Mapping that maps String names to serializations of sets of elements. (See Resources for more information on the JDBC API.) Each element of each set refers to a key stored in another table, Properties, containing various known properties of these elements.
Let's say that both Mapping and Properties are initially read from a text file developed by an outside source (outside meaning any data source not generated internally), where each line starts with a name and is followed by a representation of the corresponding set, as follows:
Listing 1. A sample, outside-source text file
In the Mapping file:
apples {macintosh, gala, golden-delicious}
trees {elm, beech, maple, pine, birch}
rocks {quartz, limestone, marble, diamond}
...
In the Properties file:
macintosh {color: red, taste: sour}
gala {color: red, taste: sweet}
diamond {color: clear, rigidity: hard, value: high}
...
The Mapping and Properties table entries could be parsed and passed into a method that inserts them into a database. But there are potential pitfalls in this approach. For example, let's suppose that we have written a class that handles a JDBC-compliant database. Following the JDBC API, we could define a PreparedStatement object and use it to pass information into the database, as follows:
Listing 2. Using a PreparedStatement to insert domain and range strings
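import java.sql.PreparedStatement;
import java.sql.SQLException;
...
// con is an open java.sql.Connection
PreparedStatement insertionStatement =
    con.prepareStatement("INSERT INTO MAPPING VALUES(?,?)");
...
// Inserts one (name, serialized set) row into the Mapping table
public void insertEntry(String domain, String range)
    throws SQLException {
  insertionStatement.setString(1, domain);
  insertionStatement.setString(2, range);
  insertionStatement.executeUpdate();
}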
Inserting two Strings this way may or may not be all right, depending on how the Strings are obtained from the text file. Suppose, for example, that a simple regular expression-matching tool were used to split each line into two Strings:
- One String contains all the characters before the first space.
- One String contains all the characters after the first space.
Such a rudimentary parse of the text file would not catch minor corruption in the data. For example, if one of the lines were in the following form:
Listing 3. A data saboteur
trees {elm, beech, maple, pine birch}
The comma between "pine" and "birch" is missing. An error such as this can easily result from a bug in the tool that generates the file or from manual editing of the file.
At any rate, the data would enter the database in its corrupt form, waiting silently to be accessed. If the method used to access data expects entries to be separated by commas and spaces, it will crash when reading this entry.
If the program simply distinguishes the elements of the set by commas alone, an even more serious error can occur. The system could interpret "pine birch" as a single type of tree (a single entry of data) and propagate the bug further into the computation.
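The following is a minimal sketch of how the corrupt line slips through, assuming the line is split at the first space and the set is later broken up on commas alone. The class NaiveSplit and its methods are hypothetical names used only for illustration; they are not part of the application described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a rudimentary parse; not the article's code.
class NaiveSplit {
    // Splits "name {a, b, c}" into two Strings at the first space.
    static String[] splitAtFirstSpace(String line) {
        int i = line.indexOf(' ');
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }

    // Distinguishes elements by commas alone, with no check on their form.
    static List<String> elementsByComma(String range) {
        String inner = range.substring(1, range.length() - 1); // drop { and }
        List<String> result = new ArrayList<>();
        for (String piece : inner.split(",")) {
            result.add(piece.trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] parts =
            splitAtFirstSpace("trees {elm, beech, maple, pine birch}");
        System.out.println(parts[0]);                  // prints: trees
        // prints: [elm, beech, maple, pine birch]
        System.out.println(elementsByComma(parts[1]));
    }
}
```

Neither step raises an error: the saboteur is accepted, and "pine birch" comes back as the fourth of four elements.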
Our example is one in which a simple, syntactic feature of the data was violated. Of course, that's not the only way in which the data might be corrupted.
Semantic-level constraints can be violated as well. In our example, one expectation of the data in the Mapping table is that every element in each set is a domain entry in the Properties table. If this invariant were violated, we might end up trying to read an element in the Properties table that wasn't there, causing an exception to be thrown.
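One way to make this invariant explicit is to check it when the data is loaded. The sketch below assumes both tables have already been read into memory; ReferenceCheck and danglingElements are hypothetical names, and a real application would likely run the equivalent check against the database itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; in-memory collections stand in for the tables.
class ReferenceCheck {
    // Returns "name: element" for every element of a Mapping set
    // that has no corresponding key in Properties.
    static List<String> danglingElements(Map<String, Set<String>> mapping,
                                         Set<String> propertyKeys) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : mapping.entrySet()) {
            for (String element : entry.getValue()) {
                if (!propertyKeys.contains(element)) {
                    missing.add(entry.getKey() + ": " + element);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> mapping = new LinkedHashMap<>();
        mapping.put("apples", new LinkedHashSet<>(
            Arrays.asList("macintosh", "gala", "golden-delicious")));
        Set<String> propertyKeys =
            new HashSet<>(Arrays.asList("macintosh", "gala"));
        // prints: [apples: golden-delicious]
        System.out.println(danglingElements(mapping, propertyKeys));
    }
}
```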
In this article I use database entries as examples, but a Saboteur Data bug can come at you in a variety of ways -- as many ways as there are data-input avenues. When data is read by a program, whether it is from a file, a keyboard, a microphone, a network, or a power glove, the potential for a saboteur exists.
The best defense against the Saboteur Data bug is one that is universally employed by compiler and interpreter developers. Because the input data to these programs is so complex, developers have no choice but to perform as thorough an integrity check as possible when first reading the input, rather than upon later access.
Parsing as an elimination method
In fact, the very practice of parsing input is a way of eliminating many of these bugs. Unfortunately, many programmers -- who would never think of writing a compiler without a parser -- fail to write adequate parsing methods for simpler data. The parsing of simpler data is, of course, easier, but that's no excuse for not parsing it at all.
Any program that reads data -- no matter how simple -- should parse it. After all, such a program can be viewed as a compiler (or an interpreter) over the "language" defined by its set of valid inputs.
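For the Mapping file, even a very small parser is enough to reject the corrupt line at load time. The sketch below assumes a grammar inferred from Listing 1: a name, one space, and a brace-enclosed list of names separated by a comma and a space, where names consist of letters and hyphens. MappingLineParser is a hypothetical class, and the real file format may well allow more than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical strict parser for one line of the Mapping file.
class MappingLineParser {
    // Assumed lexical form of a name: letters and hyphens only.
    private static final String WORD = "[A-Za-z][A-Za-z-]*";

    // name, one space, then {word, word, ...}
    private static final Pattern LINE = Pattern.compile(
        "(" + WORD + ") \\{(" + WORD + "(?:, " + WORD + ")*)\\}");

    // Returns the elements of the set, or throws if the whole line
    // does not match the assumed grammar.
    static List<String> parseElements(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                "Malformed Mapping line: " + line);
        }
        return new ArrayList<>(Arrays.asList(m.group(2).split(", ")));
    }
}
```

With this in place, parseElements("trees {elm, beech, maple, pine, birch}") returns five elements, while the line from Listing 3 throws an IllegalArgumentException before anything reaches the database.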
Take it from someone who has been there. In my young and reckless days, I was guilty of manipulating data without proper parsing, and I suffered the consequences -- rampant saboteurs. I don't recommend the experience.
Type checking as an elimination method
Another common form of checking done by compilers for many languages (including, of course, the Java language) is type checking. Type checking is an example of a semantic-level check on the integrity of a program.
Provided that the type system is sound (as is the Java type system), this integrity check literally guarantees that a huge class of errors can never occur at run time. Like parsing, this example from compiler writers can be applied to other programs, which often stipulate semantic-level invariants over their input data (as in our example). These invariants are often not explicit, but they can be made explicit by putting in the corresponding checks.
Iteration as an elimination method
Of course, if you suspect an occurrence of this bug pattern with data that has been read in and stored already, it would be prudent to iterate over the data, accessing each datum as it would be in the deployed application, and ensuring that everything works as expected. In the process, you might be able to correct simple errors as well.
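Such a sweep might look like the sketch below. It assumes the stored rows can be read back as name/set pairs, and it uses a simple format check in place of the application's real access path. StoredDataAudit is a hypothetical name, and an in-memory map stands in for the Mapping table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical audit of data that has already been stored.
class StoredDataAudit {
    // Assumed format of a stored set: {word, word, ...}
    private static final Pattern SET =
        Pattern.compile("\\{[A-Za-z-]+(?:, [A-Za-z-]+)*\\}");

    // Walks every stored row and returns the names whose sets fail the check.
    static List<String> findSaboteurs(Map<String, String> rows) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, String> row : rows.entrySet()) {
            if (!SET.matcher(row.getValue()).matches()) {
                bad.add(row.getKey());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("apples", "{macintosh, gala, golden-delicious}");
        rows.put("trees", "{elm, beech, maple, pine birch}");
        System.out.println(findSaboteurs(rows)); // prints: [trees]
    }
}
```

Collecting the offending names, rather than stopping at the first failure, makes it possible to report or repair all of the saboteurs in a single pass.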
A caveat on elimination methods
I certainly don't mean to imply that it is always possible to perform enough checks to eliminate all saboteur data from a program. If that were the case, it wouldn't be a potentially problem-causing bug pattern.
There are many reasons why a saboteur might be undetectable before it starts wreaking havoc:
- The data necessary to perform all the checks is not available until after the saboteurs are stored away.
- The complete set of constraints is not even computable (as is the case for compilers and interpreters).
- The constraints are computable, but the resources required to check them are beyond the access of the program.
In such cases, the best we can do is eliminate as many forms of saboteurs as we can.
Here is the summary of the Saboteur Data bug pattern:
- Pattern: Saboteur Data
- Symptoms: A program that stores and manipulates complex input data crashes unexpectedly while performing a task similar to other tasks that cause no problem.
- Cause: Some of the internal data is corrupt, either syntactically or semantically.
- Cures and preventions: Perform as many integrity checks on input data as possible, as early as possible. For persistent data that is already corrupt, walk over it and check for integrity.
The Golden Rule for eliminating data saboteurs: Any program that reads data should parse the data. Good luck in stamping them out!
- For more information on JDBC, check out the tutorial, "Managing database connections with JDBC" (developerWorks, November 2001).
- Take the Java debugging tutorial (developerWorks, February 2001) for help with general debugging techniques.
- Visit the Enhydra home page to download Instant DB, a free JDBC driver. Although it takes some tricks to make it scale to large databases, it handles moderately sized tables quite well.
- The Java language has many analogues to Lex and Yacc, the standard parser-generator tools. Check out ANTLR for one such example.
- Read Eric's complete series on bug patterns.
Eric Allen has an A.B. in computer science and mathematics from Cornell University. He is currently the lead Java software developer at Cycorp, Inc., and a part-time graduate student in the programming languages team at Rice University. His research concerns the development of formal semantic models of Java, and extensions of Java, both at the source and bytecode levels. Currently, he is implementing a source-to-bytecode compiler for the NextGen programming language, an extension of Java with generic run-time types. Contact Eric at eallen@cyc.com.