When you are developing a new system using object-oriented technologies such as Enterprise JavaBeans (EJB), C++, or C#, you are sometimes in a position to develop your data schema from scratch. If so, consider yourself to be among the lucky few. The vast majority of developers are forced to tolerate an existing legacy design that is often difficult, if not impossible, to change because changes in the existing legacy design would necessitate corresponding changes to the legacy applications that access it. The problem your legacy database presents is often too difficult to fix immediately, so you have to learn to work around it. This is true whether you are following a heavyweight software process such as the Enterprise Unified Process (EUP), or an agile one such as eXtreme Programming (XP) or Agile Modeling (AM).
How do you learn to live with a legacy data design? The first step is to understand the scope of the challenge you are facing. Start by identifying and understanding the impact of typical data-related problems that you'll encounter with legacy data. Table 1 lists the most common data problems that you'll encounter and summarizes their potential impact on your application. It is important to note that you are likely to experience several of these problems in any given database, and that any given table or even column within the database may exhibit the problems.
Table 1. Typical legacy data problems
| Problem | Example | Potential impact |
| A single column being used for several purposes | Additional information for an inventory item is stored in the Notes column. Additional information will be one or more of: a lengthy description of the item, storage requirements, or safety requirements when handling the item. |
|
| The purpose of a column is determined by the value of one or more other columns | If the value of DateType is 17, then PersonDate represents the date of birth of the person. If the value is 84, then PersonDate is the person's date of graduation from high school. If the value is between 35 and 48, then it is the date the person entered high school. |
|
| Incorrect data values | The AgeInYears column for a person contains the value -3. Or the AgeInYears column contains 7 although the BirthDate is August 14 1967 and the current date is October 10 2001. |
|
| Inconsistent/incorrect data formatting | The name of a person is stored in one table in the format "Firstname Surname" and in another table in the format "Surname, Firstname". |
|
| Missing data | The date of birth of a person has not been recorded in some records. |
|
| Missing columns | You need a middle name of a person but a column for it does not exist. |
|
| Additional columns | The Social Security number for a person is stored in the database and you don't need it. |
|
| Multiple sources for the same data | Customer information is stored in three separate legacy databases. |
|
| Important entities, attributes, and relationships hidden and floating in text fields | A notes text field contains the information ("Clark and Lois Kent, Daily Planet Publications"). |
|
| Data values that stray from their field descriptions and business rules | The maiden name column is being used to store a person's fabric preference for clothing. |
|
| Various key strategies for the same type of entity | One table stores customer information using the Social Security number as the key, another uses the ClientID as the key, and another uses a surrogate key. |
|
| Unrealized relationships between data records | A customer has a summer home. Both pieces of data are recorded in your database, but there is no relationship stored in the database regarding this fact. |
|
| One attribute is stored in several fields | The Person class requires a single name field whereas it is stored in the columns FirstName and Surname in your database. |
|
| Inconsistent use of special characters | A date uses hyphens to separate the year, month, and day whereas a numerical value stored as a string uses hyphens to indicate negative numbers. |
|
| Different data types for similar columns | A customer ID is stored as a number in one table and a string in another. |
|
| Different levels of detail | An object requires the total sales for the month but your database stores individual totals for each order, or an object requires the weight of individual components of an item, such as the doors and engine of a car, but your database only records the aggregate weight. |
|
| Different modes of operation | Some data is a read-only snapshot of information whereas other data is read write. |
|
| Varying timeliness of data | The Customer data is current, Address data is one day out of date, and the data pertaining to countries and states is accurate to the end of the previous quarter because you purchase that information from an external source. |
|
| Varying default values | Your object uses a default of Green for a given value yet another application has been using Yellow, resulting in a preponderance (in the opinion of your users) of Yellow values stored in the database. |
|
| Various representations | The day of the week is stored as T, Tues, 2, and Tuesday in four separate columns. |
|
In the next tip in this three-tip series I'll compare and contrast common design problems you are likely to encounter, and in the third tip I'll explore potential solutions and tools to address those problems.
Note: This tip was modified from the Mastering Enterprise JavaBeans 2/e, to be published in autumn of 2001.
- Read about the object-data divide and how to overcome it in your EJB projects in two of Scott Ambler's tips from April 2001:
- Learn more about Agile Modeling and eXtreme Programming:
Scott W. Ambler is a Practice Leader for Agile Development within the IBM Methods group. He develops process materials, speaks at conferences, and works with IBM clients worldwide to help improve their software processes. Scott is author of several books, listed on his Web site at www.ambysoft.com. Scott is also a recognized Ratonal Thought Leader, whose homepage may be viewed here.