Challenges with legacy data

Knowing your data enemy is the first step in overcoming it

Scott W. Ambler, Practice Leader, Agile Development, Rational Methods Group, IBM, Software Group
Scott W. Ambler is a Practice Leader for Agile Development within the IBM Methods group. He develops process materials, speaks at conferences, and works with IBM clients worldwide to help improve their software processes. Scott is the author of several books, listed on his Web site at www.ambysoft.com. Scott is also a recognized Rational Thought Leader.

Summary:  Most developers have to live with questionable legacy data designs and work around the problems they present. Knowing what problems to look for is the first step in overcoming them. In this tip, Scott W. Ambler helps you recognize typical problems that you're likely to encounter with legacy data, and he helps you understand their impact.

Date:  01 Jul 2001
Level:  Introductory

When you are developing a new system using object-oriented technologies such as Enterprise JavaBeans (EJB), C++, or C#, you are sometimes in a position to develop your data schema from scratch. If so, consider yourself to be among the lucky few. The vast majority of developers are forced to tolerate an existing legacy design that is often difficult, if not impossible, to change because changes in the existing legacy design would necessitate corresponding changes to the legacy applications that access it. The problem your legacy database presents is often too difficult to fix immediately, so you have to learn to work around it. This is true whether you are following a heavyweight software process such as the Enterprise Unified Process (EUP), or an agile one such as eXtreme Programming (XP) or Agile Modeling (AM).

How do you learn to live with a legacy data design? The first step is to understand the scope of the challenge you are facing. Start by identifying the typical data-related problems in your legacy data and understanding their impact. Table 1 lists the most common data problems and summarizes their potential impact on your application. Note that you are likely to experience several of these problems in any given database, and that any given table, or even a single column, may exhibit these problems.

Table 1. Typical legacy data problems

Problem: A single column being used for several purposes
Example: Additional information for an inventory item is stored in the Notes column. Additional information will be one or more of: a lengthy description of the item, storage requirements, or safety requirements when handling the item.
Potential impact:
  • One or more attributes of your objects may need to be mapped to this field, requiring a complex parsing algorithm to determine the proper usage of the column.
  • Your objects may be forced to implement a single similar attribute instead of the several attributes your design originally described.

Problem: The purpose of a column is determined by the value of one or more other columns
Example: If the value of DateType is 17, then PersonDate represents the date of birth of the person. If the value is 84, then PersonDate is the person's date of graduation from high school. If the value is between 35 and 48, then it is the date the person entered high school.
Potential impact:
  • A potentially complex mapping is required to work with the value stored in the column.

Problem: Incorrect data values
Example: The AgeInYears column for a person contains the value -3. Or the AgeInYears column contains 7 although the BirthDate is August 14, 1967 and the current date is October 10, 2001.
Potential impact:
  • Your objects will need to implement validation code to ensure that their base data values are correct.
  • Strategies to replace incorrect values may need to be defined and implemented.
  • An error-handling strategy will need to be developed to deal with bad data. This may include logging the error, attempting to fix the error, or dropping the data from processing until the problem is corrected.

Problem: Inconsistent/incorrect data formatting
Example: The name of a person is stored in one table in the format "Firstname Surname" and in another table in the format "Surname, Firstname".
Potential impact:
  • Parsing code will be required to retrieve and store the data as appropriate.

Problem: Missing data
Example: The date of birth of a person has not been recorded in some records.
Potential impact:
  • See the strategies for dealing with incorrect data values.

Problem: Missing columns
Example: You need the middle name of a person, but a column for it does not exist.
Potential impact:
  • You may need to add the column to the existing legacy schema.
  • You may need to do without the data.
  • You may need to identify a default value to use until the data is available.
  • An alternate source for the data may need to be found.

Problem: Additional columns
Example: The Social Security number for a person is stored in the database and you don't need it.
Potential impact:
  • If other applications require these columns, you may need to implement them in your objects so that those applications can use the data your application generates.
  • You may need to write the appropriate default value to the database when inserting a new record.
  • For database updates, you may need to read the original value and then write it back out again.

Problem: Multiple sources for the same data
Example: Customer information is stored in three separate legacy databases.
Potential impact:
  • Identify a single source for your information and use only that source.
  • Be prepared to access multiple sources for the same information.
  • Identify rules for choosing a preferred source when you discover the same information is stored in several places.

Problem: Important entities, attributes, and relationships hidden and floating in text fields
Example: A notes text field contains the information "Clark and Lois Kent, Daily Planet Publications".
Potential impact:
  • Develop code to parse the information from the fields.
  • Do without the information.

Problem: Data values that stray from their field descriptions and business rules
Example: The maiden name column is being used to store a person's fabric preference for clothing.
Potential impact:
  • You need to update the documentation to reflect the actual usage.
  • Developers who took the documentation at face value may need to update their code.
  • Data analysis should be performed to determine the exact usage in case different applications are using the field for different purposes.

Problem: Various key strategies for the same type of entity
Example: One table stores customer information using the Social Security number as the key, another uses the ClientID as the key, and another uses a surrogate key.
Potential impact:
  • You need to be prepared to access similar data via several strategies, implying the need for similar finder operations in some classes.
  • Some attributes of an object may be immutable (their values cannot be changed) because they represent part of a key in your relational database.

Problem: Unrealized relationships between data records
Example: A customer has a summer home. Both pieces of data are recorded in your database, but there is no relationship stored in the database regarding this fact.
Potential impact:
  • Data may be inadvertently replicated: eventually a new address record is created for the summer home (and the relationship is now defined) even though one already exists.
  • Additional code may need to be developed to detect potential problems. Procedures for handling the problems will also be required.

Problem: One attribute is stored in several fields
Example: The Person class requires a single name field, whereas the name is stored in the columns FirstName and Surname in your database.
Potential impact:
  • Potentially complex parsing code may be required to retrieve and then save the data.

Problem: Inconsistent use of special characters
Example: A date uses hyphens to separate the year, month, and day, whereas a numerical value stored as a string uses hyphens to indicate negative numbers.
Potential impact:
  • The complexity of parsing code increases.
  • Additional documentation is required to indicate character usage.

Problem: Different data types for similar columns
Example: A customer ID is stored as a number in one table and a string in another.
Potential impact:
  • You may need to decide how you want the data to be handled by your objects, and then transform it to/from your data source(s) as appropriate.
  • If foreign key columns have a different type than the original data they represent, then table joins (and hence any SQL embedded in your objects) become more difficult.

Problem: Different levels of detail
Example: An object requires the total sales for the month but your database stores individual totals for each order, or an object requires the weight of individual components of an item, such as the doors and engine of a car, but your database only records the aggregate weight.
Potential impact:
  • Potentially complex mapping code may be required to resolve the various levels of detail.

Problem: Different modes of operation
Example: Some data is a read-only snapshot of information, whereas other data is read-write.
Potential impact:
  • The design of your objects must reflect the nature of the data they are mapped to. Objects based on read-only data therefore cannot update or delete it.

Problem: Varying timeliness of data
Example: The Customer data is current, Address data is one day out of date, and the data pertaining to countries and states is accurate to the end of the previous quarter because you purchase that information from an external source.
Potential impact:
  • Your object code must reflect, and potentially report to its clients, the timeliness of the information that it is based on.

Problem: Varying default values
Example: Your object uses a default of Green for a given value, yet another application has been using Yellow, resulting in a preponderance (in the opinion of your users) of Yellow values stored in the database.
Potential impact:
  • You may need to negotiate a new default value with your users.
  • You may not be allowed to store your default value (for example, Green is an illegal value in the database).

Problem: Various representations
Example: The day of the week is stored as T, Tues, 2, and Tuesday in four separate columns.
Potential impact:
  • Translation code will need to be developed to convert between the stored representations and a common value that your object(s) use.
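
Several of the impacts in Table 1 come down to small pieces of mapping, validation, and parsing code. The following is a minimal sketch in Python of three of them, using the examples from the table: a column whose purpose depends on another column (DateType and PersonDate), an incorrect data value (AgeInYears), and inconsistent name formatting. The function names and the attribute names they return are hypothetical and chosen only for illustration. The code codes 17, 84, and 35 through 48 come from the table's example, not from any real schema.

```python
from datetime import date
from typing import Optional

# DateType codes from the Table 1 example (illustrative, not a real schema).
BIRTH = 17
HIGH_SCHOOL_GRADUATION = 84
HIGH_SCHOOL_ENTRY = range(35, 49)  # 35 through 48 inclusive


def date_attribute_for(date_type: int) -> Optional[str]:
    """Map a DateType code to the (hypothetical) attribute that PersonDate feeds."""
    if date_type == BIRTH:
        return "date_of_birth"
    if date_type == HIGH_SCHOOL_GRADUATION:
        return "high_school_graduation"
    if date_type in HIGH_SCHOOL_ENTRY:
        return "high_school_entry"
    return None  # unknown code: the caller decides how to handle it


def expected_age(birth_date: date, today: date) -> int:
    """Compute age in whole years as of `today`."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def age_problems(age_in_years: int, birth_date: Optional[date], today: date) -> list:
    """Return descriptions of the problems found in a stored AgeInYears value."""
    problems = []
    if age_in_years < 0:
        problems.append("negative age")
    if birth_date is not None and age_in_years != expected_age(birth_date, today):
        problems.append("age disagrees with BirthDate")
    return problems


def split_name(raw: str) -> tuple:
    """Return (first_name, surname) from 'Surname, Firstname' or 'Firstname Surname'."""
    if "," in raw:
        surname, first_name = (part.strip() for part in raw.split(",", 1))
    else:
        # Assumption: the last word is the surname; real names may need more care.
        first_name, _, surname = raw.strip().rpartition(" ")
    return first_name, surname
```

For the table's example, age_problems(7, date(1967, 8, 14), date(2001, 10, 10)) reports that the stored age disagrees with BirthDate. The sketch only detects problems. Whether to log, repair, or drop the record is the error-handling strategy you still need to define.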

In the next tip in this three-tip series I'll compare and contrast common design problems you are likely to encounter, and in the third tip I'll explore potential solutions and tools to address those problems.

Note: This tip was adapted from Mastering Enterprise JavaBeans, 2/e, to be published in autumn 2001.


