A growing database can be a good news/bad news thing. On the one hand, a database that's getting bigger is something to smile about if it reflects growth in a company's business: more sales, more customers, more prospects. On the other hand, bigger can be bad when the increase in data volume occurs largely as a result of records going into the database and never coming out.
There comes a time when old rows in a table—particularly in tables that record business events such as sales transactions or insurance claims or stock trades—are just taking up space. They are rarely, if ever, requested by applications, so they provide little in the way of value. What you need to do is get them out of the database (without necessarily getting rid of them entirely; more on that momentarily). If you leave them where they are, they'll cost you—big time.
The high cost of old and cold data
I sometimes think of records in a database as having a temperature. Those that are frequently retrieved or updated by users or applications are hot. Those that are never or rarely retrieved or changed are cold. Generally speaking, the older a record gets, the colder it gets (so-called code or reference tables, which tend to be very static in nature, are an exception to this rule). A buildup of cold data in your database can cost you in several ways:
- More expensive SQL (part 1): The top determinant of an IBM DB2 application workload's CPU cost is the number of pages that DB2 must access in processing the workload (these accesses are known as
GETPAGEs in a DB2 for z/OS environment and as logical reads in a DB2 for Linux, UNIX, and Windows environment). If data in a table is not clustered by date—if, for example, it is clustered by customer ID or order number—the percentage of rows in each page that are relatively cold will increase over time if old rows are not periodically removed from the database. That, in turn, can drive up page accesses (and, therefore, the CPU cost of SQL statement execution), because now DB2 must look in more pages to retrieve the same number of hot rows (and remember, these are the rows that are most frequently requested by users).
- More expensive SQL (part 2): You might be thinking, "That more-expensive SQL problem is not one that I face, because my business-event-recording tables are clustered by date." It's true that such a clustering scheme will concentrate cold rows at one end of the table, while pages at the other end of the table are dense with hot rows; however, there's the little matter of index access to consider. As you probably know, a DB2 index is logically structured like an upside-down tree, with a root page at the top, leaf pages with index key values and corresponding row IDs at the bottom and, unless the table is rather small, some number of levels (usually one or two) of non-leaf pages in between. When a table grows because cold rows are never deleted, eventually an n-level index on the table will become an n+1-level index. When a three-level index goes to four levels, the number of page accesses required to retrieve a row by way of an index goes up by 25 percent, from four (index root page, non-leaf page, leaf page, and table page) to five (an additional non-leaf index page).
- More I/Os, longer run times: As table pages become ever more full of cold rows (and index leaf pages with cold key values), the effectiveness of the buffer pool configuration goes down unless buffer pools are enlarged accordingly—another cost. Fewer buffer pool "hits" mean more I/O operations, and that increases SQL statement run times.
- More expensive utilities: Utility processes that are executed on a regular basis, such as those that back up and reorganize database objects and update catalog statistics used for query optimization, consume more CPU time as tables and indexes grow.
- More disk space: Obviously, the larger the database, the greater the amount of disk space required to hold the database objects. The cost per megabyte of disk storage is less than it used to be, but no one's giving the stuff away.
So rows in your database that are never, or hardly ever, retrieved or acted upon by programs increase the cost of your data-serving system on multiple fronts. Getting those cold rows out of the database would be a good thing, but don't take an ax to the problem. You'll need to think it through carefully.
From the database to where? And when? And for how long?
A well-thought-out data archive and purge strategy can be a big help in maintaining the cost efficiency of your operational, run-the-business database systems. But cost control is not the whole story—there are also implications for regulatory compliance. The bottom line is that data archiving and purging can deliver a multitude of benefits if you do it right, and doing it right means effectively addressing several key questions, including:
- When should data records be removed from operational databases? Sometimes, this is a relatively easy question to answer, such as when the age of a record is the sole criterion for deletion from a table. For example, you might have an application that provides users with detailed account activity information over the past 12 or 24 months. In other cases, purge criteria can be more complex. It may be that a row recording a transaction cannot be deleted from a table if a certain related business event—perhaps recorded in another table—has or has not occurred. Establishing proper purge criteria requires a thorough understanding of the business processes associated with the data in question.
- One-step or two-step removal? It may be that an old, cold record is taken right out of the database at the appropriate time. Alternatively, a record may be moved, but not deleted, when it goes from hot to warm (accessed occasionally, but not frequently). Suppose 90 percent of requests for data in a table are for rows that have been inserted within the last 3 months, but you must support an application requirement for 12-month history retrieval. Given such a scenario, an organization may opt to keep the most recent 3 months of data in one table and data for the succeeding 9 months in another table—an arrangement that could help to optimize the performance of the "90 percent" data requests.
- Into the bit bucket, or into an archive? When you remove a record from an operational database, can it truly be thrown away, or must it be retained for legal reasons? If the latter, how long must it be retained?
- What about the physical location? Typically, access to archived data will be infrequent. That being the case, does it make sense to store archived data on expensive, high-performance disk systems? A better fit would likely be a higher-density storage device that provides online accessibility with reasonable performance. And while you're at it, wouldn't it be nice if the same data archive could be used by different source database systems running on multiple hardware/OS platforms? Infrastructure simplification, anyone?
- What about security? Your organization has probably taken pains to ensure that data in your operational databases is properly secured, protecting it from unauthorized access. Will protection be similarly robust for archived data? It had better be, or good luck passing that audit.
- How about usability? OK, so you have a place where a record removed from the operational database can be economically and securely stored for a long time. Suppose someone needs to retrieve that record? Can it be quickly and easily located? Can the information in the record be presented in a usable format? Is there any metadata (data about the data) associated with the record that could make it more readily usable?
That's a pretty good list, there. How are you going to tackle it?
Build or buy?
It wasn't so long ago that do-it-yourself was a common organizational approach to data archive and purge (if, indeed, an organization did anything at all about it). Then came a truckload of legislated data retention and protection rules, a lot of mergers and acquisitions, more complex criteria concerning the deletion of data from operational databases, and initiatives around simplifying and rationalizing IT infrastructure. You can still go the roll-your-own route, but that order has become a lot taller of late.
Want to explore vendor offerings? The demand for robust and flexible data archive and purge solutions has increased considerably in recent years, and a number of companies compete for that business (IBM is a big player in the market, with its Optim line of information life cycle management products). An investment in a packaged archive and purge solution might offer an attractive payback for your organization.
Get your data right-placed
Is your database like a hotel where records check in, but never check out? That situation can be pretty sweet for the people who sell you servers and storage, but it sure doesn't help your company's bottom line. Get a handle on your data, figure out what doesn't belong in your operational databases, and implement a process that will move the cold data to a safe, secure, and economical archive. You've taken care of those old records for a long time. It's time they got a place of their own.
Dig deeper into Information management on developerWorks
Get samples, articles, product docs, and community resources to help build, deploy, and manage your cloud apps.
Experiment with new directions in software development.
Software development in the cloud. Register today to create a project.
Evaluate IBM software and solutions, and transform challenges into opportunities.