The first principle of effective data governance is to understand your data. The better you understand the structure and nature of your data, the more effective and flexible you can be in governing it. This article deals specifically with the knowledge you need to have before you can effectively execute test data management, data privacy, application retirement, and data growth projects with InfoSphere Optim.
The series is presented in two parts. Part 1 outlines what InfoSphere Optim needs to know about your data and why, and provides important introductory material on the concept of a complete business object. Readers new to Optim will find this background useful for justifying the processes and tools explained in Part 2. For more advanced readers, Part 2 surveys the processes and tools available to help you discover this information when it is unknown, and serves as an overview of the IBM tools currently available for business object creation and relationship discovery. This area of InfoSphere Optim is constantly evolving; Part 2 discusses the products available at the time of publication.
The concept of the complete business object
InfoSphere Optim processes operate on a fundamental metadata definition
called a complete business object. Complete business objects are simply
lists of tables and the relationships between those tables. They normally
define a single business process. For example, one of the most commonly
implemented complete business objects for PeopleSoft deployments is the
payroll business object, because PeopleSoft is a
Human Resources Information System (HRIS) and payroll is one of the most
data-intensive processes in an HRIS.
Normally a single application has many different business objects associated with it because the application manages multiple processes for the organization. It's also worth noting that these complete business objects are tied to the application, not the data source. That means that if the application spans multiple data sources, then your business object will as well.
Another feature of complete business objects is that they include reference tables and transactional tables. In data mart and warehouse terminology, business objects include both the dimension and fact tables. The goal is to capture all the data related to a business process, whether that data is directly associated with an individual transaction, or provides additional context for that transaction. The following sections cover how the complete business object is central to each solution of InfoSphere Optim, as well as examining the problems associated with complete business objects.
Business objects in data growth systems
The concept of a complete business object is implemented in InfoSphere Optim using a metadata definition called an access definition. It is used in every process related to archiving systems. For data extraction through an InfoSphere Optim Archive Request, the complete business object is used to create an immutable historical snapshot of the data at a point in time. Not only is the transactional data archived, but so is the reference data that gives the transactional data context. This context is crucial in order to give the transactional data meaning when the data in the archives is recalled and reported on at a later point in time. For example, look at the simplified but well known example of the Optim sample database shown in Figure 1.
Figure 1. The InfoSphere Optim sample database schema
In this example, the complete business object reflects the business process of an order being placed. A customer can have many orders, and each order can contain many items. Each item can be ordered many times. Details acts as a junction (or intersection) table between orders and items to model a many-to-many relationship.
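The schema described above can be sketched in miniature. The table names follow the article, but the column names and types below are illustrative assumptions, not the exact definitions from the Optim sample database:

```python
import sqlite3

# A minimal sketch of the order business object from Figure 1.
# Column names are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Reference (dimension) tables: context for the transactions.
CREATE TABLE CUSTOMER (
    cust_id   INTEGER PRIMARY KEY,
    cust_name TEXT
);
CREATE TABLE SALES_PERSON (
    sales_id  INTEGER PRIMARY KEY,
    name      TEXT
);
CREATE TABLE ITEM (
    item_id   INTEGER PRIMARY KEY,
    item_desc TEXT
);
-- Transactional table: one customer can have many orders.
-- ORDER is a reserved word in SQL, so it must be quoted.
CREATE TABLE "ORDER" (
    order_id  INTEGER PRIMARY KEY,
    cust_id   INTEGER REFERENCES CUSTOMER(cust_id),
    sales_id  INTEGER REFERENCES SALES_PERSON(sales_id)
);
-- Junction (intersection) table modeling the many-to-many
-- relationship between orders and items.
CREATE TABLE DETAILS (
    order_id  INTEGER REFERENCES "ORDER"(order_id),
    item_id   INTEGER REFERENCES ITEM(item_id),
    quantity  INTEGER,
    PRIMARY KEY (order_id, item_id)
);
""")
```

The complete business object is essentially this list of five tables plus the foreign-key relationships declared between them.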
The tables that create a data growth problem are going to be the transactional tables. These are the ORDER and DETAILS tables shown in Figure 2. They grow at very high rates and will have a disproportionately high number of records inside them. Even so, you would archive (but not delete) all the related data from ITEM, CUSTOMER, and SALES_PERSON, because these tables provide important context for the transactional data.
Figure 2. A diagram showing the transaction and reference tables for the order business object
The same reasoning applies if the data, or a subset of it, ever needs to be restored to another location. The complete business object ensures that the full context of the data can be restored and all relationships maintained. Without the complete business object, much of the restored data would be rendered meaningless.
In addition to this, Optim organizes archives by the complete business object. Each business object gets its own set of archive files. This allows you to manage retention periods and the data lifecycle based on these complete business objects as the data ages. Organizing archives based on business objects allows you to define different storage tiers and storage policies according to the business process involved.
Deletions occur after an archive, and normally only on the transaction tables. The good news is that the transaction tables are the tables that cause data growth issues. So, in the previous example, deletion would occur on the ORDER and DETAILS tables. Reference data is generally not deleted because it is still active. A customer may come back and make a new order; or an item in the ITEM table could be purchased again, and so on. This makes it impractical to delete from these tables.
The data in ORDER and DETAILS is unlikely to change after a certain period. That period translates into an archive window, which in turn defines the selection criteria. Having the business object defined allows InfoSphere Optim to move upward in the hierarchy toward parent tables during deletion. So, as shown in Figure 3, Optim would delete from DETAILS first and then from ORDER, ensuring that referential integrity is maintained. Without the complete business object definition, including the list of tables and the relationships between them, Optim would not have enough information to ensure that this referential integrity is maintained.
Figure 3. Order of deletion if the DETAILS and ORDER tables are marked for data deletion
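The child-first deletion order can be illustrated with a small sketch. This is not Optim's actual delete processing; the simplified schema, column names, and archive-window cutoff below are assumptions for illustration:

```python
import sqlite3

# A minimal sketch of child-first deletion under foreign-key
# constraints: DETAILS rows are removed before their parent
# ORDER rows, so no orphaned child row is ever left behind.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE "ORDER" (
    order_id   INTEGER PRIMARY KEY,
    order_date TEXT
);
CREATE TABLE DETAILS (
    order_id INTEGER REFERENCES "ORDER"(order_id),
    item_id  INTEGER,
    PRIMARY KEY (order_id, item_id)
);
INSERT INTO "ORDER" VALUES (1, '2009-01-15'), (2, '2012-06-01');
INSERT INTO DETAILS VALUES (1, 10), (1, 11), (2, 10);
""")

# Selection criteria derived from a hypothetical archive window:
# anything older than the cutoff has already been archived.
cutoff = "2010-01-01"

# Delete children (DETAILS) first, then parents (ORDER).
conn.execute(
    'DELETE FROM DETAILS WHERE order_id IN '
    '(SELECT order_id FROM "ORDER" WHERE order_date < ?)', (cutoff,))
conn.execute('DELETE FROM "ORDER" WHERE order_date < ?', (cutoff,))
conn.commit()
```

Reversing the two DELETE statements would fail under enforced foreign keys, which is exactly why the tool needs the relationship definitions before it can delete safely.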
Business objects in test data management and data masking systems
The primary purpose of complete business objects in test data management is to allow InfoSphere Optim to create relationally intact subsets of data. These subsets let you build test databases that are only as large as your testing purposes require, and no larger, so you can do away with the dozens of production clones that are otherwise difficult to manage.
InfoSphere Optim requires the complete business object to be defined in order to create effective subsets. For instance, if Optim understands the relationships in the data model, it is able to extract every 20th ORDER record and the data related to every 20th ORDER, and no more. It maintains referential integrity by moving through the model you create and picking up all related data. This helps ensure that orphans are not created, which may cause the application module being tested to fail unintentionally. Just as important, Optim leaves behind any unrelated data, decreasing the size of the test data set. Without the creation of a complete business object, this kind of intelligent decision making about what data to include or exclude would not be possible.
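The "every 20th ORDER" traversal can be sketched as follows. The schema and the modulus-based start-table selection are simplified assumptions; the point is that the subset starts from selected ORDER rows and then follows the defined relationships in both directions to pick up related rows and nothing else:

```python
import sqlite3

# A minimal sketch of relationship-driven subsetting: pick every
# 20th ORDER row, then traverse the model to collect only the
# related parent (CUSTOMER) and child (DETAILS) rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER (cust_id INTEGER PRIMARY KEY);
CREATE TABLE "ORDER" (
    order_id INTEGER PRIMARY KEY,
    cust_id  INTEGER REFERENCES CUSTOMER(cust_id)
);
CREATE TABLE DETAILS (
    order_id INTEGER REFERENCES "ORDER"(order_id),
    item_id  INTEGER
);
""")
# Populate 100 customers, each with one order and one detail row.
for i in range(1, 101):
    conn.execute("INSERT INTO CUSTOMER VALUES (?)", (i,))
    conn.execute('INSERT INTO "ORDER" VALUES (?, ?)', (i, i))
    conn.execute("INSERT INTO DETAILS VALUES (?, ?)", (i, i % 7))

# Start table: every 20th ORDER row.
subset_orders = [r[0] for r in conn.execute(
    'SELECT order_id FROM "ORDER" WHERE order_id % 20 = 0')]
ph = ",".join("?" * len(subset_orders))

# Traverse down to children and up to parents of the selected rows.
subset_details = conn.execute(
    f"SELECT order_id, item_id FROM DETAILS WHERE order_id IN ({ph})",
    subset_orders).fetchall()
subset_customers = conn.execute(
    f'SELECT DISTINCT cust_id FROM "ORDER" WHERE order_id IN ({ph})',
    subset_orders).fetchall()
```

The resulting subset contains 5 orders, their 5 detail rows, and only the 5 customers those orders reference: relationally intact, with no orphans and no unrelated data.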
In terms of data masking, understanding and modeling the relationships in the complete business object allows you to propagate masked values through the data model when the sensitive data acts as a key. For instance, in the sample schema shown in Figure 1, if the CUSTOMER table's key were a social security number, it could be propagated into the ORDER table, where it acts as a foreign key reference to the CUSTOMER table. Without an understanding of the business object and its relationships, masking the data would create orphans, which again would make the application fail unintentionally.
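Consistent key propagation can be sketched as below. The hash-based masking function, the SSN-keyed schema, and the column names are all illustrative assumptions (real masking products typically use format-preserving or lookup-based transforms); the essential point is that the same masked value replaces the key everywhere it appears:

```python
import hashlib
import sqlite3

# A hypothetical deterministic masking function: the same input SSN
# always yields the same masked SSN, so parent and child stay in sync.
def mask_ssn(ssn: str) -> str:
    digest = hashlib.sha256(ssn.encode()).hexdigest()
    digits = "".join(c for c in digest if c.isdigit())[:9].ljust(9, "0")
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"

# Foreign keys are deliberately not enforced here; in practice the
# parent/child updates would be coordinated or constraints deferred.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER (ssn TEXT PRIMARY KEY, name TEXT);
CREATE TABLE "ORDER" (order_id INTEGER PRIMARY KEY,
                      cust_ssn TEXT REFERENCES CUSTOMER(ssn));
INSERT INTO CUSTOMER VALUES ('123-45-6789', 'Alice');
INSERT INTO "ORDER"  VALUES (1, '123-45-6789');
""")

# Mask the key in the parent table, then propagate the identical
# masked value into the child's foreign-key column to avoid orphans.
for (ssn,) in conn.execute("SELECT ssn FROM CUSTOMER").fetchall():
    masked = mask_ssn(ssn)
    conn.execute("UPDATE CUSTOMER SET ssn = ? WHERE ssn = ?",
                 (masked, ssn))
    conn.execute('UPDATE "ORDER" SET cust_ssn = ? WHERE cust_ssn = ?',
                 (masked, ssn))
conn.commit()
```

If only the CUSTOMER table were masked, every ORDER row would point at a value that no longer exists, which is exactly the orphan problem the business object definition prevents.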
Business objects in application retirement systems
Application retirement is the least compelling use case for defining a complete business object. Most often, the objective of application retirement is to archive all the data from a system while maintaining access to that data, and this can be done with only a modest understanding of the relationships in the application being retired.
Even so, use caution if you decide not to create complete business objects to base your archives on in application retirement scenarios. An understanding of the relationships in the data is required for effective use of the data post-retirement. Any reports created after the archive (for an e-discovery request, for example) rely on knowledge of the relationships in the archived data to make the results meaningful. At the same time, in the application retirement scenario you are losing the application itself, which may be an important source of information about those relationships. It is therefore a better practice to create the relationships and business objects for application retirement scenarios as well; doing so makes your Optim archives self-documenting.
Aside from this, it's worth mentioning again that quite often some of the retired data has different retention and access requirements than other data in the same application. Not all data in legacy data sources is equally useful or should be retained for the same period of time. Dividing the archives into business processes helps you define retention periods and storage tiers effectively, but to accomplish this you have to define the boundaries between your business objects.
The problems with complete business objects
As you can see, the way InfoSphere Optim uses a complete business object is a major advantage. InfoSphere Optim uses processes that can make intelligent decisions about extraction, deletion, and insertion of data based on a thorough understanding of the relationships in the data. Even so, relying on the concept of a complete business object does create the following two problems.
- The relationship definition problem: What if you do not know what the relationships are in the data?
- The boundary definition problem: Even if you know the relationships, how do you know when a business object begins and ends?
You can use the tools and techniques discussed in Part 2 of this article series to solve these problems.
It is worth noting that not all of the business objects in an application require modeling. As shown in Figure 4, the benefits of test data management and data growth management grow quickly with the amount of data contained within a business object, but the cost of modeling a business object has less to do with the amount of data and more to do with the complexity of the data model. A business object with a small amount of data may have just as many tables and relationships as a business object with a large amount of data. This drives many customers to apply test data management and archiving only to the most data-intensive application objects. So, while the payroll business object may get the full and intensive archiving and test data management treatment, the succession planning and collective bargaining objects might not, because they contain dramatically fewer records and the benefits would not be worth the effort.
Figure 4. Cost and benefit of test data management/archiving relating to amount of data
This article, Part 1 of a series, took a look at the Optim central metadata definition called the complete business object. It discussed the business object and how it provides value to Optim. It also outlined the relationship and table boundary issues, which are the two fundamental problems in complete business object definition. Part 2 of this article series will examine the tools available to help solve these problems. With these tools, you can dramatically reduce the amount of time required for data modeling in your Optim project.
- Thanks to Polly Lau and Kitman Cheung at IBM Toronto Lab for their continued guidance on Optim.
- Thanks to Vineet Goel at IBM with the Optim team for his guidance on the Optim Application Repository Analyzer.
- Thanks to Connie Chan at IBM for her guidance on InfoSphere Discovery and for reviewing the article.
- Thanks to Matt Simons, Greg Marshall, and David Slater at Information Insights for the discussions we had on project-related problems with data modeling for InfoSphere Optim. Special thanks to Matt Simons for reviewing the article.
- Use an RSS feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- Get the information you need in the InfoSphere Data Architect Information Center to learn more about its relationship discovery capabilities.
- Read the developerWorks article on InfoSphere Data Architect (then known as Rational Data Architect) to learn more about Relationship Discovery.
- Check out the developerWorks article that provides an overview of InfoSphere Discovery capabilities, not just relationship discovery.
- The Optim Solutions Library is a fantastic resource and a central point for general information on InfoSphere Optim and related products.
- The InfoSphere Discovery Information Center has more information on best practices and methodologies for using the tool for a relationship discovery use case.