Introduction: Requirement for validation of DataStage jobs
Businesses today are leading information-centric projects to transform their business and deliver cost savings. Many data integration or information integration applications or processes involve ETL as one of the components.
Typically an ETL process (unit of work) is designed to accomplish the following:
- Extract: Extract the data from the source system and cleanse it.
- Transform: Transform the data into a desired format that can be consumed in the next step. Typically this would involve applying the core business logic to convert data into information.
- Load: Loading the data, typically into a database table/warehouse, to be used by a reporting engine for deriving insights out of the transformed data.
There are two common life cycles that a job in a data integration application undergoes
- Porting / Migration of jobs from an older version to a newer version of the DataStage software or hardware running it.
- Movement of jobs from Development to Testing to Production environments.
Both of the above use cases require validation of a large number of DataStage jobs. Enterprises typically validate that their jobs running in a new version of the software or new hardware environment are producing same results as before, giving them confidence that the new system can replace the old system. In a similar way, before a job in the Data Integration process is deployed in the production environment, it must be shown to have the expected behavior in Development, Testing and Production environments.
In this article, a step by step example is given on how DataStage users can use IBM InfoSphere Optim Test Data Management Solution to validate results of ETL jobs.
Using Optim Test Data solution with DataStage
During the validation process for DataStage jobs, Optim Test Data Solution can be used to
- Generate test data
- Compare the job output with an expected or benchmark output
The DataStage job will refer to the generated test data as the source of input data during the validation process. Once the DataStage job is executed, the comparison step is performed to validate the resulting output.
The workflow could be represented as shown in Figure 1.
Figure 1. Workflow for validating DataStage jobs using Optim TDM
In the subsequent sections, you will see an example how to generate test data by a DataStage job and then compare the final output with an already available expected output to validate the job.
For the purpose of this illustration the following software was used.
- IBM InfoSphere Information Server 8.7
- IBM InfoSphere Optim 8.1
- IBM DB2 LUW 9.7
- Windows 2008
Any other database supported by both Information Server and Optim could also be used for the purpose even though this article uses DB2 for illustration purposes.
Sample DataStage job
In this example, you will see how the Optim Test Data solution is used to generate test data, and compares the final output for a DataStage job.
Consider a sample job that reads data from one table in DB2, sorts the data on a key, and loads the sorted data in another table in the database. The data is read by the source connector stage from the source database table [OPTIM_CUSTOMERS3], is passed to a sort stage and then loaded to a destination table using a target database connector stage [OPTIM_CUSTOMERS4], as shown in Figure 2.
Figure 2. Sample DataStage job
The source database connector properties point to data available in OPTIM_CUSTOMERS3. You can see the properties in Figure 3.
Figure 3. Sample job source DB connector properties
Figure 4 shows the relevant columns for the source table.
Figure 4. Sample job source DB column properties
Similarly, the target connector points to the database table OPTIM_CUSTOMERS4, and the output of the job is written to OPTIM_CUSTOMERS4.
The aim is to have some test data generated using Optim and use that to verify the output of the job as part of regression testing of the job.
Subsequent examples will include how to configure Optim to set up test data and benchmark data so that during the regression testing, the ETL job could read data from the test table and compare the output against the benchmark data.
The next few sections give descriptions for achieving the Optim related setup.
Optim installation and configuration
Install Optim and configure an Optim directory (repository for all the extract definitions, compare requests, and so on). Follow Optim documentation for the purpose of installation and configuration.
For the purpose of this example, the parameter values shown in Table 1 are considered.
Table 1. Parameter values
Verifying the configuration
You can verify the configurations by performing the following the steps.
- Go to the Optim main window, as shown in Figure 5.
Figure 5. Optim main window
- Click File > Optim Directory. You
will get a screen as shown in Figure 6, which shows the value of the
Figure 6. Optim Directory with Connect option
- Click Connect and you can view the settings shown in
Figure 7. Optim Directory with Disconnect option
- Now, click View to see the settings for the Optim
Directory, as shown in Figure 8.
Figure 8. Optim Directory editor
- The Connection tab shows the connection string, as
shown in Figure 9.
Figure 9. Optim Directory Editor to view the connection string
- The Server tab shows the details about the collection
name, as shown in Figure 10.
Figure 10. Optim Directory Editor to view the server details
Setting up test data
The following steps show how to set up a table with test data to test a DataStage job.
Setting up an extract definition for generating test data
- Go back to the Optim main screen, as shown in Figure 11.
Figure 11. Optim main screen
- Click Action > Extract. The
Extract Request Editor lets you design the extract request. The
extract request name in this case is TEST3.EXT1, as shown in Figure
Figure 12. Extract Request Editor for extracting test data
You have an option to choose if you want to extract only a limited set of rows (say 500) for the purpose of setting up test data.
Now you need to set up the access definitions.
Setting up an access definition for generating test data
Perform the following steps to set up an Access Definition for generating test data.
- Open the Extract Request Editor screen, then click
File > Edit Access
Definitions, as shown in Figure 13.
Figure 13. Access Definition Editor during extraction of test data
The Access Editor could also be used to filter out rows that are not required as part of test data. For example, every nth row could be extracted.
- Click File > Update and Return on the Access Editor to go back to the Extract Editor.
Filtering out columns to be loaded or compared
Using table map editors and column map editors: One of the aims for designing convert requests using table map editors and column map editors is to filter out columns which are not required for the final comparison. The following options are for filtering out such columns.
- Do not extract the columns in the test data: Set up a convert request at the time of the extract request for test and benchmark data.
- Do not load the columns into the test and benchmark tables at the time of setting up the load request.
- Filter out the columns at the time of designing the compare request [request to compare the output with the benchmark data].
There might be some scenarios where a user wants to filter out columns to set up multiple test cases. For example, the first scenario to be tested could require the comparison of columns A,B,C, and another scenario could require comparison of columns D,E,F. In such scenarios, the user might want to extract all the columns and then set up multiple load requests to load different columns in different tables. Also, at the time of the comparison, a user could choose not to compare certain columns if the data is such that the data for those columns will never match (for example, timestamps).
Executing the Extract definition
- From the Extract Editor, click File >
Run. The Extract Request Progress screen appears,
which shows the progress of the extract process, as shown in Figure
Figure 14. Extract Request Execution
After this step, an extract file is created on the disk. This extract file is a dump of the data from the table in a specific format which can be further processed using Optim’s utilities. The file [test3.xf in this example] gets created in the path specified in the preferences for datadir (C:\IBM Optim\RT\BIN\datadir in this example).
- From the main Optim window, click Options >
Personal. The Personal Option editor appears as
shown in Figure 15.
Figure 15. Personal Options
A report is generated showing the details of the extract. The next step is to load the test data into a test data table. For loading the test data, a load request needs to be set up.
Loading the test data into a test database table
- From the main Optim screen, click Actions >
Load, as shown in Figure 16.
Figure 16. Load option in Optim main screen
- The Load Request Editor Options screen appears. In
this example, the data from the extract file test3.xf needs to be
loaded into the table [OPTIM_CUSTOMERS6], and this table will act as
the test data table, as shown in Figure 17.
Figure 17. Optim Load Request Editor options
You will have to load the data from test3.xf, which was set up during the extract phase.
Setting up table and column maps during load of test data
The next step is to set up a table map to, in turn, set up the destination table, and then a column map to load the relevant data into the destination table, as shown in Figure 18.
Figure 18. Table and column map during load of test data
Optionally, click the Open Column Map option to set up the column map (to filter out columns which a user does not want to load as part of test data or just save the column map that is presented). The default table definition will be based on the table definition in the extract file, test3.xf in this case.
Once all the column map and table map definitions are updated, click File > Run from the Load Map Editor.
Executing the load request to load test data
Click File > Run to execute the load request from the Load Request Editor, as shown in Figure 19.
Figure 19. Executing load request from Load Request Editor
Test data gets loaded into the test data table OPTIM_CUSTOMERS6, which will be used by the DataStage job for the purpose of accessing test data during regression testing.
The execution of the load request will also create the destination table in case it does not exist.
Creating the destination table for test data
You will be prompted to create the destination table [OPTIM_CUSTOMERS6 in this case] if the table does not exist, as shown in Figure 20.
Figure 20. Creating the destination table for test data
Running the create step shows the SQL used to create the destination table. If you want to create the test data table with a different schema than the actual production table, you can modify the SQL presented and generate the table with a subset of columns, as shown in Figure 21.
Figure 21. SQL for create table (test data table)
Click Proceed to create the table and the associated keys.
After the table creation, the data gets loaded, as shown in Figure 22.
Figure 22. Load Request Progress window
Setting up the expected output
Setting up the expected output (or the benchmark data) will also involve steps similar to the steps involved in setting up the test data (extract and load the benchmark data).
Extracting the benchmark data
In this case, the benchmark data is extracted from OPTIM_CUSTOMERS4 and loaded into OPTIM_CUSTOMERS7.
- Go to the Optim main screen. Click Action >
Extract. The Extract Request Editor appears where
you can design the extract request. The extract request name in this
case is test4.xf, as shown in Figure 23.
Figure 23. Extract definition for benchmark data
- To set up an access definition for generating benchmark data, open the
Extract Request Editor, click File > Edit
Access Definitions. The Access Definition Editor appears,
as shown in Figure 24.
Figure 24. Access definition for benchmark data
- Click File > Run to execute the
load request from the Load Request Editor, as shown in Figure 25.
Figure 25. Load Request for benchmark data
- Figure 26 shows the Table Map Editor for the load request report of
the benchmark data.
Figure 26. Table Map Editor for load request report of benchmark data
- You can view the Load Process Report, as shown in Figure 27.
Figure 27. Load Process Report
After setting up both the test and benchmark data, the next step is to set up a compare request that will compare the output data generated by the execution of an ETL job during regression testing with the benchmark data.
Setting up compare request
Create the table that will be populated by the output of the DataStage job. A compare request needs both of the tables involved to exist at design time, as shown in Figure 28.
Figure 28. Choosing Compare from the Optim main window
Invoking the Compare Request Editor
- From the main Optim window, click Actions > Compare to design a compare request.
- In the example shown in Figure 29, you can compare the following two
- OPTIM_CUSTOMERS7: which has the benchmark data.
- OPTIM_CUSTOMERS8: which will get populated once the job is run with the test data set up in OPTIM_CUSTOMERS4.
Figure 29. Compare Request Editor
- Edit the Column Map Editor if you want to filter out columns, which
need not be involved in the comparison of the output of the job with
the benchmark data, as shown in Figure 30.
Figure 30. Column Map Editor for compare request
Setting up the report request for comparison results
A report request can be set up to specify the destination of the final comparison report, as shown in Figure 31.
Figure 31. Report Request Editor for final comparison results
Save the compare request to be invoked later from the command line after the DataStage job is executed using the test data.
Executing the DataStage job with test data
Edit the DataStage job to use test data
In this sample job, edit the connector properties for the source stage to point to test data.
The source connector stage is modified to read data from OPTIM_CUSTOMERS6, and modify the target connector stage to write output to OPTIM_CUSTOMERS8.
If the job is appropriately parameterized to take the connection properties as job parameters, then you would not need to change the stage properties and recompile the job.
Setting up a command line to invoke Optim Compare Request
The following is a sample command line invocation of the compare request.
c:\ibmoptim\rt\bin\pr0cmnd" /r type=compare request=test22.cmp output=cmpreq.txt
Job invocation with test data and execution of comparison command
The job is executed to compare the output (OPTIM_CUSTOMERS8) to the benchmark data (OUTPUT_CUSTOMERS7).
Verifying the results of job execution
At this time, the compare request can be executed from the command line. A sample of the summary output report generated by the comparison request invocation is shown in Figure 32.
Figure 32. Sample output from execution of comparison command
The comparison could optionally be executed from Optim also. The comparison report could also be viewed using the report editor from Optim which will give a detailed view of the mismatches.
Job validation during a job migration phase as part of regular development lifecycle, or version to version migration requires tools to set up test data and comparisons of generated output. The IBM Optim TDM solution has capabilities in both areas and can be exploited by DataStage for the purpose of job validation. Users can exploit the capabilities of Optim to compare complex constructs stored in databases for the purpose of validating the final output.
- Refer to the IBM InfoSphere Optim Solutions Information Center for more information on Optim.
- Refer to the IBM InfoSphere Information Server Information Center for more information on InfoSphere.
- Refer to the developerWorks article "Enable C++ applications for Web services using XML-RPC" (developerWorks, Jun 2006).
- In the XML area on developerWorks, get the resources you need to advance your XML skills, including DTDs, schemas, and XSLT.
- Visit the developerWorks Information Management zone to find more resources for DB2 developers and administrators.
- Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
- Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and tools as well as IT industry trends.
- Follow developerWorks on Twitter.
- Watch developerWorks on-demand demos ranging from product installation and setup demos for beginners, to advanced functionality for experienced developers.
Get products and technologies
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
- Participate in the discussion forum.
- Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.