IBM InfoSphere DataStage job validation steps using IBM Optim Test Data Management Solution

Large numbers of extract, transform, and load (ETL) jobs need to be validated during two common job lifecycle scenarios: migrating ETL projects and jobs from an older version to a newer version, and moving jobs from development to QA to production. Enterprises typically validate that their jobs running in a new version of the software, or in a new hardware environment, produce the same results as before, giving them confidence that the new system can replace the old one. Similarly, before a job in the data integration process is deployed in the production environment, it must be shown to have the expected behavior in the development, testing, and production environments. This article gives a step-by-step example of how DataStage users can use the IBM InfoSphere Optim Test Data Management Solution to validate the results of ETL jobs.

Ajay Sood (r1sood@in.ibm.com), Software Architect, IBM  

Ajay Sood is a Software Architect at IBM Software Labs in India. He has been with IBM for more than 15 years. He is currently part of the IBM InfoSphere Information Server product development team with a focus on DataStage components, and he has also been part of other product development teams such as TXSeries-CICS and DB2 DataLinks. His experience includes work in the areas of information management, transaction processing, middleware, networking, and system programming on UNIX platforms.



28 March 2013

Introduction: Requirement for validation of DataStage jobs

Businesses today are leading information-centric projects to transform their business and deliver cost savings. Many data integration or information integration applications or processes involve ETL as one of the components.

Typically an ETL process (unit of work) is designed to accomplish the following:

  • Extract: Extract the data from the source system and cleanse it.
  • Transform: Transform the data into a desired format that can be consumed in the next step. Typically this would involve applying the core business logic to convert data into information.
  • Load: Loading the data, typically into a database table/warehouse, to be used by a reporting engine for deriving insights out of the transformed data.

There are two common life cycles that a job in a data integration application undergoes:

  • Porting or migration of jobs from an older version to a newer version of the DataStage software, or to new hardware running it.
  • Movement of jobs from Development to Testing to Production environments.

Both of the above use cases require validation of a large number of DataStage jobs. Enterprises typically validate that their jobs running in a new version of the software or a new hardware environment are producing the same results as before, giving them confidence that the new system can replace the old system. Similarly, before a job in the data integration process is deployed in the production environment, it must be shown to have the expected behavior in the development, testing, and production environments.

In this article, a step-by-step example shows how DataStage users can use the IBM InfoSphere Optim Test Data Management Solution to validate the results of ETL jobs.


Using Optim Test Data solution with DataStage

During the validation process for DataStage jobs, the Optim Test Data Solution can be used to:

  • Generate test data
  • Compare the job output with an expected or benchmark output

The DataStage job will refer to the generated test data as the source of input data during the validation process. Once the DataStage job is executed, the comparison step is performed to validate the resulting output.

The workflow could be represented as shown in Figure 1.

Figure 1. Workflow for validating DataStage jobs using Optim TDM
This figure describes the high level steps that need to be performed to validate DataStage jobs using Optim Test Data Management feature
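
Assuming the extract, load, and compare requests described in the remainder of this article have been saved in the Optim directory, the whole cycle could be scripted roughly as follows. This is only a sketch: the request, project, and job names are placeholders, and the type=extract and type=load keywords for Optim's pr0cmnd utility are inferred from the type=compare invocation shown later in this article, so verify them against your Optim documentation.

REM 1. Generate the test data from saved Optim extract and load requests
c:\ibmoptim\rt\bin\pr0cmnd /r type=extract request=TEST3.EXT1
c:\ibmoptim\rt\bin\pr0cmnd /r type=load request=TEST3.LOAD1
REM 2. Run the DataStage job against the test tables using dsjob, the
REM    DataStage command-line interface (project and job names are examples)
dsjob -run -jobstatus dstage1 SortCustomers
REM 3. Compare the job output with the benchmark data
c:\ibmoptim\rt\bin\pr0cmnd /r type=compare request=test22.cmp output=cmpreq.txt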

In the subsequent sections, you will see an example of how to generate test data for a DataStage job and then compare the final output with an already available expected output to validate the job.


Software prerequisites

For the purpose of this illustration, the following software was used.

  • IBM InfoSphere Information Server 8.7
  • IBM InfoSphere Optim 8.1
  • IBM DB2 LUW 9.7
  • Windows 2008

Any other database supported by both Information Server and Optim could also be used, even though this article uses DB2 for illustration purposes.


Sample DataStage job

In this example, you will see how the Optim Test Data solution is used to generate test data and to compare the final output for a DataStage job.

Consider a sample job that reads data from one table in DB2, sorts the data on a key, and loads the sorted data into another table in the database. The data is read by the source connector stage from the source database table [OPTIM_CUSTOMERS3], passed to a sort stage, and then loaded into a destination table using a target database connector stage [OPTIM_CUSTOMERS4], as shown in Figure 2.

Figure 2. Sample DataStage job
This figure illustrates the sample DataStage job used as an example in the article
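
Logically, the job behaves like the following DB2 command-line statement, where CUST_ID is an assumed name for the sort-key column (rows in a relational table carry no inherent order, so the ORDER BY only describes the sequence in which rows are written; if your DB2 level rejects ORDER BY inside an INSERT, the sorting simply remains inside the DataStage job):

db2 "INSERT INTO OPTIM_CUSTOMERS4 SELECT * FROM OPTIM_CUSTOMERS3 ORDER BY CUST_ID"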

The source database connector properties point to data available in OPTIM_CUSTOMERS3. You can see the properties in Figure 3.

Figure 3. Sample job source DB connector properties
This figure illustrates the connection properties for the sample DataStage job used as an example in the article

Figure 4 shows the relevant columns for the source table.

Figure 4. Sample job source DB column properties
This figure illustrates the column properties for the input stage of the sample DataStage job used as an example in the article

Similarly, the target connector points to the database table OPTIM_CUSTOMERS4, and the output of the job is written to OPTIM_CUSTOMERS4.

The aim is to generate some test data using Optim and use it to verify the output of the job as part of regression testing.

Subsequent sections show how to configure Optim to set up test data and benchmark data so that, during regression testing, the ETL job can read data from the test table and its output can be compared against the benchmark data.

The next few sections describe how to achieve the Optim-related setup.


Optim installation and configuration

Install Optim and configure an Optim directory (the repository for all the extract definitions, compare requests, and so on). Follow the Optim documentation for installation and configuration.

For the purpose of this example, the parameter values shown in Table 1 are used.

Table 1. Parameter values
Parameter          Value
Optim directory    OPTIMDIR
TEST DATABASE      OPTIMDB
CONNECTION STRING  OPTIMDB
USER               Administrator
OPTIMDB Alias      OPDB1
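
If the alias OPDB1 does not exist yet, it can be cataloged and tested from a DB2 command window. The following is a minimal sketch, assuming OPTIMDB is a local DB2 database and Administrator is an operating-system user known to DB2 (you are prompted for the password):

REM catalog a second alias for the local database OPTIMDB
db2 catalog database OPTIMDB as OPDB1
REM verify that Optim will be able to connect through the alias
db2 connect to OPDB1 user Administrator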

Verifying the configuration

You can verify the configurations by performing the following steps.

  1. Go to the Optim main window, as shown in Figure 5.
    Figure 5. Optim main window
    This figure shows a screenshot of the main window that is launched on the invocation of Optim
  2. Click File > Optim Directory. You will get a screen as shown in Figure 6, which shows the value of the Optim directory.
    Figure 6. Optim Directory with Connect option
    This figure shows the Optim Directory Configuration window
  3. Click Connect and you can view the settings shown in Figure 7.
    Figure 7. Optim Directory with Disconnect option
    This figure shows the Optim Directory configuration window after the user clicks the Connect button
  4. Now, click View to see the settings for the Optim Directory, as shown in Figure 8.
    Figure 8. Optim Directory editor
    This figure shows the database version and type in the Optim Directory configuration window
  5. The Connection tab shows the connection string, as shown in Figure 9.
    Figure 9. Optim Directory Editor to view the connection string
    This figure shows the connection string in the Optim Directory configuration window
  6. The Server tab shows the details about the collection name, as shown in Figure 10.
    Figure 10. Optim Directory Editor to view the server details
    This figure shows the collection name in the Optim Directory configuration window

Setting up test data

The following steps show how to set up a table with test data to test a DataStage job.

Setting up an extract definition for generating test data

  1. Go back to the Optim main screen, as shown in Figure 11.
    Figure 11. Optim main screen
    This figure shows the Optim main window with various possible actions
  2. Click Actions > Extract. The Extract Request Editor lets you design the extract request. The extract request name in this case is TEST3.EXT1, as shown in Figure 12.
    Figure 12. Extract Request Editor for extracting test data
    This figure shows the Optim extract editor with sample values used to set up the test data

    You can choose to extract only a limited set of rows (say, 500) for the purpose of setting up the test data.

Now you need to set up the access definitions.

Setting up an access definition for generating test data

Perform the following steps to set up an Access Definition for generating test data.

  1. Open the Extract Request Editor screen, then click File > Edit Access Definitions, as shown in Figure 13.
    Figure 13. Access Definition Editor during extraction of test data
    This figure shows the Optim access definition editor with sample values used to set up the test data
    The Access Editor could also be used to filter out rows that are not required as part of test data. For example, every nth row could be extracted.
  2. Click File > Update and Return on the Access Editor to go back to the Extract Editor.

Filtering out columns to be loaded or compared

Using table map editors and column map editors: One of the aims of designing convert requests using table map editors and column map editors is to filter out columns that are not required for the final comparison. The following options are available for filtering out such columns.

  • Do not extract the columns in the test data: Set up a convert request at the time of the extract request for test and benchmark data.
  • Do not load the columns into the test and benchmark tables at the time of setting up the load request.
  • Filter out the columns at the time of designing the compare request [request to compare the output with the benchmark data].

There might be some scenarios where a user wants to filter out columns to set up multiple test cases. For example, the first scenario to be tested could require the comparison of columns A, B, and C, and another scenario could require the comparison of columns D, E, and F. In such scenarios, the user might want to extract all the columns and then set up multiple load requests to load different columns into different tables. Also, at the time of the comparison, a user could choose not to compare certain columns if the data for those columns will never match (for example, timestamps).

Executing the Extract definition

  1. From the Extract Editor, click File > Run. The Extract Request Progress screen appears, which shows the progress of the extract process, as shown in Figure 14.
    Figure 14. Extract Request Execution
    This figure shows a screenshot of the Optim extract definition editor at the time of execution of the extract step to set up the test data
    After this step, an extract file is created on the disk. This extract file is a dump of the data from the table in a specific format which can be further processed using Optim’s utilities. The file [test3.xf in this example] gets created in the path specified in the preferences for datadir (C:\IBM Optim\RT\BIN\datadir in this example).
  2. From the main Optim window, click Options > Personal. The Personal Option editor appears as shown in Figure 15.
    Figure 15. Personal Options
    This figure shows the Optim personal options window with the sample personal option values used in this example. For example, the values indicate where the various extract files are stored.

A report is generated showing the details of the extract. The next step is to load the test data into a test data table. For loading the test data, a load request needs to be set up.

Loading the test data into a test database table

  1. From the main Optim screen, click Actions > Load, as shown in Figure 16.
    Figure 16. Load option in Optim main screen
    This figure shows the Optim main window with various possible actions
  2. The Load Request Editor Options screen appears. In this example, the data from the extract file test3.xf needs to be loaded into the table [OPTIM_CUSTOMERS6], and this table will act as the test data table, as shown in Figure 17.
    Figure 17. Optim Load Request Editor options
This figure gives a screenshot of the Optim Load Request Editor. This editor is used to set up a request to load test data into the database table.

You will have to load the data from test3.xf, which was set up during the extract phase.

Setting up table and column maps during load of test data

The next step is to set up a table map that, in turn, establishes the destination table, and then a column map to load the relevant data into the destination table, as shown in Figure 18.

Figure 18. Table and column map during load of test data
This figure gives a screenshot of the Optim table map editor used while setting up a load request. This editor is used to set up the table and column maps during the load of test data into the destination database table.

Optionally, click the Open Column Map option to set up the column map (to filter out columns that a user does not want to load as part of the test data, or simply to save the column map that is presented). The default table definition will be based on the table definition in the extract file, test3.xf in this case.

Once all the column map and table map definitions are updated, click File > Run from the Load Map Editor.

Executing the load request to load test data

Click File > Run to execute the load request from the Load Request Editor, as shown in Figure 19.

Figure 19. Executing load request from Load Request Editor
This figure shows a screenshot of how an Optim load map editor can be used to execute the load request. In this case, the load request is set up to load test data into the destination database table set up for the test data.

Test data gets loaded into the test data table OPTIM_CUSTOMERS6, which will be used by the DataStage job for the purpose of accessing test data during regression testing.

The execution of the load request will also create the destination table if it does not exist.

Creating the destination table for test data

You will be prompted to create the destination table [OPTIM_CUSTOMERS6 in this case] if the table does not exist, as shown in Figure 20.

Figure 20. Creating the destination table for test data
This figure gives a screenshot of the Optim create table editor that is displayed when the test data is being loaded using the Optim Load Request Editor. This window lets you specify the details of the table to be created.

Running the create step shows the SQL used to create the destination table. If you want to create the test data table with a different schema than the actual production table, you can modify the SQL presented and generate the table with a subset of columns, as shown in Figure 21.

Figure 21. SQL for create table (test data table)
This figure gives a screenshot of the SQL query that is displayed when the test data is being loaded using the Optim create table editor, which in turn is launched from the Load Request Editor.

Click Proceed to create the table and the associated keys.
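
For illustration, a trimmed version of such DDL could look like the following when issued from a DB2 command window. The column names and types here are hypothetical; the real definition comes from the table dictionary stored in test3.xf:

REM create a subset-column version of the test data table (hypothetical columns)
db2 "CREATE TABLE OPTIM_CUSTOMERS6 (CUST_ID CHAR(5) NOT NULL PRIMARY KEY, CUSTNAME VARCHAR(20), CITY VARCHAR(15))"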

After the table creation, the data gets loaded, as shown in Figure 22.

Figure 22. Load Request Progress window
This figure shows the screenshot that is displayed when the load request is in progress.

Setting up the expected output

Setting up the expected output (or the benchmark data) involves steps similar to those used to set up the test data (extract and load the benchmark data).

Extracting the benchmark data

In this case, the benchmark data is extracted from OPTIM_CUSTOMERS4 and loaded into OPTIM_CUSTOMERS7.

  1. Go to the Optim main screen. Click Actions > Extract. The Extract Request Editor appears, where you can design the extract request. The extract file in this case is test4.xf, as shown in Figure 23.
    Figure 23. Extract definition for benchmark data
    This figure shows a screenshot of the Optim Extract Request Editor with sample data to set up the extract request for the benchmark data.
  2. To set up an access definition for generating benchmark data, open the Extract Request Editor and click File > Edit Access Definitions. The Access Definition Editor appears, as shown in Figure 24.
    Figure 24. Access definition for benchmark data
    This figure shows a screenshot of the Optim Access Definition Editor with sample data to set up the access definitions for the expected output.
  3. After running the extract, set up and execute the load request: click File > Run from the Load Request Editor, as shown in Figure 25.
    Figure 25. Load Request for benchmark data
    This figure shows the Optim Load Request Editor filled in with sample data to load the benchmark data into the destination database table.
  4. Figure 26 shows the Table Map Editor for the load request of the benchmark data.
    Figure 26. Table Map Editor for load request report of benchmark data
    This figure shows a screenshot of the Optim load map editor required by the load request to set up the benchmark data.
  5. You can view the Load Process Report, as shown in Figure 27.
    Figure 27. Load Process Report
    This figure shows a screenshot of the final report generated after execution of the load request to set up the benchmark data in the database.

After setting up both the test and benchmark data, the next step is to set up a compare request that will compare the output data generated by the execution of an ETL job during regression testing with the benchmark data.


Setting up compare request

Create the table that will be populated by the output of the DataStage job. A compare request needs both of the tables involved to exist at design time, as shown in Figure 28.

Figure 28. Choosing Compare from the Optim main window
This figure shows the screen shot of the Optim main window with various possible actions.
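
One simple way to pre-create the job output table is to clone the structure of the existing production target, because the job writes the same columns. A sketch from the DB2 command line, assuming the output schema matches OPTIM_CUSTOMERS4:

REM create the empty job output table with the same column layout
db2 "CREATE TABLE OPTIM_CUSTOMERS8 LIKE OPTIM_CUSTOMERS4"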

Invoking the Compare Request Editor

  1. From the main Optim window, click Actions > Compare to design a compare request.
  2. In the example shown in Figure 29, you can compare the following two tables:
    • OPTIM_CUSTOMERS7: which has the benchmark data.
    • OPTIM_CUSTOMERS8: which will get populated once the job is run with the test data set up in OPTIM_CUSTOMERS6.
    Figure 29. Compare Request Editor
    This figure shows a screenshot of the Optim Compare Request Editor with sample values for setting up the compare request to let the user compare the test run output with the benchmark data already set up in the database table.
  3. Use the Column Map Editor if you want to filter out columns that need not be involved in the comparison of the job output with the benchmark data, as shown in Figure 30.
    Figure 30. Column Map Editor for compare request
    This figure shows a screenshot of the Optim column map editor launched via the Optim Compare Request Editor. This is used to set up the list of columns that need to be compared.

Setting up the report request for comparison results

A report request can be set up to specify the destination of the final comparison report, as shown in Figure 31.

Figure 31. Report Request Editor for final comparison results
This figure shows a screenshot of the Optim Report Request Editor. This is used to set up the location of the final comparison report between the test run data and the benchmark data.

Save the compare request to be invoked later from the command line after the DataStage job is executed using the test data.


Executing the DataStage job with test data

Edit the DataStage job to use test data

In this sample job, edit the connector properties so that the job runs against the test tables: modify the source connector stage to read data from OPTIM_CUSTOMERS6, and modify the target connector stage to write its output to OPTIM_CUSTOMERS8.

If the job is appropriately parameterized to take the connection properties as job parameters, then you do not need to change the stage properties and recompile the job.
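
For example, if the job defined hypothetical parameters named SrcTable and TgtTable for the source and target table names, the same compiled job could be redirected to the test tables at run time through dsjob, the DataStage command-line interface (the project name dstage1 and job name SortCustomers are placeholders):

dsjob -run -param SrcTable=OPTIM_CUSTOMERS6 -param TgtTable=OPTIM_CUSTOMERS8 -jobstatus dstage1 SortCustomers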

Setting up a command line to invoke Optim Compare Request

The following is a sample command line invocation of the compare request.

c:\ibmoptim\rt\bin\pr0cmnd /r type=compare request=test22.cmp output=cmpreq.txt

Job invocation with test data and execution of comparison command

The job is executed with the test data, and the comparison command is then run to compare the job output (OPTIM_CUSTOMERS8) with the benchmark data (OPTIM_CUSTOMERS7).
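
Putting the two steps together, a small batch script can run the job and, only if it completes, trigger the saved compare request. This is a sketch: the project and job names are placeholders, and the dsjob status codes noted in the comments are the conventional ones, so confirm them for your Information Server release.

REM run the job and wait; with -jobstatus the exit code reflects the job
REM status (typically 1 = finished OK, 2 = finished with warnings)
dsjob -run -jobstatus dstage1 SortCustomers
if errorlevel 3 goto jobfailed
REM the job finished, so compare its output with the benchmark data
c:\ibmoptim\rt\bin\pr0cmnd /r type=compare request=test22.cmp output=cmpreq.txt
goto done
:jobfailed
echo DataStage job did not finish cleanly - skipping the comparison
:done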


Verifying the results of job execution

At this time, the compare request can be executed from the command line. A sample of the summary output report generated by the comparison request invocation is shown in Figure 32.

Figure 32. Sample output from execution of comparison command
This figure shows a sample output after the execution of the Optim comparison request. In this case, the test run output of the sample job is compared with the benchmark data already set up using the extract and load processing of Optim.

The comparison can optionally be executed from Optim as well. The comparison report can also be viewed using the report editor in Optim, which gives a detailed view of the mismatches.
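
Because the command-line invocation writes its summary to cmpreq.txt, the outcome can also be reviewed directly from the same console session:

REM display the comparison summary produced by the compare request
type cmpreq.txt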


Conclusion

Job validation, whether during job movement through the regular development lifecycle or during version-to-version migration, requires tools to set up test data and to compare the generated output. The IBM Optim TDM solution provides capabilities in both areas, and DataStage users can exploit them for job validation, including the comparison of complex constructs stored in databases to validate the final output.
