In data integration projects, multiple people often work on different aspects of projects at the same time while passing through multiple iterations. This is unavoidable due to changing requirements, the need to perform tasks in parallel, or simply due to mistakes that are made along the way. This often leads to changing base components that other aspects of a project rely on.
In the case of a data mapping specification, mapping specifications often get modified by multiple people at the same time, data models change frequently along way, and DataStage jobs get changed independently of a mapping specification.
This article will show you how to use IBM InfoSphere FastTrack, IBM InfoSphere DataStage, and IBM InfoSphere Data Architect to deal with these situations.
This article assumes that:
- The reader has basic knowledge of FastTrack.
- The reader has basic knowledge of physical database models.
- The reader has a basic understanding of the development process involving multiple user roles, like a business analyst using FastTrack, a developer using IBM InfoSphere DataStage, and a data modeler using IBM InfoSphere Data Architect.
- Import a physical data model from a modeling tool into the metadata repository.
- Create a mapping specification that uses the imported model.
- Switch a data model that is used in an existing mapping specification to a different data model instance.
- Incorporate external changes to an already mapped data model within a existing mapping specification.
- Annotate a previously generated DataStage job with changes that have been made to the mapping specification.
In the first part of this article you will learn how to update mapping specifications for when columns within either the source or target of an existing model instance have changed. The following example assumes the involvement of InfoSphere Data Architect, and the Metadata Brokers and Bridges to transfer data models to the Metadata Repository. You should also be able to perform a similar process with any other data modeling tool by using the appropriate export/import mechanism that applies to that tool. Alternatively if you do not leverage a data modeling tool, then you should be able use a DDL file to create a database table, and import this to the Metadata Repository from within FastTrack.
Follow these steps to import metadata from a data modeling tool into the metadata repository
- Within FastTrack, click Metadata > Metadata Repository.
- Select New Host.
- Give the new host the name
TARGETHOST_1. This host will be used to hold the physical database model that gets imported into the metadata repository.
- Start IBM InfoSphere Data Architect. In project Lab, open the physical database model called Database Model.dbm. Data Architect is a file based tool, so what you see is a file that is located on the local file system that contains a physical database model. This model contains a customer entity.
- From the File menu, click Export. Click Data, and then select Export a Physical Model to the Metadata Server to transfer the model into the metadata repository.
- Click Next, and select the Database Model.dbm. For the host name, specify the newly created TARGETHOST_1, and click Finish.
- Click Select All to transfer all objects from
Database Model.dbminto the repository. You could also specify a filter here. Specify the repository credentials and click OK to transfer all objects into the metadata repository.
You will use the previously imported metadata within a mapping specification.
- Within FastTrack, click Mapping on the top left to enter the Mapping workspace.
- Create a new project using Projects > New Project from the toolbar.
- Give the new project the name
BankRemap, and click Finish.
- Click the Mapping tab, then select Mapping
Specifications under the
BankRemapproject, and select New.
- Give the mapping specification the name
Bank 1 Extract – Remap, and click Save.
- Click the Mapping section within
Bank 1 Extract - Remapand drag the CUSTOMER table within TARGETHOST_1 to the target for the mapping specification. Complete the mapping specification by supplying the following sources from the
bankconnect.BANKDEMO.BANK1schema to the now existing targets. You can provide the sources by expanding the tree on the right-hand side to find an appropriate column, and then simply drag the column over to the source cell in the mapping grid.
Source (BANK1 schema) Target (BANK schema) CHECKING.ACCOUNT_BALANCE CUSTOMER.ACCOUNT_BALANCE CHECKING.ADDR1 CUSTOMER.ADDR1 CHECKING.ADDR2 CUSTOMER.ADDR2 CHECKING.CITY CUSTOMER.CITY CHECKING.CUSTOMER_ID CUSTOMER.CUSTOMER_ID CHECKING.NAME CUSTOMER.NAME CHECKING.STATE CUSTOMER.STATE CHECKING.ZIP CUSTOMER.ZIP setNull() CUSTOMER.ONLINE_ACCESS setNull() CUSTOMER.YEARS_CLIENT setNull() CUSTOMER.GENDER setNull() CUSTOMER.LEVEL CHECKING.SS_NUM CUSTOMERS.TAX_ID
As shown in Figure 1, you have now created a mapping specification that maps a source system to a physical target model that has been imported from InfoSphere Data Architect.
Figure 1. Final mapping specification after creating mappings
The next step is to create a DataStage job. The following instructions will help you to produce a DataStage job from the mapping specification that were created earlier.
- Click the Generate Job button to start the job generation wizard.
- Keep the default name for the new DataStage job, and select the
Jobs folder from the project
dstage1. Clear the Use Shared Table Definitions check box, as shown in Figure 2, for simplicity sake.
Figure 2. Option to use shared table definitions
- Click Finish. In the case that a DataStage job already exists with the same name, simply choose the overwrite option to replace the existing job.
- Start IBM Infosphere Datastage Designer to review the job. If the job
does not appear immediately, then you may need to refresh the view by
clicking Repository > Refresh.
When opening the job, you should see the result shown in Figure 3. As
expected this is a very simple job with a single transform object.
Figure 3. Job after generation from FastTrack
If you see a different result, then you should go back to the mapping specification and verify that everything is mapped correctly. You can click the Validation tab to help you identify potentially unmapped targets.
- Double-click on the Transform object and see the mappings that you
established within FastTrack, as shown in Figure 4.
Figure 4. Contents of the generated Transform stage
So far you imported metadata, created a mapping specification, and generated a DataStage job. This next step will focus on how you can apply a change to the data model back in the data modeling tool. You will then learn how to update the existing metadata in the metadata repository.
- To simulate a change in the metadata definition within TARGETHOST_1,
change the CUSTOMER table's NAME column to
CUSTOMER_NAME within InfoSphere Data Architect.
To perform this change, select the NAME column and go
to the properties in the lower right area of the screen. Change the
name and press enter. As shown in Figure 5, you will see the change
reflected in the tree on the right.
Figure 5. CUSTOMER table after change
- Export this table back to the metadata repository as shown in the previous Import metadata section. Make sure that you specify TARGETHOST_1 as the host name when you export. This is critical since you want to update the existing instance of the metadata in the repository.
- Switch to FastTrack and select the Metadata > Metadata Repository tab. Make sure to click the refresh icon in the upper right corner.
- Expand the TARGETHOST_1 until you see the
CUSTOMERentity. Within the entity you should see the new CUSTOMER_NAME field, as shown in Figure 6.
Figure 6. Metadata repository after change
- The original mapping specification is now invalid because the NAME
column has been altered. This can be displayed by closing and then
Bank 1 Extract -- Remapspecification if it is open. If validation is not enabled, then you have to enable it by clicking the exclamation mark in the right upper corner of the mapping grid. The CUSTOMER.NAME field displays an alert and an explanatory tool tip, as shown in Figure 7.
Figure 7. Warning indication
The change to the metadata requires an update to the existing mapping specifications. In this scenario it would simply be possible to remove the wrong column in the mapping spreadsheet and replace it with the new column that is visible in the metadata browser. In many cases however, many changes occur over a period of time and they are most likely made by a different person in the team. In addition, a column could be used in several places within a mapping specification and therefore would have to be changed in all of these places. A feature called remap helps to simplify this process.
- To update the mapping for this field, right-click the CUSTOMER.NAME field and select Remap. Then, using the Database Metadata window to locate the CUSTOMER_NAME column, drag it into the Matching Replacement column. The screen describes which column in the spreadsheet (on the left) will be replaced with which column from the repository (on the right).
- Click Finish. When you return to the mapping specification window, you will see that the alert on the NAME field has disappeared, and the mapping has been correctly updated.
- Figure 8 shows how the mapping specification has been updated with the
changes that have been made to the schema. Save the mapping
Figure 8. Remap screen
Now that the mapping specification is updated, you need to change the DataStage job as well. A simple way to perform this is to generate a new DataStage job. However, in a real-world scenario, a developer might already have made modifications to the previously generated job. It would be ideal to simply communicate the few modifications to the mapping specification instead of starting over with a new job. FastTrack offers a feature called job annotation that helps you do this.
- To simulate a change to the previously generated DataStage job, simply open the job in the DataStage designer and move one of the stages on the canvas.
- After the layout has been altered, save the job, and ensure that the job is closed in the DataStage designer.
- Now click the Generate Job button to invoke job generation.
- In the job generation wizard, ensure that you select the same job name and the same project. Also, ensure that the option to use table definitions is disabled, and then click OK.
- From the Overwrite existing job window, click
Annotate, as shown in Figure 9.
Figure 9. Job annotation dialog
- Open the DataStage Designer to review the existing job. Notice that the layout change is still visible. This means that the job flow itself has not been modified. Instead, there is an annotation on the canvas that lists the changes between the last job generation and the current state of the mapping specification. All changes are captured in a shared container in the upper-area of the job canvas.
- You can now look at the changes and apply them to the DataStage job. In addition, any subsequent change to the specification can be annotated in the same way to the existing job. In this case there will be one shared container for each time the user performs a generation. These shared containers can also be kept as a change log in the DataStage job.
In many cases there are multiple instances of the same data model in the repository. Reasons for that can be as follows:
- Major changes to an existing data model require phasing in a new model. In this case there are two versions of a model in the repository where the new version has to be slowly adopted by jobs and mapping specifications over time.
- Data models are deployed on multiple systems, for example, by development or production of potentially different database vendors.
Usually mapping specifications or DataStage jobs refer to one of these data model instances and need to be migrated to another. FastTrack can help with this process in a very similar way than what you've seen earlier.
First you will import a second instance of an existing data model into the metadata repository.
- Create a new metadata host to hold the second data model. Repeat the
steps outlined in the Import metadata section
at the beginning of this article, this time giving the new host the
- Within InfoSphere Data Architect, update the CUSTOMER table. Change
ADDR1 to ADDRESS_LINE_1, and ADDR2 to ADDRESS_LINE_2, and change the
maximum length of the ADDRESS_LINE_1 column to 100
characters. Change the CUSTOMER_NAME field back to NAME. And finally,
delete the LEVEL column, and add a COUNTRY column. As shown in Figure
10, this allows you to simulate how to resolve mapping issues which
arise as a result of the typical data modeling operations of addition,
alteration, and removal of columns.
Figure 10. Schema after modification
- Make sure to save the modified model in InfoSphere Data Architect.
- Once again, export to the metadata repository, but this time, export the Database Model.dbm to the TARGETHOST_2 host.
- Within the FastTrack Metadata Repository tab, you can see that the TARGETHOST_1 and TARGETHOST_2 are now populated – each with a slightly different version of the same model.
The next step is to adopt the second model by your existing mapping specification.
- Click the Mapping tab and, if necessary, reopen the
Bank Remap > Bank Extract 1 Mapping
Specification. This time none of the target mappings
display an alert. This is because they still exist in TARGETHOST_1. To
transfer these target mappings to TARGETHOST_2, select all the columns
in the target then right-click and select Remap.
Figure 11. Remap launch point
- As most of the fields have not changed, drag the CUSTOMER table from
the TARGETHOST_2 host within the Database Metadata
window to the equivalent CUSTOMER row within the Matching
Replacement column. The application will try to match all
columns from the new table with the existing columns by name, which
populates most of the rows.
Those rows that have changed: ADDR1, ADDR2, CUSTOMER_NAME, and LEVEL are not populated and display an alert. To rectify this, simply drag the ADDRESS_LINE_1, ADDRESS_LINE_2 and NAME fields from the TARGETHOST_2 target metadata to the relevant field within the Matching Replacement column.
You can also leverage the discover functionality as another option to find matching columns for non-existing columns. In the lower properties area, you can use the discover functionality to find replacement columns by name.
- Since the LEVEL column has been removed from the CUSTOMER table, to resolve this alert, it should be removed from the mapping specification. Click Finish to return to the mapping specification, and then right-click the row number to the left of the id field to select the entire row, and then click Delete from the context menu.
- Now all existing columns have been remapped to new columns of the
second model version. Remember that you also added the COUNTRY column
to the CUSTOMER table. In order to bring any new columns into the
specification, you can drag the entire CUSTOMER table into the target
field column of the last empty row. Note that the editor will now only
add the new columns that have not yet been mapped. This will help you
to identify any additions to the schema without having to manually
compare all elements of the specifications. In this example, the
COUNTRY column is the only new column to be added to the table. As
shown in figure 12, since there is no equivalent Country column within
the CHECKING table, in order to complete the specification, you need
to hard code the value "USA" within the Function column so that all
rows that pass through the transform are assigned this value.
Figure 12. Final mapping specification
- Now you can again regenerate, and either overwrite or annotate the existing DataStage job.
This article demonstrated how to use and update external models, forward changes into your existing assets, and how to work with multiple versions of a database model. It also showed how to use the more advanced remap and job annotation features from FastTrack. If you are currently using the InfoSphere toolset, then this article gives you some new ideas on how to improve your daily workflow.
|Sample data models for this article||ft-changemanagement-data||2KB||HTTP|
- Learn more about InfoSphere FastTrack.
- Learn more best practices on other IBM
InfoSphere Information Server products
- See an introduction to
- In the InfoSphere area on developerWorks, get the resources you need to
advance your InfoSphere Information Server skills.
- Learn more about InfoSphere Data Architect on developerWorks.
Get products and technologies
- Participate in the discussion forum.
- Get involved in the My developerWorks community.
Connect with other developerWorks users while exploring the
developer-driven blogs, forums, groups, and wikis.
Stefan Eberl is a software engineer based in the IBM Silicon Valley Laboratory. He is currently working on the tooling for IBM's Information Management products. In particular he focuses on metadata mapping, and is the architect for IBM InfoSphere FastTrack. Mr. Eberl graduated in 2004 from the Fachhochschule Furtwangen in Germany.