This article is the first in a series of articles that address data integration aspects in the development of composite business services (CBS). A CBS is a collection of related and integrated business services that support a business process built on an SOA. Many enterprises have multiple repositories of customer data to support their different channels or lines of business.
Following a sample banking scenario originally described in "Building SOA composite business services, Part 1" (developerWorks, Jun 2006), you see how to perform the initial load of data into the WCC repository.
Scenario: Banking information integration
Consider the following hypothetical scenario. As part of their customer data integration (CDI) initiative, Jivaro Bank has just purchased the WCC product to store their customer name and address information. In this scenario, Jivaro is going to continue to use their own legacy repository for their other banking data, such as account and segment information, but that data could have been migrated to WCC. By consolidating their customer name and address data into WCC, Jivaro is able to eliminate the multiple channel repositories that currently contain this data, and is able to provide a common set of services to access that data from all areas within the company. Once the customer name and address data has been migrated to WCC, Jivaro will use an information service to retrieve both the legacy Jivaro credit information and the WCC customer information for display in the Customer Profile portlet. The WebSphere Information Server Directory (WISD) capabilities of IIS are used to create that information service (see Figure 1 below). The use of WISD is not covered in this article, but will be the topic of a future article in this series.
Figure 1. Integration of WCC repository into Jivaro
The integration of this CDI software is accomplished by executing a sequence of steps:
-
Load legacy data into the WCC repository
- The first step involves moving the customer name and address data from the existing Jivaro data stores into the WCC repository. Any ETL tool can be used to load the WCC repository, but this task is made simpler by using WebSphere DataStage, since IBM provides High Volume Load (HVL) DataStage jobs that handle the insertion of legacy data into the WCC repository. DataStage is a module in the Information Server suite of products. When using the HVL jobs, the only task left to the practitioner is the task of producing the HVL input file containing the data to be loaded. Producing that input file is one of the topics covered in this article. The Jivaro data is stored in DB2, but DataStage supports a virtually unlimited number of data sources including text files, complex data structures in XML, ERP systems such as SAP and PeopleSoft and almost any database (including partitioned databases), Web services, and business intelligence tools like SAS.
-
Create a WCC portlet
- The second step involves using the WCC Data Stewardship User Interface to replace the Jivaro customer profile portlet with a new portlet. The new portlet provides access to the more complete set of customer profile attributes that the WCC repository provides.
-
Extend WCC services
- The third step involves extending the WCC-provided services to access the Jivaro-specific information that isn't addressed by the WCC services. Jivaro bank will create a new service to allow searching on a Jivaro-specific customer ID field within the WCC repository.
Steps 2 and 3 are described in the second article in this series.
Load legacy data into the WCC repository
There are three tasks involved in loading the Jivaro data into the WCC repository.
- Map the legacy data to the WCC HVL record formats.
- Create a DataStage job to produce a file containing the legacy data in the HVL format.
- Running the HVL job sequences to load the WCC repository.
The "Map legacy data to the WCC HVL formats" section describes how to approach the mapping effort, using mappings from the Jivaro database as examples. The "Create the DataStage job to produce the HVL input file" section describes the Jivaro DataStage jobs that were created in order to produce the load file with the Jivaro data. The "Running the HVL job sequences" section describes how to run the HVL-provided load jobs.
1. Map legacy data to the WCC HVL formats
WCC provides a set of HVL jobs that use WebSphere's DataStage to load the WCC repository with legacy data. These jobs require a single file as input. This file should contain all of the legacy data to be loaded, in the proper record layouts defined by the HVL documentation. HVL has defined three types of data, called record types, that can be loaded into WCC: contact, organization, and contract. For each of these record types, there are associated "subtypes" that further break down the information in each of the record layouts. You must map your legacy data to the file formats defined for these record types/subtypes and provide a single file as input to the WCC DataStage load job. See Figure 2 for the types/subtypes supported.
Figure 2. HVL record types and sub types
The order of the records within the input file provided to the HVL job is critical. The records must be sorted by customer and then by record type/subtype. This sort sequence assures that the proper primary and foreign keys are created in the WCC database. For additional guidelines, refer to the HVL documentation.
Mapping the legacy data to the HVL file formats can be accomplished by following a sequence of steps:
-
Map obvious relationships, such as commonly recognized data elements
- Street and city in Jivaro map to addr_line_1 and city.
-
Map non-obvious relationships by reviewing the WCC HVL documentation to understand the content of each input record.
- Legacy Jivaro data contains three phone numbers. A phone_number field does not exist in the HVL layouts. HVL specifies a REF_NUM field in the LocationGroup_ContactMethodGroup_ContactMethod file that stores a contact number, phone or email, and a corresponding code value in the input file that is set based on the type of contact number being provided, such as 1 for home phone, 7 for personal email.
-
Translate legacy data literal values to WCC integer code values.
- WCC contains more than 90 reference tables that hold integer code values and their descriptions. Integer code values are required in many of the input file formats.
- Marital status is a literal value in Jivaro, yet it is a code value in WCC. In order to map that properly, you must lookup marital status in the WCC reference table, find a match on the description, and retrieve the appropriate code. This code is then specified in the HVL input file.
-
Insert new code values into the WCC reference tables, when required to represent the legacy data.
- The WCC reference tables contain a small set of sample reference values that might not cover the needs of all legacy systems.
- Jivaro contains the values "Regular" and "Premium" to represent the segment that a customer belongs to. In WCC, there is a code for "client importance", which is seeded with the sample values "Bronze", "Platinum", "Gold", and "Silver." Prior to loading the Jivaro data, two additional code values need to be added to the reference table, and the appropriate code value provided in the input file. In addition, the sample values could be deleted if they were never going to be used in Jivaro.
-
Map the legacy data that does not have an obvious mapping to the WCC entities.
- For all remaining unmapped fields, investigate the additional WCC entities that are designed to handle miscellaneous and legacy-specific data. Determine if there is a "best fit" choice for mapping the legacy element. If a fit cannot be found, a WCC entity could be "extended", meaning altered to add an additional column, to handle the legacy data. If an extension is required, that data would need to be loaded manually, not through the HVL job, and performance considerations would need to be addressed. It might be necessary to add an index on that new field if search capability is going to be provided.
- Jivaro contains an alphanumeric customerId element that is an identifier that each customer uses to identify themselves to the bank. The Jivaro customerId cannot be mapped to the WCC contact ID, because contact ID is a randomly generated number. The Contact table could be extended to include an additional column, or another WCC entity could be used to store that value. Jivaro used the WCC entity called "Contact Equivalent" that is designed to store customer identifying information. Since this field is needed for search capabilities, the WCC search services have been extended to search on this field. This service extension is described in the second article in this series.
2. Create the DataStage job to produce the HVL input file
This section describes the Jivaro usage of the WCC-provided HVL jobs as well as the jobs that were created in order to produce the HVL load file with the Jivaro legacy data.
The first step is to import the WCC DataStage project into a new project in DataStage. To accomplish this, the HVL instructions that were provided by WCC were followed.
The next step is to create a job sequence within DataStage to produce the single HVL input file containing the legacy data to be loaded into WCC.
To create the Jivaro job sequence, three tasks must be executed (see Figure 3 below):
- Create a parallel job called RetrieveAllCustomerData that produces a single output file for each record type/subtype required to represent the legacy Jivaro data.
- Using a job sequence stage call CMDtoMergeFiles, append the files into a single file.
- Create a second parallel job called FormatWCCInput to sort the single file into the order required by the HVL load job.
Once the single load file containing Jivaro legacy data is created, run the HVL job sequence provided by WCC to load the WCC repository.
Figure 3. Job sequence to produce the HVL input file
Begin by creating a parallel job to retrieve the legacy data from Jivaro, transform that data as necessary, and store that transformed data in a series of flat files, one per record type/subtype (see Figure 4 below). As part of this job, you need to add lookup stages for the data elements that have to be translated from literals to code values. Then use a transform stage to parse the data into the individual record type/subtype formats that were identified as part of the mapping exercise. Design the transform stage to create a separate output link per record type/subtype, and where necessary, create multiple instances of a record type.
Figure 4. RetrieveAllCustomerData parallel job to retrieve legacy Jivaro data
The output of the parallel job is 11 flat files, each with their own record layouts. These files need to be combined into a single file, which is then sorted in the second parallel job. It was found that the simplest way to merge all of the files into a single file is to use the Execute Command stage of a job sequence. This example is running on a Windows operating system, and the TYPE command is used within the Execute Command stage to append all of the files together (see Figures 5 and 6).
Figure 5. Beginning of the TYPE command
Figure 6. Target of the TYPE command is the file allFileInput
While this approach is the simplest, it has a couple of drawbacks. First, the Execute Command issues a platform specific command. This needs to be tailored if running in an AIX environment. Second, parameters are not allowed in the Execute Command stage, so the path of each of the files had to be hardcoded. You'll notice in Figures 5 and 6 that the path of c:\dsland\src needs to be provided for each of the files being appended into the target file. If this job was to move to another DataStage server, this Execute Command stage would have to be modified to specify the correct path for each of the files.
The final step is the creation of a parallel job that took the file created in the Execute Command as input, passed the data through a sort stage to sort by customer ID, record type, and record subtype, then store the data in a single flat file. Then specify this file as the input file to the HVL load job (see Figure 7).
Figure 7. FormatWCCInput
Once the HVL input file is created, you are able to run the HVL job sequence to process that legacy data.
HVL provides two job sequences. One sequence loads the data directly into the WCC repository (database mode), the other creates load files that you can process using an SQL command to load the data into WCC (file mode). Both sets of sequences execute the same set of validation jobs. See Figure 8 for a snapshot of the DBLoadJobFlow sequence, which loads the data directly into the WCC repository.
Figure 8. DBLoadJobFlow sequence that processes the Jivaro input file
The HVL sequences invokes a ParseInput job that reads the input file provided, and parses it into individual data sets containing the data for each record type/subtype (see Figure 9).
Figure 9. ParseInput HVL job
The remaining jobs in the job sequence validate and format the data for each of the data sets. There is a validation job per data set and the flow of each of those jobs is very similar. The first step involves assigning a key value for the data, then each of the reference code values are validated against the WCC reference tables. If any errors are encountered during the processing of a particular data set, the error is written to a log file, and the customer key value is stored in a file that can be used later to delete that customer's data from the WCC database so that only complete customer records are loaded into WCC. Once the customer data in error is fixed, that data can be re-run through the HVL job and subsequently loaded into WCC.
When loading the data into the WCC database, the order of loading the tables is important due to referential integrity constraints. It is necessary to properly order the loads if loading manually, or if inserting directly through DataStage there is the requirement to drop all foreign keys prior to running the DataStage job and then re-establishing them after. This is documented in the HVL guide.
A best practice identified was to first run the WCC DataStage job in file mode to ensure that all of the load files were properly created. Then create an SQL job that could clean the WCC tables by removing only the data that was inserted through the DataStage job. This enables you to run the WCC DataStage job in database mode repeatedly.
CDI, or the "360 degree view of the customer", are terms that refer to the ability to provide a complete view of the customer across brand, channel, and company. WCC accomplishes this by providing a repository that is capable of storing that data along with a framework for accessing and extending that data. WCC simplifies the migration of legacy data into the WCC repository by providing DataStage jobs to assist with that task. A five-step process was identified to map legacy data to the HVL file format, and a three-step DataStage job sequence was identified to produce the legacy input file. As a result of Jivaro's integration with WCC, Jivaro now has a consolidated, multi-channel customer repository that is accessible through a common set of services provided by WCC. Read the second article in this series for details on how to create a WCC customer profile portlet, how to integrate the WCC services into the Jivaro environment, and how to extend the WCC-provided services.
| Description | Name | Size | Download method |
|---|---|---|---|
| Demo 1 | 15MB | HTTP | |
| Demo 2 | 17MB | HTTP | |
| Demo 3 | 13MB | HTTP |
Information about download methods
Learn
-
"Building Composition Business Services, Part 1"
(developerWorks, Jun 2006): Create composite business services. Each article in this series documents how to use a CBS-enabling technology in the context of a Software-as-a-Service hosted banking environment.
-
WebSphere Customer Center: Find product information and downloads. Specifically, find a guide for DataStage WCC load jobs: "IBM IIS High Volume Load (HVL) Solution Installation Instructions For IBM WebSphere Customer Center" (WCCHVLSolutionInstallInstructions.pdf), as well as a data dictionary for WCC: (DWLCustomerDataDictionary.pdf), available when you download the product.
-
Visit the Information Integration page on developerWorks to read articles and tutorials on the IBM Information Server and connect to other resources to expand your information integration skills.
- Browse the
technology bookstore
for books on these and other technical topics.
-
developerWorks Information Management zone: Learn more about DB2. Find technical documentation, how-to articles, education, downloads, product information, and more.
-
Stay current with developerWorks
technical events and webcasts.
Get products and technologies
-
Build your next development project with
IBM
trial software, available for download directly from developerWorks.
Discuss
-
Participate in developerWorks blogs and get involved in the developerWorks community.







