Extending the Basic Initial Load in InfoSphere Master Data Management Advanced Edition

Basic Initial Load (BIL) Extension Guidance

The Master Data Management (MDM) Advanced Edition (AE) Basic Initial Load (BIL) DataStage assets can be successfully extended. The BIL DataStage project presents a patterned and disciplined approach to the initial loading of data. This article highlights, in an easy-to-follow fashion, how to use the existing structures to add tables to the types of data that BIL currently loads into the MDM database.


Paul Flores (pflores@hotmail.com), Senior Software Engineer, IBM

Paul Flores is a Senior Software Developer and Application Architect, and is an IBM Certified Solution Developer for InfoSphere DataStage v8.5. His career spans more than 20 years in various Information System disciplines, working with organizations such as Sandia National Laboratories, Intel, and Acxiom. The majority of his career has centered on the development of software in support of diverse areas spanning research and manufacturing. Paul joined IBM in 2008 and has been a part of the MDM Basic Initial Load (BIL) development team since its inception. He holds a Bachelor's degree in Mathematics and a Master's degree in Computer Information Sciences. At the 2012 Information On Demand (IOD) conference, he presented a hands-on lab that walked through the extension of BIL.



03 July 2013


Introduction

The primary purpose of the MDM Basic Initial Load (BIL) DataStage project is to rapidly insert data into a predefined set of core MDM database tables. While this set of database tables covers both the Party and Contract domains, it represents only a portion of the total number of MDM database tables.

The records that load these core database tables are defined in Standard Interface Format (SIF) files. By design, BIL uses a component structure for handling the SIF files and loading data into the core MDM database tables.

This article provides a technical overview of the design of BIL and detailed guidance on how to take advantage of its component style and structure to extend its functionality. You are guided through the steps to extend BIL so that it loads additional data elements and data tables.


Technical overview

BIL consists of the following seven distinct functional areas: Import, Validation, Preparation, Data Quality Error Consolidation, Assign Identifiers / Insert, Data Quality Error Report Generation and Clean Up.

The invocation of many of the functional areas is mutually exclusive. Validation and Preparation functionality is driven primarily by a job parameter; if the parameter is set, one or more of the assets are not invoked. Data Quality Error Consolidation is driven by the presence of error messages. Data Quality Error Report Generation runs only if Data Quality Error Consolidation is invoked. Assign Identifiers / Insert runs only if no error messages are present.

Clean Up functionality runs independently of the other areas. It is used to archive, then delete, intermediate data sets and any associated error message files. Because the Clean Up functionality uses a wild card naming convention to identify both intermediate data sets and error message files, it does not need adjustment when the other functional areas are extended.
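
Listing 1 illustrates the effect of that wild card convention with a small, hypothetical Python sketch. The directory locations and file patterns are illustrative assumptions rather than the actual BIL implementation, and a real DataStage parallel data set is removed with the orchadmin utility rather than a plain file delete; the sketch only shows why Clean Up does not change when new record types are added.

Listing 1. Hypothetical sketch of the Clean Up wild card convention
import glob
import os
import shutil

# Illustrative locations; BIL resolves its directories from the MDMIS parameter set.
DATA_SET_DIR = "/mdm/bil/datasets"
ERROR_DIR = "/mdm/bil/errors"
ARCHIVE_DIR = "/mdm/bil/archive"

def clean_up(batch_id):
    """Archive, then delete, intermediate data sets and error message files.

    The wild card patterns (any prefix, the batch ID, a known extension) pick up
    files for any record type, so new RT/STs require no change to this logic."""
    patterns = [os.path.join(DATA_SET_DIR, "*" + batch_id + "*.ds"),
                os.path.join(ERROR_DIR, "*" + batch_id + "*.txt")]
    os.makedirs(ARCHIVE_DIR, exist_ok=True)
    for pattern in patterns:
        for path in glob.glob(pattern):
            shutil.copy2(path, ARCHIVE_DIR)  # archive first ...
            os.remove(path)                  # ... then delete the intermediate file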

Figure 1 shows the Basic Initial Load (BIL) functional areas and their interrelationships.

Figure 1. BIL functional flow
A flowchart depiction of the overall functional flow of BIL.

Extension guidance

This section consists of extension guidance for the following functional areas: Import, Validation, Preparation, Data Quality Error Consolidation, Assign Identifiers/Insert, and Data Quality Error Report Generation.


Import functionality

At a high level, the import functionality is an implementation of a pattern. This pattern consists of the following steps.

  1. Read delimited data into a single text data field until there are no more records to read.
  2. Split or extract record type identifiers.
  3. Split or extract identifiers.
  4. Use record type identifier to identify record metadata.
  5. Split or extract record columns from text data field to populate record with data properly typed.
  6. Identify and manage records with duplicate record and business key identifiers.
  7. Write records to properly named data store for downstream processing.
  8. Capture and record all errors with formatted error messages.

The pattern, as described, lends itself to implementation with any number of tools and techniques. Figure 2 shows a flowchart representation of this pattern.

Figure 2. Import pattern flowchart
This figure shows a depiction of the import process as a flowchart.
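
The pattern is tool neutral. Listing 2 is a minimal, hypothetical Python sketch of the same steps; the pipe delimiter, the record-type map, and the business keys are invented for illustration, because the real layout and keys are defined by the MDM SIF specification and the BIL jobs.

Listing 2. Hypothetical Python sketch of the import pattern
import csv

# Illustrative metadata: sub-type code -> (target store, column names, business key columns).
RECORD_METADATA = {
    "P": ("Parse_Contact",
          ["REC_TYPE", "SUB_TYPE", "ADMIN_SYS_TP_CD", "ADMIN_CLIENT_ID", "BIRTH_DT"],
          ["ADMIN_SYS_TP_CD", "ADMIN_CLIENT_ID"]),
    "I": ("Parse_Identifier",
          ["REC_TYPE", "SUB_TYPE", "ADMIN_SYS_TP_CD", "ADMIN_CLIENT_ID", "ID_TP_CD", "REF_NUM"],
          ["ADMIN_SYS_TP_CD", "ADMIN_CLIENT_ID", "ID_TP_CD", "REF_NUM"]),
}

def import_sif(path):
    stores, errors, seen = {}, [], set()
    with open(path, newline="") as sif:
        for rec_no, fields in enumerate(csv.reader(sif, delimiter="|"), start=1):  # step 1
            sub_type = fields[1] if len(fields) > 1 else None                       # steps 2 and 3
            meta = RECORD_METADATA.get(sub_type)
            if meta is None:                                                        # step 8
                errors.append("%s:%d unknown record sub-type %r" % (path, rec_no, sub_type))
                continue
            store, columns, key_columns = meta                                      # step 4
            record = dict(zip(columns, fields))                                     # step 5
            key = tuple(record.get(k) for k in key_columns)                         # step 6
            if (sub_type, key) in seen:
                errors.append("%s:%d duplicate business key %r" % (path, rec_no, key))
                continue
            seen.add((sub_type, key))
            stores.setdefault(store, []).append(record)                             # step 7
    return stores, errors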

BIL DataStage Import functionality

With regard to BIL, the import functionality pattern is expressed in terms of the MDM Standard Interface Format (SIF) and the defined record types and record sub-types (RT/ST). The following steps capture the flow of the import functionality as implemented in BIL with DataStage; a short sketch of the output naming conventions follows the list.

  1. Read in SIF File.
    1. Use file name pattern of "*.sif".
    2. Capture any associated read errors.
    3. Append SIF file name and record number to incoming text record.
  2. Extract text record identifier fields.
    1. Party Sub Types: Person - "P", Organization - "O", Organization Name - "G", Person Name / Person Search - "H", Location Group / Address Group / Address -"A", Location Group / Contact Method Group/ Contact Method - "C", Identifier - "I", Line of Business Relationship - "B", Contact Relationship -"R", Privacy Preference Entity / Privacy Preference - "S", Miscellaneous Values - "M", Alert - "T".
    2. Contract Sub Types: Contract -"H", Contract Component - "C", Contract Role - "R", Role Location - "L", Contract Component Value -"V", Miscellaneous Values - "M", Alert - "T".
  3. Extract source record identifiers.
    1. ADMIN_SYS_TP_CD
    2. ADMIN_CLIENT_ID
    3. ADMIN_CONTRACT_ID
    4. ADMIN_CLIENT_OR_CONTRACT_ID
    5. Capture any associated errors
  4. Based on RT/ST, send the incoming record to extract record column data. The current implementation is an RT/ST-specific shared container.
    1. Identify record type sub-type business keys.
    2. Capture errors encountered in extracting record column data.
  5. Sort on incoming data record business keys. This is critical to identify incoming duplicate records.
  6. Drop incoming records with duplicate data record business keys. Format duplicate business record error message.
  7. Output data set of the parsed RT/ST data records with the following naming conventions:
    1. Prefix data set name with "Parse_".
    2. Party data set names: Person / Organization - "Contact", Organization Name - "OrgName", Person Name / Person Search - "PersonName", Location Group / Address Group / Address -"Address", Location Group / Contact Method Group/ Contact Method - "ContactMethod", Identifier - "Identifier", Line of Business Relationship - "LOBRel", Contact Relationship -"ContactRel", Privacy Preference Entity / Privacy Preference - "PrivPref", Miscellaneous Values - "MiscValue", Alert - "Alert".
    3. Contract data set names: Contract - "Contract", Contract Component - "ContractComponent", Contract Role - "ContractRole", Role Location - "RoleLocation", Contract Component Value - "ContractCompVal", Miscellaneous Values - "ContractMiscValue", Alert - "ContractAlert".
    4. Data set name suffix: Job Batch ID.
    5. File type identifier: "ds".
  8. Output data set consisting of source record identifiers.
    1. Party: data set name: "RI_Contact_SUBSET", data set name suffix: Job Batch ID, and file type identifier: "ds".
    2. Contract: data set name: "RI_Contract_SUBSET", data set name suffix: Job Batch ID, and file type identifier: "ds".
  9. Collect both formatted and unformatted error messages. Format unformatted error messages:
    1. File processing errors
    2. Error creating column data
    3. General processing errors
  10. Output error message text files.
    1. File processing error message file. Prefix file name with:
      1. Record Type: PARTY_ or CONTRACT_
      2. File name "SIF_Import_Recordization_ERR_MSGS"
      3. File name suffix: Job Batch Id
      4. File type identifier: "txt"
    2. Import processing error messages. Prefix file name with:
      1. Record Type: PARTY_ or CONTRACT_
      2. File name "SIF_Import_ERR_MSGS"
      3. File name suffix: Job Batch Id
      4. File type identifier: "txt"
    3. General processing error messages. Prefix file name with:
      1. Record Type: PARTY_ or CONTRACT_
      2. File name "SIF_Import_IID_ERR_MSGS"
      3. File name suffix: Job Batch Id
      4. File type identifier: "txt"
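
Listing 3 summarizes the naming conventions of steps 7 and 10 as a small, hypothetical Python sketch. The ordering of the name parts comes from the steps above; the separator characters are an assumption.

Listing 3. Hypothetical sketch of the import output naming conventions
def parse_data_set_name(data_set_name, batch_id):
    """Step 7: 'Parse_' prefix, RT/ST data set name, Job Batch ID, 'ds' file type."""
    return "Parse_" + data_set_name + "." + batch_id + ".ds"

def import_error_file_name(record_type, file_name, batch_id):
    """Step 10: PARTY_ or CONTRACT_ prefix, file name, Job Batch ID, 'txt' file type."""
    return record_type + "_" + file_name + "." + batch_id + ".txt"

# Illustrative usage with an invented batch ID.
print(parse_data_set_name("Identifier", "1001"))                       # Parse_Identifier.1001.ds
print(import_error_file_name("PARTY", "SIF_Import_ERR_MSGS", "1001"))  # PARTY_SIF_Import_ERR_MSGS.1001.txt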

Import functionality extension guidance

Import functionality can be extended to address additional data tables, additional record type sub-type (RT/ST) record columns, and changes to incoming RT/ST column metadata. Each of these changes is addressed in the following subsections.

Extension guidance: Additional data tables

The addition of data tables requires the following tasks to be addressed and completed:

  1. Define data table SIF format.
    1. Identify record type (RT).
    2. Identify sub-type (ST).
    3. Ensure there is no overlap with an existing record type sub-type (RT/ST) designation.
  2. Add RT/ST definition in the BIL DataStage Import Functionality section.
  3. Copy and rename import shared container(s) to reflect the new data table(s).
    1. Change and/or add import column definitions.
    2. Change the business keys in the key sort so that duplicate records can be caught early.
    3. Ensure error messages are properly numbered and their text is consistent.
    4. Ensure that column metadata is consistent from stage to stage.
    5. Follow the naming convention for both error message files and the resulting data set.
    6. Change the names of the links, especially the input and output links.
  4. Make an archival copy of the BIL import job that the new RT/ST is going to be added to.
  5. Add new RT/ST to tx_SIF_Splitter transformer in the related import job (BL_010*), adding an output link to connect new import shared container.
  6. Connect shared container output links as patterned by existing shared container output links to the local containers.
  7. Ensure the new local container links are attached as patterned within each local container.

Extension guidance: Column addition to an RT/ST record

The following tasks describe the addition of a column to an existing RT/ST record. The tasks are all performed in the related RT/ST import shared container.

  1. Add column to the column import stage.
  2. Add column to all related stage column metadata (output column).

Extension guidance: Changes to RT/ST column metadata

The following tasks describe the changes to column metadata of an existing RT/ST record. The tasks are all performed in the related RT/ST import shared container.

  1. Change column metadata in the column import stage.
  2. Make changes to the column metadata in all related stage column metadata (output columns).

Validation functionality

At a high level, the validation functionality is an implementation of a pattern. This pattern consists of the following steps:

  1. Read in the RT/ST specific records that have been imported from a SIF file(s).
  2. Look up specific type code values against existing code tables, capturing whether or not each incoming code value is found in the related code table.
  3. Examine the type code look up results: business rules identify those type codes that must exist and those that may exist.
  4. Capture business rule related errors:
    1. Mark records with errors.
    2. Format related error message.
    3. Output formatted error messages to a text file.
  5. Drop records in error:
    1. Format related error message.
    2. Output formatted error messages to a text file.
  6. Output records that have no errors:
    1. If necessary, additional functionality can be performed before the records are written.
    2. The data store naming convention either identifies that the records have been validated or suits downstream processing.

The pattern, as described, lends itself to implementation utilizing any number of tools and techniques. The flowchart shown in Figure 3 is a representation of this pattern.

Figure 3. Validation pattern flowchart
A depiction of the record validation process as a flowchart
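
Like the import pattern, the validation pattern can be prototyped outside of DataStage. Listing 4 is a hypothetical Python sketch; the code tables and business rules shown are invented, whereas in BIL the code values come from the MDM database code tables and the rules are RT/ST specific.

Listing 4. Hypothetical Python sketch of the validation pattern
# Invented code tables and rules for illustration only.
CODE_TABLES = {
    "ID_TP_CD": {"1", "2", "3"},
    "GENDER_TP_CD": {"M", "F"},
}
MANDATORY_FIELDS = ["ADMIN_SYS_TP_CD", "ADMIN_CLIENT_ID"]

def validate(records):
    valid, errors = [], []
    for rec in records:
        in_error = False
        for field, codes in CODE_TABLES.items():           # steps 2 and 3: code table look ups
            value = rec.get(field)
            if value is not None and value not in codes:   # optional field, but must be valid if present
                errors.append("%s=%r not found in code table" % (field, value))
                in_error = True
        for field in MANDATORY_FIELDS:                     # business rule: required fields
            if rec.get(field) is None:
                errors.append("%s is mandatory" % field)
                in_error = True
        if in_error:                                       # steps 4 and 5: drop records in error
            errors.append("record dropped")
        else:
            valid.append(rec)                              # step 6: output records with no errors
    return valid, errors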

BIL DataStage Validation functionality

With regard to BIL, the validation functionality pattern is expressed in terms of the defined record types and record sub-types. The following captures the flow of the validation functionality as implemented in BIL with DataStage.

  1. Read in RT/ST specific data set:
    1. Data set name prefix: "Parse_".
    2. Data set name: as described in section BIL DataStage Import functionality step 7.
    3. Data set name suffix: Job Batch ID.
    4. File type identifier: "ds".
  2. Use custom client validation logic: the implementation is an RT/ST-specific edit point shared container.
  3. Set up type code look up result indicator.
  4. Perform type code look ups: Implementation consists of database code table-specific shared containers.
  5. Examine type codes value:
    1. Some fields can be null, but if populated they must be valid.
    2. If error: Format error message, output error message, and mark record in error.
  6. Examine required fields:
    1. Some fields cannot be null.
    2. If error: Format error message and mark record in error.
  7. Examine other RT/ST specific business rules. For example, some dates must be in proper chronological order. If error: format error message and mark record in error.
  8. Drop records in error: Format error message and drop record out of processing stream.
  9. Remove fields used for evaluation from processing stream.
  10. Output data set of validated RT/ST data record with naming convention:
    1. Data set name prefix: "Valid_". Exceptions to accommodate Data Quality Error Consolidation include: ContactRel: "ErrCon_ContactRel_0" and Identifier: "ErrCon_Identifier_0".
    2. Data set name: as described in section BIL DataStage Import functionality with exceptions noted previously.
    3. Data set name suffix: Job Batch ID.
    4. File type identifier: "ds".
  11. Consolidate and output error messages with the following naming convention:
    1. File name prefix: PARTY_ or CONTRACT_.
    2. File name: as described in section BIL DataStage Import functionality step 7.
    3. File name suffix: "_VS_ERR_MSGS".
    4. File type identifier: "txt".

Validation functionality extension guidance

Validation functionality can be extended to address additional data tables, additional record type sub-type (RT/ST) record columns, and changes to incoming RT/ST column metadata. Each of these changes is addressed in the following sections.

Extension guidance: Additional data tables

The addition of data tables requires the following tasks to be addressed and completed. The tasks that follow cover the general case for most tables. If the new table has foreign keys, then additional detailed tasks are required; they follow these general tasks.

  1. Identify RT/ST type code fields.
    1. Review existing SIF file definitions to see if the type code is already in use: look at the database shared containers used in the RT/ST job look ups.
    2. If not in use, identify the related type code table that can be used for validation, copy an existing database container (the naming convention is ILDBSEL<Code Table Name>), and use existing error message error numbers.
  2. Identify RT/ST business rules.
    1. Identify mandatory fields.
    2. Identify date specific rules.
    3. Identify default value (if appropriate).
  3. Optional: Copy and rename, or create an edit point shared container, and consider the following:
    1. The edit point shared container is purely optional.
    2. Follow same naming convention.
    3. Address input and output link metadata.
  4. Copy and rename an existing validation shared container by doing the following:
    1. Choose a validation shared container that has similar type codes.
    2. Use Save As functionality before editing.
  5. Alter new validation shared container, and consider the following:
    1. Ensure type code validations are performed.
    2. Alter main transformer to handle business rules.
    3. Review error message identifiers and message text.
  6. Copy and rename an existing validation job (BL_02*_VS). Use Save As functionality.
  7. Correct the data set input stage to point to the new "Parse_" data set, and consider the following:
    1. Correct the data set name, leaving the remainder of the naming pattern intact.
    2. Suggested: add column definitions to the output tab.
  8. Handle edit point shared container.
    1. If an edit point shared container is to be used, then add it and validate input and output links.
    2. If an edit point shared container is not used, then remove the shared container and add a Column Generator stage that adds the ERR_ON_ROW_USER_EXIT column (TinyInt data type) with a default value of 0.
  9. Add the validation shared container defined and altered previously, and validate its input and output links.
  10. Correct data set output stage and ensure the following:
    1. Data set name prefix: "Valid_".
    2. Data set name as described in section BIL DataStage Import functionality.
    3. Data set name suffix: Job Batch ID.
    4. File type identifier: "ds".

If a new data table contains foreign keys/references to other Party or Contract records, then it is recommended that logic be added into the new validation shared container consisting of a look up against either the RI_Contact_SUBSET or RI_Contract_SUBSET.

Extension guidance: Column addition to an RT/ST record

Add the column to all related stage column metadata (output columns), including associated shared or local containers.

If the column is a new type code field or is associated with specific business rules, then do the following:

  1. Follow the look up pattern in the related RT/ST validation shared container if it is a type code field.
  2. Augment the related transformer logic to handle the new look up indicator or business rule, utilizing a stage variable.
  3. Ensure that any related error message error codes and descriptions are appropriate.

Extension guidance: Changes to RT/ST column metadata

Make changes to the column metadata in all related stage column metadata (output columns), including all associated shared or local containers.

Preparation functionality

Preparation functionality renames "Parse_" data sets so they can be processed by downstream processes, either Data Quality Error Consolidation or Assign Identifiers and Insert. Care must be taken with the parameter that bypasses validation; bypassing validation is only recommended if the incoming data comes from a source where data quality is well controlled.

There is little to this particular functional area. The only case that needs to be addressed is when a new RT/ST is added, and in that regard, the existing job sequences provide ample examples of how to rename a data set. Other considerations that overlap with Data Quality Error Consolidation are discussed in the next section.

Data Quality Error Consolidation functionality

As with the previous functional areas, with the exception of Preparation, this area follows a pattern.

  1. Read in error message files.
  2. Extract record identifiers of records in error.
  3. Set up data set of record identifiers in error.
  4. Read in data records.
  5. Set up data set of records to examine.
  6. Join foreign keys of records to be examined with record identifiers in error.
  7. Examine join set.
    1. If join set is not empty, then remove records that appear in join set from records to be examined, add record identifiers to the data set of record identifiers in error, and format and output an error message (record in association with record in error dropped). Repeat the foreign keys step.
    2. If join set is empty, then continue to the next step.
  8. Join primary keys of records to be examined with record identifiers in error.
  9. Examine join set.
    1. If join set is not empty, then remove records in join set from records to be examined, add record identifiers to the data set of record identifiers in error, and format and output related error message (record in association with record in error dropped). Repeat the join foreign keys step.
    2. If join set is empty, then continue to the next step.
  10. Output error message file.
  11. Output remaining records to a data set.

This pattern is represented in the flowchart shown in Figure 4.

Figure 4. Data Quality Error Consolidation Pattern Flowchart
This figure shows a depiction of the data quality error consolidation process as a flowchart.
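
The iterative portion of the pattern (steps 6 through 9) amounts to computing the closure of records that reference, directly or indirectly, a record in error. Listing 5 is a hypothetical Python sketch of that closure; the record shape, with an "id" and a list of "foreign_keys", is an illustrative assumption.

Listing 5. Hypothetical Python sketch of drop-by-association consolidation
def consolidate(records, error_ids, max_iterations=10):
    """Drop records whose foreign keys, and finally whose primary keys, reference a
    record already in error; repeat the foreign key pass until nothing new drops.
    max_iterations plays the role of a bound like DS_DROP_MAX_ITERATIONS."""
    to_drop = set(error_ids)
    messages = []
    for _ in range(max_iterations):
        newly_dropped = {r["id"] for r in records
                         if r["id"] not in to_drop
                         and any(fk in to_drop for fk in r.get("foreign_keys", []))}
        if not newly_dropped:                              # empty join set: leave the loop
            break
        for rid in newly_dropped:
            messages.append("%s dropped: references a record in error" % rid)
        to_drop |= newly_dropped
    survivors = []
    for r in records:                                      # final primary key join
        if r["id"] in to_drop:
            messages.append("%s dropped: record is in error" % r["id"])
        else:
            survivors.append(r)
    return survivors, messages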

BIL DataStage Data Quality Error Consolidation functionality

The following captures the flow of the data quality error consolidation functionality as implemented in BIL with DataStage. Currently, the consolidation of records with data quality errors is performed by a job sequence consisting of a loop that repetitively runs a job handling foreign key references to records in error, followed by another job that handles primary key references to records in error. The flow that follows is a combination of more than one DataStage job.

  1. Check to see if the error reason error message table exists using related orchadmin command (orchadminCheckDataSet) with the following parameters: the MDMIS FS_DATA_SET_HEADER_DIR parameter concatenated with "ErrReasonErrMsg_tbl.", MDMIS BATCH_ID parameter and ".ds".
  2. If the data set exists, then skip to step 8; otherwise continue.
  3. Connect to the ERRREASON database table returning the following data values:
    1. ERR_MESSAGE_TP_CD
    2. ERR_REASON_TP_CD
    3. ERR_SEVERITY_TP_CD
  4. Output the ErrReasonErrMsg_tbl data set using the MDMIS FS_DATA_SET_HEADER_DIR parameter concatenated with "ErrReasonErrMsg_tbl.", MDMIS BATCH_ID parameter and ".ds".
  5. Read in the MDMIS EC parameter set, then use the file name pattern consisting of #FS_PARAM_SET_DIR#/MDM_EC/DEFAULT.
  6. Set error severity values read in from the ERRREASON table for those error messages specified in the default.
  7. Write out an MDMIS EC parameter set named VOLATILE by using the file name pattern of #FS_PARAM_SET_DIR#/MDM_EC/VOLATILE.
  8. Read in Error Message files using file name pattern consisting of:
    1. The MDMIS FS_ERROR_DIR parameter pointing to directory where error message files are written.
    2. Record type: PARTY* and CONTRACT*.
    3. Error Message type: validation: VS_ERR_MSGS, relational integrity: RI_ERR_MSGS, and import: SIF_Import_IID_ERR_MSGS.
    4. MDMIS BATCH_ID parameter.
    5. File type identifier: "txt".
  9. Using the error message error reason code (ERR_REASON_CD) field, look up the related error message type code (ERR_MESSAGE_TP_CD) and error severity type code (ERR_SEVERITY_TP_CD) in the ErrReasonErrMsg_tbl data set concatenating them to the incoming record.
  10. Combine all of the input error messages and output them in a combined error message file. The error message file name pattern is the MDMIS FS_ERROR_DIR parameter concatenated with the record type (PARTY or CONTRACT), "_ErrCon", the MDMIS BATCH_ID parameter, and the file type identifier of ".txt". Then examine each incoming error message against the related MDMIS parameters, setting the error severity value as follows:
    1. DS_DETECTED_DUPLICATES_ACTION - Action to take if duplicate detected based on REC_TYPES_DUP_CHECK_LIST - E-Error all duplicates (default) / K-Keep first, error others.
    2. DS_FAILED_COLUMNIZATION_ACTION - Action if ANY row fails columnization - F-Fail (default) / C-Continue.
    3. DS_FAILED_RECORDIZATION_ACTION - Action if ANY row fails recordization. Warning Setting to C may break the row counter - F-Fail (default) / C-Continue.
    4. DS_PARTY_DROP_SEVERITY_LEVEL - Party will be dropped if there are errors with severity <= DS_PARTY_DROP_SEVERITY_LEVEL - 4 (default).
  11. Output the Internal IDs of the records to be dropped into a data set. Ensure that only those records with a severity suitable for dropping are gathered and that only the maximum severity for each associated Internal ID is output, with a data set naming pattern of the following:
    1. the MDMIS FS_DATA_SET_HEADER_DIR parameter concatenated with record type (Party or Contracts) "ToDrop_0" concatenated with the MDMIS BATCH_ID parameter and the file type identifier of ".ds".
    2. Top of loop: the increment counter is set to zero, and the loop runs until the counter reaches the MDMIS DS_DROP_MAX_ITERATIONS parameter value.
  12. Read in the Internal IDs of the records to be dropped.
    1. Use a data set naming pattern consisting of the concatenation of the following, in the order presented: the MDMIS FS_DATA_SET_HEADER_DIR parameter.
    2. Record type: Parties and Contracts.
    3. "ToDrop_".
    4. Loop increment value.
    5. MDMIS BATCH_ID parameter.
    6. File type identifier: ".ds".
  13. Sort the Internal IDs of the records to be dropped.
  14. Set up join keys for record type foreign keys as follows:
    1. Record type
      • Party Tables and Foreign Key, including Contact: PRVBY_INTERNAL_ID, ContactRel: INTERNAL_ID_FROM, and Identifier: ASSIGNEDBY_INTERNAL_ID.
      • Contract Table and Foreign Key, including Contract: REPLBY_INTERNAL_ID.
  15. Read in record type data set as follows:
    1. Use data set name pattern consisting of the concatenation of the MDMIS FS_DATA_SET_HEADER_DIR parameter.
    2. "ErrCon_"
    3. Record type:
      • Party Tables, including Contact, ContactRel, and Identifiers.
      • Contract Table, including Contract.
    4. The Nth job parameter of the loop increment counter.
    5. The MDMIS BATCH_ID parameter.
    6. The file type identifier of ".ds".
  16. Join the record type Internal IDs of the foreign keys to the internal IDs of the records to be dropped.
  17. Process record join set. If join was successful, then do the following:
    1. output related error message indicating record dropped.
    2. output the Internal ID of the record to be dropped.
  18. If join was unsuccessful, then output record to a data set using data set name pattern consisting of the concatenation of:
    1. The MDMIS FS_DATA_SET_HEADER_DIR parameter.
    2. "ErrCon_"
    3. Record type:
      1. Party Tables, including Contact, ContactRel, and Identifiers.
      2. Contract Table, including Contract.
    4. The loop increment counter.
    5. The MDMIS BATCH_ID parameter.
    6. The file type identifier of ".ds".
  19. Count the records to be dropped, ensuring that 0 is written to the file if no records are dropped. Use a file name pattern consisting of the concatenation of the following, in the order presented:
    1. MDMIS FS_TMP_DIR parameter.
    2. Record Type: Party and Contract.
    3. "DropCount_".
    4. Loop increment value.
    5. MDMIS BATCH_ID parameter.
    6. File type identifier: ".seq".
  20. Aggregate and output record identifiers to drop. Use data set naming pattern consisting of the concatenation of the following in the order presented:
    1. MDMIS FS_DATA_SET_HEADER_DIR parameter.
    2. Record type: Parties and Contracts.
    3. "ToDrop_".
    4. Loop increment value +1.
    5. MDMIS BATCH_ID parameter.
    6. File type identifier: ".ds".
  21. Append record identifiers to the final records to drop data set. Use data set naming pattern consisting of the concatenation of the following in the order presented:
    1. MDMIS FS_DATA_SET_HEADER_DIR parameter.
    2. Record type: Parties and Contracts.
    3. "ToDropFinal".
    4. MDMIS BATCH_ID parameter.
    5. File type identifier: ".ds".
  22. Examine loop increment setting:
    1. If the loop counter + 1 would exceed the MDMIS DS_DROP_MAX_ITERATIONS parameter and the count of records dropped is > 0, then abort the job.
    2. If the loop counter + 1 would exceed the MDMIS DS_DROP_MAX_ITERATIONS parameter and the count of records dropped is 0, then go to step 25.
    3. If the loop counter + 1 does not exceed the MDMIS DS_DROP_MAX_ITERATIONS parameter then continue.
  23. Rename ErrCon_ data sets increasing the loop increment portion of the data set name.
  24. Increment loop counter and return to step 13.
  25. Read in the final records to drop data set (step 21).
  26. Sort record identifiers.
  27. Open record data set.
    1. For those not part of the loop they are the data sets prefixed with "Valid".
    2. For those involved in the loop, the data set naming pattern is described in step 18.
  28. Join on Primary keys.
  29. Examine join set. If the join is successful then output related error message indicating record dropped by association. If the join is unsuccessful then output record to a data set using data set name pattern consisting of the concatenation of:
    1. MDMIS FS_DATA_SET_HEADER_DIR parameter.
    2. "ErrCon_".
    3. Record type data set name: as described in section BIL DataStage Import functionality.
    4. MDMIS BATCH_ID parameter.
    5. File type identifier of ".ds".
  30. Combine and output error message, then use a file name pattern consisting of the concatenation of the following in the order presented:
    1. MDMIS FS_ERROR_DIR parameter.
    2. Record Type: PARTY and CONTRACT.
    3. "_ErrCon".
    4. File type identifier ".txt".

BIL DataStage Data Quality Error Consolidation functionality extension guidance

Data Quality Error Consolidation functionality can be extended to address the addition of a new data table. Given that this functionality is focused on error messages, the other areas, additional record type sub-type (RT/ST) record columns and changes to incoming RT/ST column metadata, do not impact this functional area.

Extension guidance: Additional data tables

There are two distinct cases to take into consideration: when a new data table has foreign key relationships, and when it does not. To clarify, the resulting tables will have a primary key that points to the CONTEQUIV, CONTRACT, or NATIVEKEY tables; these are not viewed as foreign keys in the context of this discussion. In this context, foreign keys are fields, possibly mandatory, that point to an associated record instance. The following subsections present the extension guidance specific to these two cases.

Case 1: Additional data tables without foreign keys

In this instance the table needs to be added to the RT specific job (BL_040_EC_Party_Last_Drop or BL_040_EC_Contract_Last_Drop).

The Last Drop job utilizes the ILECDropByAssociation shared container. The shared container utilizes six parameters: the MDMIS parameter set name, VALID_RECS_INPUT_DS, VALID_RECS_OUTPUT_DS, SSK_FIELD_ONE, SSK_FIELD_TWO, and ERROR_REASON_CODE.

The MDMIS parameter set name is readily available in either of these jobs. The VALID_RECS_INPUT_DS points to the data set where those records that will be used in the Primary Key join with those records to be dropped are found.

The VALID_RECS_OUTPUT_DS points to the data set where records that were not found in error are written. SSK_FIELD_ONE and SSK_FIELD_TWO hold the ADMIN_SYS_TP_CD and either ADMIN_CLIENT_ID, ADMIN_CONTRACT_ID, or ADMIN_CLIENT_OR_CONTRACT_ID, depending on the applicable RT/ST. The ERROR_REASON_CODE is the RT/ST-specific "Record Dropped By Association" error reason code that should be established and stored in the ERRREASON table.
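
Listing 6 shows, purely as an illustration, the kind of values a new table might supply to the six parameters described above. The parameter names come from the shared container; the data set names, the parameter reference syntax, and the error reason code value are invented.

Listing 6. Hypothetical parameter values for ILECDropByAssociation
# Illustrative wiring of one new table into the Last Drop job.
drop_by_association_params = {
    "MDMIS": "MDMIS",                                               # parameter set name used by the job
    "VALID_RECS_INPUT_DS": "Valid_NewTable.#MDMIS.BATCH_ID#.ds",    # records for the primary key join
    "VALID_RECS_OUTPUT_DS": "ErrCon_NewTable.#MDMIS.BATCH_ID#.ds",  # records not found to be in error
    "SSK_FIELD_ONE": "ADMIN_SYS_TP_CD",
    "SSK_FIELD_TWO": "ADMIN_CLIENT_ID",                             # or ADMIN_CONTRACT_ID / ADMIN_CLIENT_OR_CONTRACT_ID
    "ERROR_REASON_CODE": "99901",                                   # new "Record Dropped By Association" code in ERRREASON
}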

Case 2: Additional data tables with foreign keys

In this instance the table needs to be added to the two RT-specific jobs that handle the iterative dropping of foreign key records (BL_040_EC_Party_Iterative_Drop or BL_040_EC_Contract_Iterative_Drop), and the final dropping of primary key records (BL_040_EC_Party_Last_Drop or BL_040_EC_Contract_Last_Drop).

Both of the iterative drop jobs utilize the tx_JoinPrepAndSplit transformer stage to set up the RT/ST foreign keys and place them onto an output link to a local container. The local containers available in the respective jobs provide a good template for creating a new local container to handle new foreign key joins and the appropriate outputs.

The final dropping of records in error, as described in the previous section, needs to be performed for the records of any new data tables.

Assign Identifiers and Insert functionality

The functionality to assign business identifiers and insert records into data tables, much like the other functional areas in BIL, follows a high-level pattern. This area of functionality presents additional opportunities for extension. The high-level pattern consists of the following steps:

  1. Read in the RT/ST specific data set.
  2. Create identifier field(s).
  3. Obtain and/or generate identifier(s).
    1. In applicable instances, create reference data set to be used by other related RT jobs.
    2. Records that require related identifiers obtain them from related RT reference data sets.
    3. Update identifier field(s).
  4. If applicable, extract embedded associated data records.
    1. If applicable, then create, obtain, or generate associated identifier(s) (see step 2 and 3 previously).
    2. Update identifier(s) fields.
  5. Insert records into the related operational database table.
  6. If applicable, insert records into the related history database table.

Figure 5 shows a flow chart representation of this high-level functional pattern.

Figure 5. Assign Identifier and Insert Pattern Flowchart
This figure shows a depiction of the assignment of record identifiers and the database insertion process as a flowchart.
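
Listing 7 is a hypothetical Python sketch of the pattern: assign an identifier, propagate it to any embedded associated records, and write operational (and optionally history) rows. The record shape and the insert callbacks are illustrative assumptions; the BIL jobs do this with shared containers and database stages.

Listing 7. Hypothetical Python sketch of the assign-and-insert pattern
import itertools

# Illustrative identifier source; the real jobs use surrogate key files and the
# AINextKeyPrefix / DLAINextKey shared containers described below.
next_cont_id = itertools.count(1000)

def assign_and_insert(records, insert_operational, insert_history, load_history=True):
    for rec in records:
        rec["CONT_ID"] = next(next_cont_id)             # steps 2 and 3: create and populate identifier
        for child in rec.pop("embedded_records", []):   # step 4: extract embedded associated records
            child["CONT_ID"] = rec["CONT_ID"]           # child rows carry the parent identifier
            insert_operational(child)
            if load_history:
                insert_history(child)
        insert_operational(rec)                         # step 5: operational table insert
        if load_history:                                # step 6: optional history table insert
            insert_history(rec)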

BIL DataStage Assign ID and Insert functionality

With regard to BIL, the Assign ID and Insert pattern is expressed in terms of the defined record types and record sub-types. There are distinct differences in how some of the RT/ST are handled. A majority of the RT/ST-defined data sets contain embedded related records, and these are used to establish general identifiers such as the CONTEQUIV CONT_ID, which is used in Party record type records. The following captures, at a high level, the most general flow of the Assign ID and Insert functionality in BIL with DataStage.

  1. Read in RT/ST specific data set using the following:
    1. MDMIS FS_DATA_SET_HEADER_DIR parameter.
    2. "ErrCon_"
    3. Record type data set name: as described in section BIL DataStage Import functionality step 7.
    4. MDMIS BATCH_ID parameter.
    5. File type identifier of ".ds".
  2. Create and append identifier field to each record.
  3. Create the identifier. If an RT-specific prefix is required, use the AINextKeyPrefix shared container with the following parameters (a sketch of the key composition follows this list):
    1. scpSK_MASK using the MDMIS SK_MASK parameter, which is the mask to describe the format of surrogate keys.
    2. scpSK_PREFIX_NEXT_VAL using the MDMIS parameter set RT-specific SK_PREFIX_*_NEXT_VAL.
      • Party: SK_PREFIX_CONT_ID_NEXT_VAL.
      • Contract: SK_PREFIX_CONTRACT_ID_NEXT_VAL.
    3. scpSK_PREFIX_SF using the MDMIS parameter set RT-specific SK_PREFIX_*_SF parameter.
      • Party: SK_PREFIX_CONT_ID_SF.
      • Contract: SK_PREFIX_CONTRACT_ID_SF.
    4. scpFS_SK_FILE_DIR using the MDMIS FS_SK_FILE_DIR parameter, which points to the directory that holds the surrogate key files.
  4. To generate an RT/ST-specific identifier, use the DLAINextKey shared container with the following parameters:
    1. scpSK_MASK using the MDMIS SK_MASK parameter, which is the mask to describe the format of surrogate keys.
    2. scpSK_MID_NEXT_VAL using the MDMIS parameter set RT/ST-specific SK_MID_*_NEXT_VAL.
    3. scpSK_MID_SF using the MDMIS parameter set RT/ST-specific SK_MID_*_SF parameter.
    4. scpFS_SK_FILE_DIR using the MDMIS FS_SK_FILE_DIR parameter, which points to the directory that holds the surrogate key files.
    5. scpSK_LOAD_SUFFIX using the MDMIS SK_LOAD_SUFFIX parameter which is a constant value that is appended to each surrogate key. This avoids possible key collisions with MDM Service generated IDs - 88 (default).
    6. scpID_COL_NAME which is the name of the output column that will be generated, BigInt and NonNull.
    7. scpDROP_COL_LIST consists of columns to DROP, used in the modify DROP keyChange,FOR_SK_PREFIX. Must supply at least one.
  5. Append to record.
  6. Drop any fields generated to facilitate the creation of the identifier (at this time all NULL_ fields are typically dropped).
  7. Where applicable, extract any associated embedded database table records and generate applicable identifiers (steps 2, 3 and 4 above).
    1. Party RT/ST-related database records:
      1. Contact: CONTEQUIV, CONTACT, PERSON and ORG
      2. OrgName: ORGNAME
      3. PersonName: PERSONSEARCH and PERSONNAME
      4. LOBRel: LOBREL
      5. Alert: ALERT
      6. Identifier: IDENTIFIER
      7. ContactRel: CONTACTREL
      8. MiscValue: MISCVALUE
      9. PrivPref: PRIVPREF and PPREFENTITY
      10. Address: ADDRESS, ADDRESSGROUP and LOCATIONGROUP
      11. ContactMethod: LOCATIONGROUP, PHONENUMBER, CONTACTMETHOD and CONTACTMETHODGROUP
    2. Contract RT/ST related database records:
      1. Contract: CONTRACT
      2. NativeKey: NATIVEKEY
      3. Alert: ALERT
      4. MiscValue: MISCVALUE
      5. ContractComponent: CONTRACTCOMPONENT
      6. ContractCompVal: CONTRACTCOMPVAL
      7. ContractRole: CONTRACTROLE
      8. RoleLocation: ROLELOCATION
  8. For each record type, format an Operational Table record.
  9. For each record type, format a History Table record.
  10. Insert Operational Table Record using the BLDBINTABLE database shared container with the following parameters:
    1. MDMIS: MDMIS parameter set name.
    2. O_TABLE: Operational Table name.
  11. If applicable, insert History Table Record. If MDMIS LOAD_HISTORY_FLAG = "C" then use the BLDBINHISTORY database shared container with the following parameters, otherwise ignore.
    1. MDMIS : MDMIS parameter set name.
    2. H_TABLE: History Table name.
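
Listing 8 makes the role of the key generation parameters concrete with a hypothetical sketch that composes a key from a next value, an optional RT-specific prefix, and the SK_LOAD_SUFFIX constant. This is a simplification and an assumption: the real shared containers also apply the SK_MASK format and maintain the next value in surrogate key files under FS_SK_FILE_DIR.

Listing 8. Hypothetical sketch of surrogate key composition
def next_surrogate_key(next_val, load_suffix="88", prefix=""):
    """Compose a surrogate key from an optional RT-specific prefix, the next value,
    and the SK_LOAD_SUFFIX constant that keeps batch-loaded keys clear of MDM
    service-generated IDs. The SK_MASK formatting is intentionally omitted here."""
    return int("%s%d%s" % (prefix, next_val, load_suffix))

# Illustrative usage with invented values.
print(next_surrogate_key(12345))               # 1234588
print(next_surrogate_key(12345, prefix="9"))   # 91234588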

BIL DataStage Assign ID and Insert functionality extension guidance

Assign ID and Insert functionality can be extended to address the addition of new data tables. The record types (RT) of Party and Contract have well defined sub-types (ST) that can be added in the case when related database tables are added to BIL.

The SIF mapping spreadsheet provides several good examples of SIF records that consist of related database table records that are intertwined (LocationGroup_AddressGroup_Address), as well as records that are specific to a single table (Identifier). How the SIF records are defined will influence the direction taken for the extension.

The following steps represent the necessary tasks to accomplish the addition of new data tables to the Assign ID and Insert processing flow in BIL.

  1. Examine the new record keys. Is the structure of the identifier similar to that of a particular existing database table? If so, examine how keys are generated in the related job. Examine the parameters associated with the AssignIdContainers DLAINextKey shared container, then examine the parameters associated with the AssignIdContainers AINextKeyPrefix shared container to see if the keys will require a prefix. Both sets of properties are described in the BIL DataStage Assign ID and Insert functionality section.
  2. Are new incoming SIF records representative of more than one database table record?
    1. No. The new incoming SIF records are representative of a single database record type. Examine the following existing jobs to select one as a template. Consider the record type and whether the identifiers are similar in structure:
      • BL_060_AI_Identifier
      • BL_060_AI_OrgName
      • BL_060_AI_ContactRel
      • BL_060_AI_LOBRel
      • BL_061_AI_Contract_Role
      • BL_061_AI_Role_Location
    2. Yes. The new incoming SIF records are representative of records for more than one database table. Examine the following existing jobs to select one as a template. Consider the number of intertwined records and whether the identifiers are similar in structure. Note that no Contract record type jobs are represented:
      • BL_060_AI_Address
      • BL_060_AI_ContactMethod
      • BL_060_AI_PersonName
      • BL_060_AI_PrivPref
  3. Using Save As, create a new job from the job selected above. Follow the same naming convention:
    1. Party related jobs BL_060_AI_*.
    2. Contract related jobs BL_061_AI_*.
    3. If a new record type is added, begin numbering at 062.
  4. Handle incoming output link properties. Runtime column propagation (RCP) is used extensively, so in some instances the input column metadata does not need to be defined.
  5. Handle output column metadata on related stages that contain column definitions on output links. In some instances RCP is used, so there may be a partial set of defined columns.
  6. Set up surrogate key file:
    1. Examine naming convention set up in the MDMIS parameter set.
    2. Use the directory location as is found in the MDMIS parameter set FS_SK_FILE_DIR parameter.
    3. Examine the BL_061_AI_SF_Delete job adding in the new surrogate key file(s).
    4. Examine the BL_061_AI_SF_Create job adding the new surrogate key file(s).
  7. Examine and set up the parameters for the key generation shared containers. They should reflect the results of step 1.
  8. Examine the DBContainers shared containers parameters and set appropriately.
    1. BLDBINTABLE: parameters are described in the BIL DataStage Assign ID and Insert functionality section.
    2. BLDBINHISTORY: parameters are described in the BIL DataStage Assign ID and Insert functionality section.
  9. Force compile job.
  10. Examine the record type specific job sequence and add the new job to the sequence. Consider the new RT/ST relationships: Party: BL_060__AI_LD_PARTY and Contract: BL_061__AI_LD_CONTRACT. If it is a new RT, add it to BL_000_BASIC_LOAD.

By carefully following these steps, you can accomplish the loading of a new set of database records into a previously unsupported database table, using the BIL DataStage jobs as templates and a framework.


Conclusion

The BIL DataStage job sequences and jobs provide both a framework and a template for effective extension of the existing functionality. The extension guidance provides a moderate level of detail for properly extending the existing functionality.

This article does not provide guidance on modifying the existing functionality. However, the extension guidance does explicitly identify the functional areas where modifications can be made.

In terms of modifications, alternative identifier/key creation logic is the most often mentioned. The extension guidance identifies the identifier creation logic, and the design provides shared containers specifically for that purpose.

It is important to note that any shared container can be made into a local container in any related job, allowing the logic to be specifically tailored to the requirements and needs of the job. If this tactic is taken, it is strongly recommended to convert the local container back to a shared container and store it with a unique name, to safeguard it in the event that a BIL fix pack distribution would inadvertently overwrite changes made to a job or jobs.

The other area where modification of the existing jobs is often mentioned is validation. In that regard, the Edit Point shared containers are provided so that specific treatment of incoming records can be performed. It is recommended that the Edit Point shared container be converted to a local container, altered, and then saved with a unique name.
