Introduction to IBM InfoSphere Master Data Management Reference Data Management Hub V10

IBM® InfoSphere® Master Data Management (MDM) Reference Data Management Hub v10.0 (RDM), released in July 2012, is a new product providing a series of features that are essential to seamlessly managing reference data in the information landscape across different business domains, including comprehensive authoring and lifecycle management functionality, versioning model, role-based security, and access control.

Fenglian Xu (xufengli@uk.ibm.com), Software Engineer, IBM

Photo of Fenglian XuFenglian Xu is a lead developer in the IBM Reference Data Management development team at the Hursley lab in UK. With her 15 years IT industrial experience, she has contributed to the portfolio of the IBM middleware products: WebSphere Service Registry and Repository, WebSphere Enterprise Service Bus, and WebSphere Process Server etc. Her expertise includes IBM middleware product integration in service-oriented architectures, J2EE and Web Services. She is also an IBM developerWorks Contributing Author and has over 20 publications. She earned a B.S in Mathematics from Xian Jiaotong University in 1989, and a Ph.D. in Computer Science from the University of Southampton in 1998. You can contact Fenglian at xufengli@uk.ibm.com.


developerWorks Contributing author
        level

Sushain Pandit (sushain.pandit@us.ibm.com), Staff Engineer, IBM

Author photo PanditSushain Pandit is an IBM Master Inventor and Staff Engineer in IBM's Austin lab, with experience contributing to industry leading enterprise information management solutions and products. Currently, he is working in the Master Data Management portfolio Research and Development and contributing to the development, architecture, customer engagement and adoption of reference data management product. He has filed over 15 US patent applications, has co-authored five published papers, and holds a masters degree through research in computer science from Iowa State University. His interests broadly span structured and unstructured data, information extraction and integration, semantic web and machine learning.



Dan Mandelstein (dmandel@us.ibm.com), Product Development Manager, IBM

Author photo MandelsteinDan Mandelstein is an IBM Certified IT Architect, with 20+ years working in the IT industry. Currently, he is working in the Master Data Management portfolio and is the Architect and Development Manager for IBM's reference data management product. He has a Bachelor of Science degree in computer science from the University of Texas and a Masters degree in computer science from the University of Texas at Dallas.



04 October 2012

Also available in Chinese Japanese

This article describes the key reference data concepts, data model overview, high-level architecture, walks through a scenario around creation of reference data sets and mappings, and provides some preliminary insights into export and distribution of reference data to external systems.

Introduction

Although enterprise MDM systems naturally emphasize managing master data, a special class called reference data is used to define aspects (like the range of permissible values for an attribute) of the master data entities captured. Examples of reference data include sets of status codes, employee types, states/province/country codes, accounting codes, and so on. Such reference data sets, comprising a range of permissible values for entity attributes, usually reside within specialized tables known as look-up, code, check, or domain tables.

Reference data can be distinguished from metadata and master data. Firstly, metadata describes the structure of an entity whereas reference data describes only the range of permissible values for an attribute of an entity. Reference data changes less frequently over time as compared to master data. Further, reference data tends to have a semantic meaning at the row (or instance) level, while master data has entity-level semantics. Lastly, reference data semantics may change over time. For instance, an organizational code associated with an entity in a data warehouse (storing records spanning past decade) might refer to the organization before or after a merger.

Reference data codes have applications in controlling permissible values of database columns, classification of related entities, performing domain look-ups, or relating a controlled vocabulary of terms with their valid range.

Although enterprise reference data tends to be standardized before getting consumed by individual operational systems and applications, it often differs in representation or semantics across different, often silo-ed, applications. This semantic difference can be unavoidable since applications often require their local representations for improved processing. For example, in Figure 1, columns C2 in Source and Target represent the set of country codes, which is a type of reference data. However, as observed, source and target tables have different representations of the same country codes.

Figure 1. Different representations of country codes
Image is of the different representations of country codes

Due to this reason, before performing data integration or distribution (moving data from a source to a target system), there is a need to transform the source representation to one that can be understood by the target. This process is known as reference data transcoding and is a key step in master data integration and distribution pipeline.

This general scenario motivates a set of requirements on which any comprehensive reference data management solution ought to be based. The following section looks at the high-level model of a reference data management system.


Reference data logical model

A reference data management system has a set of key entities that help in managing the overall reference data lifecycle. Figure 2 shows these key entities and their high-level inter-relations. This section describes these entities and related concepts in detail.

Figure 2. Reference data logical model
Image is of the reference data logical model

Managed entity

A managed entity is a blueprint of an abstract entity that contains enough information (properties) necessary for managing the lifecycle of any entity that extends it.

These properties comprise name and/or description, owner(s), version, lifecycle state, timestamps (effective, expiry, review, and so on), and type. Entities in the model shown previously in Figure 2 (sets, mapping sets, hierarchies), all share the common attributes from the managed entity definition that is shown in Figure 3.

Figure 3. Managed entity
Image is of the managed entities like description, owner, version, and so on.

Reference data set

Reference data sets form the core of a reference data management system and are used to store reference data values. A reference data set (like all other first-class citizens in a reference data management system) is represented as a specialized managed entity comprised of one or more reference data values as shown in Figure 4, with possible properties like Code, Name, Description, and so on.

Figure 4. An example of a reference data set
Image is of an example of a reference data set with possible properties like code, name, description, and so on.

Some of the properties (such as Code, Name) may be designed to be required, in which case, they must be set with an initial value at the time the reference data value is being created.

Each reference data value should usually have a sequence number, and related audit information such as review date, effective date, and expiration date.

Reference data sets can be organized using a grouping scheme into a folder hierarchy, such as grouping by users, grouping by topics, and so on.

Translations

Reference data values can have multiple language translations, and these translations can also have associated descriptions. A reference data management system provides you with the ability to define these translations and associate them with the reference data value.

Reference data types

As mentioned previously, every reference data set created within a reference data management system has a set of properties associated both at the set and value level. There is a default set of core or fixed properties and an optional set of dynamic or custom properties. Properties (default or custom) maintained at the reference data set level are called data set properties, while those maintained at the reference data value level are called data value properties.

A few examples of core value properties are Code, Name, Description, Sort order, Review date, Effective date, Expiration date, and so on. At the set level, core properties tend to be Name, Description, Owner, Version, Lifecycle process, State, Type, Review date, Effective date, Expiration date, last update date, and so on.

Dynamic or custom properties are defined using entities called reference data types, and every reference data set definition is associated with such a type or a default type.

In general, a reference data management system defines data types for all of its top-level entities (sets, mappings, hierarchies, and so on). Usually a few default data types are provided out-of-the-box to get a user started. These default types can be used to create simple reference data entities (such as sets) that only need the default properties. Custom data types with additional (custom) properties can be used to define more complex entities.

Custom properties

As mentioned previously, custom properties can be created at the reference data set and/or value level. Generally supported data types for custom properties are as follows.

  • String: Any character data. The contents can be constrained by applying a regular expression to validate the content.
  • Text: One or more strings spanning multiple lines.
  • Integer: A valid integer.
  • Date: A valid date, containing day, month and year.
  • TimeStamp: A two part value that consists of a valid date and valid time of day.
  • Reference data set: A relationship to another reference data set. The range of values taken by this relationship is decided by the reference data set that it points to, such as a relation called hasState from a City set to State set.
  • Boolean: A value representing a true or false state.
  • URL: A valid URL string.

More information on data validation of property types can be found in the RDM information center located in the Resources section.

Reference data mapping

Reference data mappings provide a way for you to define relationships across reference data sets and set values. Reference data mappings are also maintained as first-class citizens (alongside sets) and defined as managed mapping set entities within a reference data management system. In addition to the properties inherited from the managed entity, a mapping has additional properties to represent source and target sets.

Once a mapping set is created, value mappings can be created by selecting a source value and a target value from source and target sets respectively. A reference data value mapping may contain all the necessary information and attributes much like a reference data value.


Architectural overview

This next section gives an architectural overview of InfoSphere MDM Reference Data Management Hub and demonstrates how it implements the previously described concepts.

The InfoSphere Master Data Management (MDM) Reference Data Management Hub is built on the proven InfoSphere MDM platform and delivers a master data management approach to managing enterprise reference data. It helps reduce business risk, improve enterprise data quality, and enhance operational efficiency. RDM is based on a three-tiered component architecture, comprising a client and a server application interacting with a back-end database, which is hosting the application-specific data and required metadata. Figure 5 shows a high-level overview of RDM.

Figure 5. RDM logical architecture
Image is of the RDM logical architecture, which is three tiered component comprising a client and server application interacting with a backend database.

The client-side is a web UI application with the key design goal of allowing collaborative authoring and having a flexible data model. Business users can use the web UI to define new reference data sets and manage reference data standards without requiring significant IT intervention. The web UI is also designed for role-based users to view, author, map, and approve reference data sets within a central repository. This allows reference data sets to be created and managed in a controlled manner. User actions on the web UI trigger requests, which get handled by appropriate service controllers present in the REST layer. The REST layer services invoke the server-side transactions to manage CRUD procedures on RDM database. Figure 6 shows the high-level component architecture.

Figure 6. RDM component architecture
Image is of the overview of the RDM compontent architecture.

The server-side is built upon the base Custom Domain Hub (CDH) by defining the reference data domain model along with reference data management services. This allows RDM to re-use several out-of-the-box features of CDH such as business rules, event notification, data quality, and audit history. In addition, several reference data management specific services are implemented to achieve key functionality like import and export, reference data set lifecycle management, transcoding and distribution, versioning, and so forth.

The client and server enterprise archives reside in a WebSphere Application Server instance. There is a support for clustering, but it is recommended that the client and server reside on the same node. The currently supported databases are IBM DB2 and Oracle.


Example scenario: Utilizing RDM hub to author and map reference data

The following section demonstrates a typical reference data transcoding scenario and shows how a reference data management system can be used by customers.

RDM provides a web interface, which allows a data steward to create their reference data sets and mappings between the sets. Business analysts and knowledge managers can manage the entire lifecycle of data entities through this interface.

The following demonstration is developed using a working instance of IBM RDM v10.

Create reference data sets

  1. Log on to the RDM application via a web browser like Firefox 10 or IE 9 with a valid URL, for example: https://localhost:9043/RefDataClient. You are presented with a multiple-tabbed web page.
  2. The Reference Data Sets tab is selected by default and it comprises the folder view and the list view. Reference Data Sets can be organized in folders within the Folder view. You can either create a new folder or select an existing folder such as Examples, as shown in Figure 7.
    Figure 7. Create a set wizard
    Image is of the create a set wizard where you can create a new folder or select an existing folder like Examples.
    For this scenario, all the sets will be created under the Examples folder. If you like, you can create a new set by either right-clicking to bring up a context menu, or clicking New Set under the Reference sets list. The Create a Set page shown in Figure 7 provides the following property values:
    • Name: MDMStates (Name of the reference data set to be created).
    • Version:1 (First version of the set).
    • Type: CountryType (A user-defined data type containing the default set and value-level properties and an additional custom property, Country, which points to a reference data set called Country, with a default data type. This assumes that a reference data set Country has been created prior to doing this).
    • Lifecycle process: Simple approval process (A simple-state lifecycle model for managing the reference data set).
  3. Click OK and a new set named MDMStates is created. You can then select the MDMStates set, and create set values manually by adding new values from UI or import a predefined CSV or XML format MDMStates code values file into the set using the import wizard.
  4. Import the MDMStates.CSV file into the MDMStates set by right-clicking MDMStates set to start the Import wizard. The Import wizard provides an intuitive way to import data into RDM application through a set of guided navigation panels. Various operations that are to be performed are listed towards the left. They comprise choosing import definition, mapping file columns to reference data set properties, allowing a data preview before performing the import, and displaying a summary after the import is finished.
  5. The first page of the wizard prompts you to provide date format, file format, a separator, and the CSV file path as shown in Figure 8.
    Figure 8. Choose import file
    Image is of the import reference data set wizard where you choose the import file to provide date format, file format, and so on.
  6. Click Next and map the reference data set properties to the column names in the CSV file using Map File Columns page as shown in Figure 9.
    Figure 9. Map file columns
    Image is of the map file column names in the CSV file.
  7. Inspect the auto-mapped columns and manually correct any inaccurate mapping. By default, the RDM application uses matching logic to try and map obvious matches between property file column names and reference data set properties. If the CSV file has translations for reference data values in addition to set properties, Translation Language, Translation Value, and Translation Description would also be auto-mapped to corresponding columns from the import file.
  8. Click Next to reveal a Preview File page displaying the warnings, exceptions, and so on. In this file, as there are no warnings or exceptions, the Preview File page is empty. The import process skips the entire row if a column value in a row has an error such as date format does not match the specified data format, and so on.
  9. Click Next to get the Summary page which shows you a summary on how many items are successfully imported, as shown in Figure 10.
    Figure 10. Summary
    Image is of the summary of the import reference data set showing how many items were successfully imported.

    This shows that 51 values have been successfully imported into the MDMStates set.

  10. Click Done on the Summary page to complete the importing process. All the 51 reference data set values are created for the MDMStates set and listed under the Reference value. You can view each value properties by clicking on the value as shown in Figure 11.
    Figure 11. Reference value properties
    Image shows the reference value properties for the MDMStates set.

Similarly, create another set called CRMStates and either manually create or import code values into it.

Create reference data mapping between the sets

Now, you will use the reference data mapping functionality to create a mapping between the values in the MDMStates and CRMStates set.

  1. Navigate to the Mappings tab and create a new mapping by clicking New, as shown in Figure 12.
    Figure 12. Create mapping
    Image shows the create a mapping wizard where you click NEW to create a new mapping.
  2. Specify source and target sets as the ones created previously (MDMStates, CRMStates), and other properties as indicated in Figure 12.
  3. Click OK and a new mapping set will be created and opened for editing.
  4. Start creating value mappings by clicking New in the Value Mappings frame as in Figure 13.
    Figure 13. Create value mapping
    Image is of the create a mapping wizard where you start creating the value mappings.
  5. Select values from source and target sets, and click Create Mapping.
  6. Specify any properties on the Create Value Mapping page and click OK, as shown in Figure 14.
    Figure 14. Values from source and target
    Image is of the create a mapping wizard where you specify the properties on the create value mapping.
    Once all the value mappings have been defined, as shown in Figure 15, the mapping set would contain the knowledge necessary to perform transcoding of reference data values from source to target systems.
    Figure 15. Value mappings defined
    Image shows all of the value mappings defined

Export reference data set

As the next step in the transcoding process, a steward or user could export the mapping created previously and use it to transform values between the source and target in the information integration pipeline for semantic equivalence and consistency. RDM web interface provides an export capability to export a reference data set, or a mapping to a file (character-delimited or XML). For doing this, within the mappings view, right-click on the mapping created previously to bring up a context menu and select Export to either an XML or a CSV file. For applications requiring a programmatic way to export mappings, RDM provides a simple-to-understand web services API, which could be invoked by the applications. RDM also provides a batch exporter, which could be directly invoked from the command-line. Finally, functionality is provided whereby systems could subscribe to the reference data sets that they are interested in, and have the changes to those sets published directly.


Conclusion

This article introduced key reference data concepts, the underlying model, a high-level architecture, and performed a walk through of a simple transcoding scenario around the creation of reference data sets and mappings.

Acknowledgement

We would like to thank Erik O'Neill for his review, feedback and perspectives during the development of this article.


Download

DescriptionNameSize
Sample for this articleSample.zip3KB

Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.
  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

  • Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=838176
ArticleTitle=Introduction to IBM InfoSphere Master Data Management Reference Data Management Hub V10
publish-date=10042012