This article describes key reference data concepts, provides an overview of the data model and high-level architecture, walks through a scenario involving the creation of reference data sets and mappings, and offers some preliminary insights into the export and distribution of reference data to external systems.
Although enterprise MDM systems naturally emphasize managing master data, a special class of data called reference data is used to define aspects (such as the range of permissible values for an attribute) of the master data entities captured. Examples of reference data include sets of status codes, employee types, state/province/country codes, accounting codes, and so on. Such reference data sets, comprising a range of permissible values for entity attributes, usually reside within specialized tables known as look-up, code, check, or domain tables.
Reference data can be distinguished from both metadata and master data. First, metadata describes the structure of an entity, whereas reference data describes only the range of permissible values for an attribute of an entity. Reference data also changes less frequently over time than master data. Further, reference data tends to carry semantic meaning at the row (or instance) level, while master data has entity-level semantics. Lastly, reference data semantics may change over time. For instance, an organizational code associated with an entity in a data warehouse (storing records spanning the past decade) might refer to the organization before or after a merger.
Reference data codes are used to control the permissible values of database columns, classify related entities, perform domain look-ups, and relate a controlled vocabulary of terms to its valid range.
Although enterprise reference data tends to be standardized before being consumed by individual operational systems and applications, it often differs in representation or semantics across different, often siloed, applications. This semantic difference can be unavoidable, since applications often require local representations for improved processing. For example, in Figure 1, the columns labeled C2 in the source and target tables both represent country codes, which are a type of reference data. However, as shown, the source and target tables use different representations of the same country codes.
Figure 1. Different representations of country codes
For this reason, before performing data integration or distribution (moving data from a source to a target system), the source representation must be transformed into one that the target system can understand. This process is known as reference data transcoding and is a key step in the master data integration and distribution pipeline.
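At its core, transcoding is a value-level lookup from the source representation to the target's. The following sketch illustrates the idea; the specific codes and the mapping table are hypothetical, not taken from Figure 1.

```python
# Hypothetical source-to-target country-code transcoding table.
# In practice, such a table would be maintained in (and exported from)
# the reference data management system, not hard-coded.
COUNTRY_TRANSCODE = {
    "US": "840",   # source uses alpha-2 codes, target uses numeric codes
    "GB": "826",
    "DE": "276",
}

def transcode(record, column, mapping):
    """Replace the source representation in `column` with the target one."""
    value = record[column]
    if value not in mapping:
        raise KeyError(f"No mapping defined for source code {value!r}")
    return {**record, column: mapping[value]}

source_row = {"C1": "Acme Ltd", "C2": "GB"}
target_row = transcode(source_row, "C2", COUNTRY_TRANSCODE)
# target_row now carries the target system's representation, "826"
```

A record whose code has no defined mapping raises an error rather than passing through silently, which mirrors the consistency goal of the transcoding step.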
This general scenario motivates a set of requirements on which any comprehensive reference data management solution ought to be based. The following section looks at the high-level model of a reference data management system.
A reference data management system has a set of key entities that help in managing the overall reference data lifecycle. Figure 2 shows these key entities and their high-level inter-relations. This section describes these entities and related concepts in detail.
Figure 2. Reference data logical model
A managed entity is a blueprint of an abstract entity that contains enough information (properties) necessary for managing the lifecycle of any entity that extends it.
These properties comprise name and/or description, owner(s), version, lifecycle state, timestamps (effective, expiry, review, and so on), and type. Entities in the model shown previously in Figure 2 (sets, mapping sets, hierarchies), all share the common attributes from the managed entity definition that is shown in Figure 3.
Figure 3. Managed entity
Reference data sets form the core of a reference data management system and are used to store reference data values. A reference data set (like all other first-class citizens in a reference data management system) is represented as a specialized managed entity comprising one or more reference data values, as shown in Figure 4, with properties such as Code, Name, Description, and so on.
Figure 4. An example of a reference data set
Some of the properties (such as Code and Name) may be designed to be required, in which case they must be set with an initial value at the time the reference data value is created.
Each reference data value should usually have a sequence number, and related audit information such as review date, effective date, and expiration date.
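As a minimal sketch of the model described above, the shared managed-entity properties can be factored out and inherited by sets (the field names here are illustrative, not the product's actual schema):

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class ManagedEntity:
    # Common properties shared by sets, mapping sets, and hierarchies.
    name: str
    description: str = ""
    owner: str = ""
    version: int = 1
    state: str = "DRAFT"                       # lifecycle state
    effective_date: Optional[date] = None
    expiration_date: Optional[date] = None

@dataclass
class ReferenceDataValue:
    code: str                                  # required core property
    name: str                                  # required core property
    description: str = ""
    sort_order: int = 0

@dataclass
class ReferenceDataSet(ManagedEntity):
    values: list = field(default_factory=list)

states = ReferenceDataSet(name="MDMStates", owner="steward")
states.values.append(ReferenceDataValue(code="TX", name="Texas"))
```

The required core properties (Code, Name) have no defaults, so they must be supplied when a value is created, matching the behavior described above.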
Reference data sets can be organized using a grouping scheme into a folder hierarchy, such as grouping by users, grouping by topics, and so on.
Reference data values can have multiple language translations, and these translations can also have associated descriptions. A reference data management system provides you with the ability to define these translations and associate them with the reference data value.
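A simple way to picture this is a per-code, per-language table of translated values, each with its own description; the structure and the sample translations below are illustrative assumptions:

```python
# Hypothetical translations attached to reference data values, keyed by
# value code and language; each translation carries its own description.
TRANSLATIONS = {
    "DE": {
        "fr": {"value": "Allemagne", "description": "Code pays pour l'Allemagne"},
        "es": {"value": "Alemania", "description": "Código de país para Alemania"},
    },
}

def translated_value(code, lang, translations, default=None):
    """Return the translated display value for a code, if one exists."""
    entry = translations.get(code, {}).get(lang)
    return entry["value"] if entry else default
```

When no translation exists for a requested language, a caller can fall back to the value's default name.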
As mentioned previously, every reference data set created within a reference data management system has a set of properties associated both at the set and value level. There is a default set of core or fixed properties and an optional set of dynamic or custom properties. Properties (default or custom) maintained at the reference data set level are called data set properties, while those maintained at the reference data value level are called data value properties.
A few examples of core value properties are Code, Name, Description, Sort order, Review date, Effective date, Expiration date, and so on. At the set level, core properties tend to be Name, Description, Owner, Version, Lifecycle process, State, Type, Review date, Effective date, Expiration date, last update date, and so on.
Dynamic or custom properties are defined using entities called reference data types, and every reference data set definition is associated with such a type or a default type.
In general, a reference data management system defines data types for all of its top-level entities (sets, mappings, hierarchies, and so on). Usually a few default data types are provided out-of-the-box to get a user started. These default types can be used to create simple reference data entities (such as sets) that only need the default properties. Custom data types with additional (custom) properties can be used to define more complex entities.
As mentioned previously, custom properties can be created at the reference data set and/or value level. Generally supported data types for custom properties are as follows.
- String: Any character data. The contents can be constrained by applying a regular expression to validate the content.
- Text: One or more strings spanning multiple lines.
- Integer: A valid integer.
- Date: A valid date, containing day, month and year.
- TimeStamp: A two part value that consists of a valid date and valid time of day.
- Reference data set: A relationship to another reference data set. The range of values taken by this relationship is decided by the reference data set that it points to, such as a relation called hasState from a City set to State set.
- Boolean: A value representing a true or false state.
- URL: A valid URL string.
More information on data validation of property types can be found in the RDM information center located in the Resources section.
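The validation behavior of these property data types can be sketched generically. The type names mirror the list above, but the validator shapes are assumptions, not RDM's actual validation API:

```python
import re
from datetime import datetime

# Illustrative validators, one per property data type described above.

def validate_string(value, pattern=None):
    """String: optionally constrained by a regular expression."""
    return isinstance(value, str) and (
        pattern is None or re.fullmatch(pattern, value) is not None)

def validate_integer(value):
    """Integer: any value convertible to a valid integer."""
    try:
        int(value)
        return True
    except (TypeError, ValueError):
        return False

def validate_date(value, fmt="%Y-%m-%d"):
    """Date: day, month, and year in the expected format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False

def validate_set_reference(value, referenced_set):
    # A "reference data set" property is valid only if the value exists
    # in the set it points to (e.g., hasState from a City set to a State set).
    return value in referenced_set
```

For example, `validate_string("US-48", pattern=r"[A-Z]{2}-\d{2}")` succeeds, while a date in the wrong format fails validation.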
Reference data mappings provide a way for you to define relationships across reference data sets and set values. Reference data mappings are also maintained as first-class citizens (alongside sets) and defined as managed mapping set entities within a reference data management system. In addition to the properties inherited from the managed entity, a mapping has additional properties to represent source and target sets.
Once a mapping set is created, value mappings can be created by selecting a source value and a target value from the source and target sets, respectively. Much like a reference data value, a reference data value mapping can carry all the necessary information and attributes.
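A mapping set can be pictured as a managed entity that carries source and target set references plus a collection of value-level mappings. The following is a minimal in-memory sketch with illustrative class and field names:

```python
class MappingSet:
    """Relates values in a source reference data set to a target set."""

    def __init__(self, name, source_set, target_set):
        self.name = name
        self.source_set = source_set      # additional properties beyond
        self.target_set = target_set      # those of a plain managed entity
        self.value_mappings = {}          # source code -> target code

    def map_value(self, source_code, target_code):
        # Only allow mappings between codes that exist in both sets.
        if source_code not in self.source_set:
            raise ValueError(f"{source_code!r} not in source set")
        if target_code not in self.target_set:
            raise ValueError(f"{target_code!r} not in target set")
        self.value_mappings[source_code] = target_code

mdm_states = {"TX", "CA"}
crm_states = {"Texas", "California"}
mapping = MappingSet("MDMStates_to_CRMStates", mdm_states, crm_states)
mapping.map_value("TX", "Texas")
```

Constraining each value mapping to codes that actually exist in the source and target sets mirrors how the UI has you pick values from the two sets rather than type free-form codes.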
This next section gives an architectural overview of InfoSphere MDM Reference Data Management Hub and demonstrates how it implements the previously described concepts.
The InfoSphere Master Data Management (MDM) Reference Data Management Hub is built on the proven InfoSphere MDM platform and delivers a master data management approach to managing enterprise reference data. It helps reduce business risk, improve enterprise data quality, and enhance operational efficiency. RDM is based on a three-tiered component architecture comprising a client and a server application interacting with a back-end database, which hosts the application-specific data and required metadata. Figure 5 shows a high-level overview of RDM.
Figure 5. RDM logical architecture
The client side is a web UI application with the key design goals of allowing collaborative authoring and having a flexible data model. Business users can use the web UI to define new reference data sets and manage reference data standards without requiring significant IT intervention. The web UI is also designed for role-based users to view, author, map, and approve reference data sets within a central repository. This allows reference data sets to be created and managed in a controlled manner. User actions on the web UI trigger requests, which are handled by appropriate service controllers in the REST layer. The REST layer services invoke the server-side transactions to manage CRUD procedures on the RDM database. Figure 6 shows the high-level component architecture.
Figure 6. RDM component architecture
The server side is built upon the base Custom Domain Hub (CDH) by defining the reference data domain model along with reference data management services. This allows RDM to reuse several out-of-the-box features of CDH such as business rules, event notification, data quality, and audit history. In addition, several reference-data-specific services are implemented to achieve key functionality such as import and export, reference data set lifecycle management, transcoding and distribution, versioning, and so forth.
The client and server enterprise archives reside in a WebSphere Application Server instance. Clustering is supported, but it is recommended that the client and server reside on the same node. The currently supported databases are IBM DB2 and Oracle.
The following section demonstrates a typical reference data transcoding scenario and shows how a reference data management system can be used by customers.
RDM provides a web interface that allows data stewards to create reference data sets and mappings between the sets. Business analysts and knowledge managers can manage the entire lifecycle of data entities through this interface.
The following demonstration is developed using a working instance of IBM RDM v10.
- Log on to the RDM application via a web browser such as Firefox 10 or IE 9 with a valid URL, for example: https://localhost:9043/RefDataClient. You are presented with a multiple-tabbed web page.
- The Reference Data Sets tab is selected by default and comprises the folder view and the list view. Reference data sets can be organized in folders within the Folder view. You can either create a new folder or select an existing folder such as Examples, as shown in Figure 7.
Figure 7. Create a set wizard
For this scenario, all the sets will be created under the Examples folder. You can create a new set by either right-clicking to bring up a context menu or clicking New Set under the Reference sets list. The Create a Set page shown in Figure 7 takes the following property values:
- Name: MDMStates (Name of the reference data set to be created).
- Version: 1 (First version of the set).
- Type: CountryType (A user-defined data type containing the default set and value-level properties and an additional custom property, Country, which points to a reference data set called Country, with a default data type. This assumes that a reference data set Country has been created prior to doing this).
- Lifecycle process: Simple approval process (A simple-state lifecycle model for managing the reference data set).
- Click OK and a new set named MDMStates is created. You can then select the MDMStates set and create set values manually by adding new values from the UI, or import a predefined CSV- or XML-format MDMStates code values file into the set using the import wizard.
- Import the MDMStates.CSV file into the MDMStates set by right-clicking the MDMStates set to start the Import wizard. The Import wizard provides an intuitive way to import data into the RDM application through a set of guided navigation panels. The various operations to be performed are listed toward the left. They comprise choosing the import definition, mapping file columns to reference data set properties, previewing the data before performing the import, and displaying a summary after the import is finished.
- The first page of the wizard prompts you to provide the date format, file format, a separator, and the CSV file path, as shown in Figure 8.
Figure 8. Choose import file
- Click Next and map the reference data set properties to the column names in the CSV file using the Map File Columns page, as shown in Figure 9.
Figure 9. Map file columns
- Inspect the auto-mapped columns and manually correct any inaccurate mappings. By default, the RDM application uses matching logic to try to map obvious matches between import file column names and reference data set properties. If the CSV file has translations for reference data values in addition to set properties, Translation Language, Translation Value, and Translation Description are also auto-mapped to the corresponding columns from the import file.
- Click Next to reveal the Preview File page, which displays any warnings, exceptions, and so on. Because this file has no warnings or exceptions, the Preview File page is empty. The import process skips an entire row if any column value in that row has an error, such as a date that does not match the specified date format.
- Click Next to get the Summary page, which shows you how many items were successfully imported, as shown in Figure 10.
Figure 10. Summary
This shows that 51 values have been successfully imported into the MDMStates set.
- Click Done on the Summary page to complete the import process. All 51 reference data set values are created for the MDMStates set and listed under Reference value. You can view each value's properties by clicking the value, as shown in Figure 11.
Figure 11. Reference value properties
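The wizard's column-mapping and row-skipping behavior can be approximated in a few lines. The file layout, column names, and property names below are assumptions for illustration, not RDM's actual import format:

```python
import csv
import io
from datetime import datetime

# Hypothetical CSV content and a column-to-property mapping, mirroring
# the wizard's Map File Columns step.
CSV_DATA = """state_code,state_name,effective
TX,Texas,2012-01-01
CA,California,not-a-date
"""
COLUMN_MAP = {"state_code": "code",
              "state_name": "name",
              "effective": "effective_date"}

def import_rows(text, column_map, date_fmt="%Y-%m-%d"):
    """Import CSV rows, skipping any row whose date fails to parse."""
    imported, skipped = [], 0
    for row in csv.DictReader(io.StringIO(text)):
        value = {prop: row[col] for col, prop in column_map.items()}
        try:
            # Like the wizard, skip the whole row on a bad date format.
            datetime.strptime(value["effective_date"], date_fmt)
        except ValueError:
            skipped += 1
            continue
        imported.append(value)
    return imported, skipped

values, skipped = import_rows(CSV_DATA, COLUMN_MAP)
# One row (TX) imports cleanly; the CA row is skipped for its bad date.
```

This mirrors the import behavior described above: a single invalid column value causes the entire row to be skipped, and the summary reports how many items were imported.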
Similarly, create another set called CRMStates and either manually create or import code values into it.
Now, you will use the reference data mapping functionality to create a mapping between the values in the MDMStates and CRMStates sets.
- Navigate to the Mappings tab and create a new mapping by clicking New, as shown in Figure 12.
Figure 12. Create mapping
- Specify source and target sets as the ones created previously (MDMStates, CRMStates), and other properties as indicated in Figure 12.
- Click OK and a new mapping set will be created and opened for editing.
- Start creating value mappings by clicking New in the Value Mappings frame, as shown in Figure 13.
Figure 13. Create value mapping
- Select values from source and target sets, and click Create Mapping.
- Specify any properties on the Create Value Mapping page and click OK, as shown in Figure 14.
Figure 14. Values from source and target
Once all the value mappings have been defined, as shown in Figure 15, the mapping set contains the knowledge necessary to transcode reference data values from source to target systems.
Figure 15. Value mappings defined
As the next step in the transcoding process, a steward or user can export the mapping created previously and use it to transform values between the source and target in the information integration pipeline, ensuring semantic equivalence and consistency. The RDM web interface provides the capability to export a reference data set or a mapping to a file (character-delimited or XML). To do this, within the mappings view, right-click the mapping created previously to bring up a context menu and select Export to either an XML or a CSV file. For applications requiring a programmatic way to export mappings, RDM provides a simple-to-understand web services API that applications can invoke. RDM also provides a batch exporter that can be invoked directly from the command line. Finally, functionality is provided whereby systems can subscribe to the reference data sets they are interested in and have changes to those sets published directly.
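The file-based export can be sketched as writing the mapping set's value mappings to character-delimited text. The column layout here is an illustrative assumption, not RDM's actual export format:

```python
import csv
import io

def export_mapping_csv(mapping_name, value_mappings):
    """Write a mapping set's value mappings as character-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["mapping", "source_code", "target_code"])
    for source, target in sorted(value_mappings.items()):
        writer.writerow([mapping_name, source, target])
    return buf.getvalue()

text = export_mapping_csv("MDMStates_to_CRMStates",
                          {"TX": "Texas", "CA": "California"})
# A downstream integration job can load this file to transcode values
# between the source and target systems.
```

A file like this is exactly what a transformation step in the integration pipeline would consume to perform the source-to-target lookup.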
This article introduced key reference data concepts, the underlying model, and a high-level architecture, and walked through a simple transcoding scenario involving the creation of reference data sets and mappings.
We would like to thank Erik O'Neill for his review, feedback and perspectives during the development of this article.
Sample for this article: Sample.zip (3 KB, HTTP)
- Visit the InfoSphere MDM Reference Data Management Hub Information Center.
- Read the "InfoSphere MDM V10.0 RDM Hub" release announcement.
- Read the "Metadata Exploitation in large-scale Data Migration Projects" article.
- Read the "An introduction to the Master Data Management Reference Architecture" developerWorks article.
- Read the "Ontology-guided Reference Data Alignment in Information Integration" article.
- Learn more about IBM data management in Going with the flow.
- Learn more about InfoSphere MDM Reference Data Management Hub V10.
- Visit the developerWorks Information Management zone to find more resources for DB2 developers and administrators.
Fenglian Xu is a lead developer on the IBM Reference Data Management development team at the Hursley lab in the UK. With 15 years of IT industry experience, she has contributed to the portfolio of IBM middleware products, including WebSphere Service Registry and Repository, WebSphere Enterprise Service Bus, and WebSphere Process Server. Her expertise includes IBM middleware product integration in service-oriented architectures, J2EE, and web services. She is also an IBM developerWorks Contributing Author and has over 20 publications. She earned a B.S. in mathematics from Xian Jiaotong University in 1989 and a Ph.D. in computer science from the University of Southampton in 1998. You can contact Fenglian at firstname.lastname@example.org.
Sushain Pandit is an IBM Master Inventor and Staff Engineer at IBM's Austin lab, with experience contributing to industry-leading enterprise information management solutions and products. Currently, he works in Master Data Management portfolio research and development, contributing to the development, architecture, customer engagement, and adoption of the reference data management product. He has filed over 15 US patent applications, has co-authored five published papers, and holds a master's degree (through research) in computer science from Iowa State University. His interests broadly span structured and unstructured data, information extraction and integration, the semantic web, and machine learning.
Dan Mandelstein is an IBM Certified IT Architect with more than 20 years in the IT industry. Currently, he works in the Master Data Management portfolio as the Architect and Development Manager for IBM's reference data management product. He has a Bachelor of Science degree in computer science from the University of Texas and a master's degree in computer science from the University of Texas at Dallas.