InfoSphere Information Analyzer is a tool that allows for profiling and discovery of quality related aspects of existing data stores. It also helps users perform gap analysis between source and target systems.
InfoSphere Information Analyzer’s primary benefits are to:
- Create a greater understanding of data source structure, content and quality
- Measure and report data quality throughout the project life cycle
- Reduce the risk and uncertainty of proliferating bad data throughout the enterprise
Previous articles in this series have explained the importance of analyzing data quality during SOA service design and approaches to doing so. The objective is to assess if service implementations will meet the required service levels regarding data quality and if additional data transformations or data cleansing operations are required. This assessment is necessary regardless of the service type or implementation chosen. Consider an example where a service aggregates information from multiple operational systems.
Regardless of what integration approach is chosen – SOA or data integration – you need to understand the data being integrated. Do you have keys or identifiers that can be used for data matching across the participating systems? If so, are there missing or duplicate records to resolve? InfoSphere Information Analyzer quantifies the characteristics of the data sets in question, enabling informed decisions about how to integrate them most efficiently and effectively to produce complete and accurate results.
Most commonly, these stores are relational databases, although InfoSphere Information Analyzer also supports analysis of XML, flat files, and other structured file types.
The InfoSphere Information Analyzer leverages the unified metadata repository of IBM Information Server. Because the repository is shared across all products, when data profiling occurs using InfoSphere Information Analyzer, the table definitions and profiling information (for example, primary key and foreign key instances, constraint violations, annotations, and the like) become available to an Information Server user in the DataStage and QualityStage Designer as Figure 1 shows.
Figure 1. Metadata access to analysis details in InfoSphere DataStage
The unified metadata management platform of IBM Information Server also provides a number of services for performance and scalability, which InfoSphere Information Analyzer utilizes. This provides, for instance, common scheduling services that the InfoSphere Information Analyzer user can use to determine the timing of profile executions.
NOTE: This article does not cover details of basic product usage, but focuses on the type of information produced and available in InfoSphere Information Analyzer to support the data quality analysis pattern in SOA. Information on product usage is available in the user manual as well as the IBM Redbook: IBM InfoSphere Information Analyzer & Data Quality Assessment (see Resources).
As Part 6 notes, the data quality analysis must itself be treated as a project within the broader confines of the SOA project. Establishing a specific scope for the analysis allows focus on those data sources most critical to the project’s success (for example, data for services which have legal or financial risk, or expensive error recovery consequences). The use of a tool such as InfoSphere Information Analyzer does not remove the need for an effective process, including appropriate framing of objectives. Resources and time in which to perform data quality analysis are limited, and the ultimate goal is the delivery of the SOA services with data that supports the SOA functionality.
InfoSphere Information Analyzer, and the underlying unified metadata architecture, provides frameworks to organize work within a larger context and to persist the knowledge gained over time.
- Shared Metadata: The InfoSphere Information Analyzer incorporates and shares information within the Information Server’s metadata repository about data sources including connectivity, schema, table, and column definitions. This provides for a single comprehensive location of this metadata. Figure 2 highlights the broad set of metadata available for use within the repository. This can potentially encompass thousands of schemas and tables and millions of columns from differing databases and files.
Figure 2. Shared metadata in IBM Information Server
- Project-based structure: The InfoSphere Information Analyzer uses a project-based approach to analysis, allowing users to segregate their work according to their needs, whether based on business unit, SOA or data integration project, or any other desired configuration. The project structure allows configuration of users based on role and the incorporation of specific data sources for analysis. Security and settings are also configurable within the project, allowing access level controls as noted above. Such segregation is important in a project context such as an SOA effort as it allows scoping down to the data sources that are important for analysis. This helps the analyst focus on what’s important, not on trying to sort through the array of potential data sources. As information is discovered, or new sources need to be incorporated, the project framework allows for easy addition.
- Review Status: InfoSphere Information Analyzer allows tracking of analytical review at each level of analysis, rolling up to the project level. This approach allows insight into whether the defined scope of work has been completed or if items remain outstanding as seen in Figure 3.
Figure 3. Project review status in InfoSphere Information Analyzer
- Graphical enablement: InfoSphere Information Analyzer incorporates both textual as well as graphical representations of data. This allows the user to view information in multiple ways, quickly identifying issues or anomalies, and annotating the findings accordingly.
- Annotations: The use of notes throughout InfoSphere Information Analyzer allows you to capture specific details and findings, expanding the knowledge base for any given column or table. Use of a note status flag provides a method to track and report (as in Figure 4) on the review of issues.
Figure 4. Tracking status of notes in InfoSphere Information Analyzer
Within InfoSphere Information Analyzer, the project lead can establish the right context or focus for the SOA data quality assessment with the right data, the right users with appropriate roles, and a framework for managing the analytical cycle. Process and people remain critical to success, though.
Data profiling is not magic: it helps bring data quality issues to light, but you still need a data or business analyst or subject matter expert to review results and draw appropriate conclusions. These analysts must understand the scope and objectives of the effort. If they don’t know what the SOA project is trying to achieve, they may not identify the true issues and anomalies that will impact the project. The analysts must be clear on what they are trying to achieve and the deliverables they need to produce. They need an approach for making consistent and standard annotations on results, so that others working on the project understand what the annotations mean. Analysts must also understand that running a profiling job is not the end of the process; the analysis is critical. Some analysts have the mindset that once the InfoSphere Information Analyzer processes have run, data profiling and data quality assessment are done. Data profiling provides insight and inferences into a lot of information quickly, but it still remains for the analyst to weed out the problems and determine the next steps.
SOA solutions expose data through services. Service implementations may vary from reuse of application APIs, to encapsulated data queries, to custom written code, but in all cases, the service designer must be sure that the data being exposed meets the expectations of the service consumers. When performing a data quality assessment on any existing data source, you are trying to understand the structure of the data (for example, its data types and relationships) as well as trying to understand how well the existing data population conforms to these structures.
Almost all data sets contain irregularities and anomalies. For example, if 99% of a column’s data is integer and in the range of 25 to 30, then anything that lies outside this range may be suspect. So you are looking to identify statistical anomalies, as well as physical structure information.
You are also looking for embedded business rules or logic that might tell applications accessing this data to allow for some form of exception processing for a business term. For example, if you look at a certain field and find a number like -9999 in it, it is entirely possible that this value is used as an indicator to a program to look elsewhere for the data that should have resided in that field. Most initial assessments find that close to 15% of legacy data has some form of embedded logic or exception handling in the data.
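The underlying check for such sentinel values is a frequency skew test: any single value occupying an outsized share of a column's distribution deserves review. A minimal sketch in Python (the function name and threshold are illustrative assumptions, not part of InfoSphere Information Analyzer):

```python
from collections import Counter

def find_suspect_defaults(values, threshold=0.05):
    """Flag values whose share of a column is suspiciously high;
    a sentinel such as -9999 typically dominates the distribution."""
    counts = Counter(values)
    total = len(values)
    return {v: n / total for v, n in counts.items() if n / total >= threshold}

# A column where -9999 stands in for "look elsewhere for this value"
ages = [27, 28, 29, 26, 30] * 3 + [-9999] * 5
suspects = find_suspect_defaults(ages, threshold=0.2)
```

Any value returned here would be annotated and investigated with a subject matter expert before being treated as legitimate data.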
In short, you are determining the state of the data “as it is” today and quantifying how well it meets the data quality requirements for the service that will expose it in your SOA solution. You are also looking at the effort and complexity of recasting or standardizing, cleansing, and enriching the data where it currently falls short of those requirements.
The process of performing data analysis in the context of a set of SOA requirements is very similar to any data quality initiative. The process and use of InfoSphere Information Analyzer is the same. The resulting analysis can be used to reduce project risk and cost as well as to provide more accurate scoping of the project.
Typically, as noted in Part 7 of this series, there are three phases to consider when doing a full data assessment. Table 1 represents the steps of analysis performed in each phase.
Table 1. Steps of data analysis
| Source system analysis | Target analysis | Alignment and harmonization analysis |
|---|---|---|
Source system analysis can be addressed with the techniques available in InfoSphere Information Analyzer and is the primary focus of this article. Aspects of target analysis can be identified through InfoSphere Information Analyzer and are also discussed. InfoSphere Information Analyzer can also support review of outputs from alignment, standardization, and matching techniques, but these are not addressed in this article. The following sections describe in more detail the techniques available to facilitate analysis in the first two phases.
Why do you want to do a source system analysis? Few SOA solutions are developed as completely new information systems; almost always they are built to leverage existing applications and data. By developing reusable services that expose the capabilities of existing systems in a carefully controlled way, a business can evolve its IT capabilities to be more responsive to change. Only by understanding the nature of the source data being exposed can the SOA service designer meet this goal. For each data source, you can perform the following types of analysis. The results provide insight into the value and integrity of the existing data that will support the SOA solution.
The following techniques may be applied to assess Domain-level Technical Dimensions:
- Frequency distribution
- Domain integrity (or Column Content Analysis)
- Structural Analysis (both domain structure and key structure)
- Relationship Analysis (or Cross Domain Analysis)
- Business rule assessment
- Metadata integrity (or business/technical documentation)
A frequency distribution is automatically generated by InfoSphere Information Analyzer as a core artifact of executing a Column Analysis function. The frequency distribution is stored in the metadata repository as a distinct set of values with their associated count of occurrence. On screen, this is either a graphical or numerical picture of the distribution of values within a column or domain. The frequency distribution is foundational to subsequent steps of analysis. The screenshot in Figure 5 below shows the graphical representation of a frequency distribution, in this case a highly skewed set dominated by a pair of values.
Figure 5. Frequency Distribution Graphical View Screen
The frequency distribution is a statistical count. Inferences are made by InfoSphere Information Analyzer based on the distribution and characteristics of the data values. These inferences include identification of a basic data classification and the evaluation of data type, length, precision and scale for numeric values, uniqueness, and identification of basic format patterns. The analyst works with these inferences as well as the basic frequency distribution to assess the integrity of the domain or column and report findings.
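The flavor of these inferences can be illustrated with a short Python sketch that derives a data type, maximum length, and uniqueness from a column's values (the names and logic here are illustrative assumptions; the product's actual inference engine is considerably more sophisticated):

```python
def infer_properties(values):
    """Derive a data type, length, and uniqueness from string values,
    mimicking profiler-style inference over a frequency distribution."""
    distinct = set(values)
    if all(v.lstrip('-').isdigit() for v in values):
        inferred_type = 'INTEGER'   # every value parses as a whole number
    else:
        inferred_type = 'STRING'
    return {
        'type': inferred_type,
        'max_length': max(len(v) for v in values),
        'cardinality': len(distinct),
        'unique': len(distinct) == len(values),
    }

props = infer_properties(['1001', '1002', '1003', '1003'])
```

The analyst then accepts or overrides such inferences based on business knowledge, as described above.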
While an analyst can work directly on a column-by-column basis through a given data source, InfoSphere Information Analyzer produces a data classification to help organize information for more efficient work, especially when dealing with less known data sources. Figure 6, which shows the Column Analysis summary screen, highlights columns sorted by their inferred data classification to organize data review.
Figure 6. Column Analysis summary: Data classifications
InfoSphere Information Analyzer uses seven primary data classifications as follows:
- Identifiers—a highly unique set of distinct values with low frequency per value based on set threshold
- Indicators—a set of binary values (for example, 0/1, True/False, Yes/No)
- Codes—a distinct set of finite values (generally of limited data length and occurring with some frequency)
- Quantifiers—a set of numeric values representing items such as quantities, prices, salaries, and so on
- Dates—a set of largely date formatted values
- Text—a set of alphanumeric string values that do not fit other classifications
- Unknown—columns with insufficient data to otherwise classify
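To make the classification idea concrete, the following Python sketch applies simple heuristics over a column's values. The thresholds, names, and rules here are illustrative assumptions only, not the product's actual classification logic:

```python
from datetime import datetime

def classify_column(values, unique_ratio=0.95, code_cardinality=50):
    """Heuristic version of the seven data classifications described above."""
    non_null = [v for v in values if v not in (None, '')]
    if not non_null:
        return 'Unknown'                       # nothing to classify
    distinct = set(non_null)
    if len(distinct) <= 2 and distinct <= {'0', '1', 'Y', 'N', 'T', 'F'}:
        return 'Indicator'                     # binary flag values

    def is_date(v):
        try:
            datetime.strptime(v, '%Y-%m-%d')   # assumed date layout
            return True
        except ValueError:
            return False

    if all(is_date(v) for v in distinct):
        return 'Date'
    if all(v.replace('.', '', 1).lstrip('-').isdigit() for v in distinct):
        # Highly unique numerics look like identifiers; the rest quantify
        if len(distinct) / len(non_null) >= unique_ratio:
            return 'Identifier'
        return 'Quantifier'
    return 'Code' if len(distinct) <= code_cardinality else 'Text'
```

A real profiler works against full frequency distributions and configurable thresholds, but the segregation principle is the same.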
Using these segregations of data, you can focus on key elements of basic data analysis for each data class. For example, data classified as "unknown" usually indicates a column that is null or empty and probably not used. In most cases, these columns can be skipped, as the empty state indicates either a lack of importance or an absence of integrity checks to enforce defaulting. However, it may be that such a column was identified for use in the service through the canonical data model. In that case, a gap exists and the analyst must record and report the issue for further investigation. This can be accomplished through the Notes feature of InfoSphere Information Analyzer, by including Notes on generated reports, or through shared metadata.
The following are core focal points and expectations of Source System Analysis based on specific data classification through InfoSphere Information Analyzer.
Table 2. Domain Analysis by data classification
| Data class | Complete | Valid | Consistent | Precise & understood | Unique |
|---|---|---|---|---|---|
| Identifier | Always populated | Within valid range, no defaulted values | Constant format (all values have same format) | Definition available | Always unique |
| Indicator | Two values, always populated, no nulls | No extraneous or invalid values; skewed distributions accounted for | Constant format | Short, but clear values; definition available | Not applicable |
| Code | Finite values, no defaults | Within valid reference set; skewed distributions accounted for | Standard representation (only one code for each business value) | Definitions of domain and codes available | Not applicable |
| Quantifier | No defaults | Within valid range; skewed distributions accounted for | Standard data type | Definition available | Not applicable |
| Date | Always populated, no defaults or nulls | Within valid range; skewed distributions accounted for | Standard data type | Definition available | Not applicable |
| Text | Always populated, no defaults or nulls | Evaluate for embedded business logic or invalid values | No unexpected formats; standard format preferred; no mixed domains | Definition available | Generally unique |
The approach to analysis within InfoSphere Information Analyzer varies based on the needs of the SOA project. Note that discoveries here feed into the Gap Analysis (where data is missing or inconsistent from expected sources) and into the Alignment Analysis (where data lacks standardized or consistent formats). The goal is to answer and document the core questions assessing the technical dimensions at the domain level that are relevant to the initiative. Project constraints in resources and time invariably force a focus down to primary data elements and domains.
The Domain Analysis is a detailed output showing the content of any particular column in a source store. It can show whether values are missing or outlying. Figure 3 can be used to quickly assess both completeness and validity by looking at the distinct values.
- Is there missing data? This may be in the form of Null values or other blank value conditions (for example, spaces). Null values are not expected for most fields, but edits may allow entry of such either indicating unknown conditions at time of data entry (for example, social security number is not known or consistently collected) or a lack of processing conditions (for example, a create date exists in the table but is not populated).
- Is there defaulted data? This may be in the form of specific set values (for example, all 9’s or ‘Do Not Use’) or blank values (where a blank represents a keystroke distinct from a null value). These may significantly compromise the ability to link data appropriately for complex or composite services. For example, the social security number may be used as a trigger for internal accounting conditions such as a bankruptcy state where all accounts in such state have the same default value. As another example, the date of birth may be defaulted to an actual legitimate date such as January 1 of the current year. In both cases, high frequency or skewed data will highlight such usage.
- Is the data constant? This indicates that most or all of the data in a column is the same. It may indicate that a homogeneous population exists but allows for future system growth, or it may reflect that the element is largely a default condition.
- Is the data unique? This indicates that all or most of the data has a single occurrence and is unique. A unique domain is most commonly an identifier, which is a potential key to the data, but also includes text-based descriptions.
- Is the data skewed? Where the frequency distribution shows data values decreasing in occurrence from some common condition to a rare condition, a skew exists. Such skews may be normal (as in a range of product prices), may indicate normalcy within the specific data (as in female patients at an OBGYN office), may indicate outlier conditions (as in a rare disease code), or may represent issues of validity.
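Several of these questions reduce to simple computations over the frequency distribution. The following Python sketch answers the missing, constant, and unique questions for one column (the function name and the constant-ratio threshold are illustrative assumptions):

```python
from collections import Counter

def domain_checks(values, constant_ratio=0.95):
    """Compute basic completeness indicators for a single column."""
    total = len(values)
    # Treat both nulls and blank strings as missing
    non_null = [v for v in values if v is not None and str(v).strip() != '']
    counts = Counter(non_null)
    top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        'missing': total - len(non_null),
        'constant': total > 0 and top_count / total >= constant_ratio,
        'unique': len(counts) == len(non_null),
        'most_frequent': top_value,
    }

checks = domain_checks(['A', 'A', 'A', None, 'B'])
```

Interpreting the results, such as deciding whether a skew is normal or a validity issue, remains the analyst's job, as noted above.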
A review of the values is only a first step in the analysis process. Definitions, annotations, and decisions need to be captured and recorded for subsequent use, whether in additional analysis steps, in reporting to key users, in putting together the actual SOA processes, or in building out a long-term knowledge base.
In InfoSphere Information Analyzer, definitions can be added directly on the Frequency Distribution text screen as shown in Figure 7 below. Think about the values present and to what degree someone looking at those values will understand them. In the example below, without context or definition a ‘C’ could stand for ‘Complete’ (the correct definition), ‘Created’, or ‘Cancelled’. For a gender field, consider which set of values is readily understood: M/F or 0/1? In the latter case, additional context, which may not be readily present, is required to make any assertion about whether ‘0’ equals male or female. The addition of definitions by the analyst facilitates this knowledge sharing and expands the level of understanding for that domain.
Figure 7. Data value definitions
The Domain and Completeness screen allows you to make specific decisions around the completeness and validity of the values, whether by specific value, by range, or via a reference table. The data classification may help facilitate how to assess the domain. Code values are likely to exist in a reference table. Quantifiers and dates are likely to have a valid range. Indicator or flag values are likely to be evaluated simply as valid or invalid. Here the analyst works with subject matter experts to review the values, the available reference material, and adds the appropriate decision.
Figure 8 highlights this assessment which can also be generated as a report for business analysts and subject matter experts to review or incorporated directly into reference tables for SOA developers to utilize.
Figure 8. Data value completeness and validity
Throughout the review process, you may add notes in InfoSphere Information Analyzer for any column or table. These notes may ask or answer questions, record decisions, and track issue status or progress in working through the data. Figure 9 highlights the entry of an annotation for a given column.
Figure 9. Note entry
Over time, the knowledge base for a given source is enriched by these annotations. The notes not only provide specific information for columns or tables, but can also be utilized to track progress or status through a given source. Figure 4 above showed a report example tracking notes by note status (for example, open, pending, closed).
Questions around the structural consistency and, to some extent, the precision of the data are answered by reviewing the definitions and inferences of the column’s data properties. Most commonly, this is seen in the existence of mixed data types or mixed formats. Variations in length for string data or in precision and scale for numeric data also point to potential issues in data presentation, validation, and usage.
The Properties view within InfoSphere Information Analyzer’s Column Analysis details is a fairly complex output (Figure 10) where you can see the data type, physical structure/representation, nullability, and the like both as they are defined and as they are inferred based on actual data content. By selecting and recording the most appropriate data type, length, precision, or scale, the analyst can provide important information to the designers and developers in an SOA effort, potentially feeding back in this case to the canonical data model.
Figure 10. Data Structure and Properties Analysis
InfoSphere Information Analyzer automatically evaluates all individual columns for uniqueness, null values, and duplicates as a primary key, removing any requirement for additional processing for single columns after the column analysis. Users can extend the analysis to multiple columns, taking advantage of the built-in virtual column support. Through this capability, analysts can evaluate a broad range of column combinations against data samples, or specific targeted combinations against the full data volume. Results are readily displayed and allow the user to quickly understand duplicated values or drill back into the full frequency distribution for single or multiple columns, as Figure 11 shows.
Figure 11. Primary Key Analysis with the duplicate check results view
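The essence of a multi-column duplicate check can be sketched in a few lines of Python (illustrative only; the product performs this at scale against the repository rather than in-memory rows):

```python
from collections import Counter

def duplicate_key_values(rows, key_columns):
    """Return candidate-key values that occur more than once;
    works for single- or multi-column keys."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

rows = [
    {'ssn': '111-22-3333', 'dob': '1980-05-01'},
    {'ssn': '111-22-3333', 'dob': '1975-09-12'},
    {'ssn': '111-22-3333', 'dob': '1980-05-01'},  # duplicate of the first row
]
dupes = duplicate_key_values(rows, ['ssn', 'dob'])
```

An empty result means the tested combination is a viable primary key candidate for the data examined.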
In Foreign Key Analysis, InfoSphere Information Analyzer supports testing and reviewing of primary-to-foreign key relationships not only between tables of a single data source, but across multiple data sources to help address the challenges of data integration efforts. Users can draw from either defined metadata or inferences out of earlier analysis steps. Further, such analysis is cumulative, allowing the user to focus on items of interest at a specific point in time and then extending that analysis as needed to assess additional key relationships. Details from the Foreign Key Analysis allow the user to understand the overlap of actual values and quickly isolate discrepancies.
InfoSphere Information Analyzer includes an explicit test of referential integrity as part of the Foreign Key Analysis. Violations between primary keys and foreign keys, such as orphaned values or parent records without child records can be viewed as shown in Figure 12.
Figure 12. Referential Integrity Analysis
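Conceptually, the referential integrity test reduces to set comparisons between parent and child key values, as in this illustrative Python sketch (names are assumptions, not the product's API):

```python
def referential_integrity(parent_keys, child_foreign_keys):
    """Find orphaned foreign-key values and parents without children,
    the two violation types discussed above."""
    parents, children = set(parent_keys), set(child_foreign_keys)
    return {
        'orphans': sorted(children - parents),            # no matching parent
        'childless_parents': sorted(parents - children),  # no child records
    }

ri = referential_integrity(parent_keys=[1, 2, 3], child_foreign_keys=[2, 2, 4])
```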
This analysis determines where existing data sources have broken down and contain embedded logic in the data itself. Take the case where a column frequently contains a value of -9999, which indicates to programs that consume this data that alternate processing should be performed. This is an example of embedded business logic contained in the column, and several techniques exist directly in InfoSphere Information Analyzer to assess such logic.
First, a review of existing data formats (or the actual domain values) through the Column Analysis detail format screen may reveal unusual conditions such as the value -9999 noted above. For text data such as name, address, and product descriptions, a review of the varying data formats is critical to identify fields containing multiple domains. Where multiple domains exist, these fields should be annotated as requiring further standardization, part of the Alignment and Harmonization Analysis. Figure 13 highlights the presence of both embedded "Care of" information in an address and possible processing logic in the form of an ‘*’.
Figure 13. Data Format Analysis
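A common profiling convention behind such format analysis is to reduce each value to a mask, with letters mapped to 'A' and digits to '9', so that unusual patterns like a lone '*' stand out immediately. A minimal Python sketch (the exact mask alphabet is an assumption):

```python
def format_mask(value):
    """Reduce a value to a format pattern: letters become 'A',
    digits become '9', and other characters pass through."""
    return ''.join('A' if c.isalpha() else '9' if c.isdigit() else c
                   for c in value)

# Distinct masks found in a column of postal codes with stray entries
masks = sorted({format_mask(v) for v in ['60601', '6061A', '*', 'C/O SMITH']})
```

Reviewing the distinct masks rather than millions of raw values is what makes format review tractable for large columns.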
Second, InfoSphere Information Analyzer allows for the concatenation of multiple columns together into Virtual Columns. Users can construct virtual columns from any combination of columns within a table and then analyze those virtual columns in turn. Users can assess the pairing of, for example, state with zip code to look for unusual formatting or even compare several virtual columns for common domain values.
Analysis for these virtual columns occurs as for any other column, allowing insight into the domain values, formats, and properties. Key insights may include: incomplete combinations (missing data/null data where a value is expected), invalid combinations (such as an incompatible pairing of state code and zip code), or potentially duplicated data (multiple occurrences of a social security number and date of birth combination).
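Conceptually, a virtual column is just a concatenation that is then profiled like any single column, as this illustrative Python sketch shows (names and separator are assumptions):

```python
def virtual_column(rows, columns, sep='|'):
    """Concatenate several columns into one value per row so the
    combination can be profiled like any single column."""
    return [sep.join(str(row[c]) for c in columns) for row in rows]

rows = [
    {'state': 'IL', 'zip': '60601'},
    {'state': 'IL', 'zip': '90210'},  # a state/zip pairing worth questioning
]
pairs = virtual_column(rows, ['state', 'zip'])
```

The resulting values would then feed the same frequency distribution and domain checks described earlier.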
Third, by setting up different segments of data within InfoSphere Information Analyzer through the use of an Analysis Where clause option for a table, you can look specifically at the set of column values for distinct data ranges. This can be done based on specific values such as source system or account type, on date ranges, or on other query-based conditions. Once an Analysis Where clause is established, Column Analysis proceeds as normal from the actual profile execution through the data review. In this case, specific business logic may be highlighted based on the data segment selected.
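The effect of an Analysis Where clause can be pictured as a row filter applied before profiling, as in this illustrative Python sketch (the function and field names are assumptions):

```python
def analysis_segment(rows, predicate):
    """Keep only the rows in the segment of interest, so that
    subsequent profiling sees just that slice of the data."""
    return [row for row in rows if predicate(row)]

rows = [
    {'acct_type': 'LOAN', 'balance': -9999},
    {'acct_type': 'LOAN', 'balance': 1200},
    {'acct_type': 'SAVINGS', 'balance': 500},
]
loans = analysis_segment(rows, lambda r: r['acct_type'] == 'LOAN')
```

Profiling the `loans` segment alone could reveal, for example, that the -9999 sentinel appears only for a particular account type.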
This process step helps document the current data source(s) with all of the above analysis. It may also include enriching the metadata from InfoSphere Information Analyzer in the unified metadata repository, to be shared with InfoSphere Business Glossary or InfoSphere DataStage/QualityStage jobs. This documentation can be used in requirements specifications, functional specifications, and other information services documentation, as well as to document open issues.
Through InfoSphere Information Analyzer, each source column or table can have an alias, a definition, and business terms added, as seen in Figure 14. This process can contribute to subsequent Attribute or Data Gap Analyses in mapping data sources against the target model through the defined terms.
Figure 14. Adding business definitions to technical metadata
The results of running this source system analysis are outlined below:
- A complete set of reports covering all aspects of each source system, giving business and technical communities an initial overview of the source data
- As results are analyzed, documentation can be captured that can be used in many contexts
- Mapping rule specifications that will feed the functional and technical specifications
- Documentation of follow-up items and issues identified that need to be addressed by the business
- Information that can be used in future projects and other areas of the organization
- Project scoping and initial risk assessment of data conversion effort
- Data extraction accelerators can be re-used for baseline assessment and for development
There are several possibilities as to what the “target system” is for the data analyst working in an SOA project:
- For service design, the target is the canonical message model representing the superset of data structures that will be exposed by all services created in the SOA project. For this case, no data is stored in the target format.
- If an operational data store (ODS) or a master data management (MDM) system is to be created to support the SOA solution, then the target is the physical data model for this data store. In some cases, the data analysis will include not only validation of the structures being used, but also an investigation into existing data that populates the target system.
- If services are being created to feed data into downstream systems, the target system is the consumer of the service. Once again, in this case, the data analyst may have to investigate not only the structure of the target, but also any data already resident there to ensure the compatibility of the incoming data.
For the target system, you can perform the following tasks.
- Attribute Gap Analysis
- Data Gap Analysis
- Field Length Analysis
- Data migration scoping
This is an analysis of whether you can adequately define target entities and attributes, and whether those attributes can be mapped from existing source systems. Where the target entity already exists, InfoSphere Information Analyzer’s Cross-Domain Analysis can be used to assess these attribute relations and identify potential gaps.
In Figure 15, entities between two sources are reviewed for overlap and potential mapping including possible discrepancies in values (673 distinct values in the paired column not in the base column) and field lengths (a precision of 8 digits vs. a precision of 9 digits).
Figure 15. Cross-domain consistency assessment
Such Attribute Gap Analysis and mapping can be facilitated further through the use of the IBM Information Server FastTrack product.
This analysis determines whether the target model can be fulfilled with existing source data and whether extensive transformations will have to be performed. Figure 16 shows one example of a data gap report: a sample summary analysis between existing sources. This report shows quantifiably where the existing systems overlap and to what extent in each direction. The report depicts, for example, that the Supplier column in the base ITM_MSTR table maps completely to the paired Vendno column in the ITM_SPLR table. Extended across multiple tables, this report can highlight the level of support between data sources.
Figure 16. Cross-Domain Analysis domains compared report
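The overlap measures behind such a report can be sketched as set arithmetic in Python (illustrative only; the names are not the product's API):

```python
def domain_overlap(base_values, paired_values):
    """Measure value overlap between two columns in each direction,
    mirroring a domains-compared report."""
    base, paired = set(base_values), set(paired_values)
    common = base & paired
    return {
        'base_in_paired': len(common) / len(base),
        'paired_in_base': len(common) / len(paired),
        'paired_only': len(paired - base),  # values with no base counterpart
    }

overlap = domain_overlap(base_values=[1, 2, 3, 4], paired_values=[3, 4, 5])
```

A `base_in_paired` ratio of 1.0 would correspond to the complete mapping described above; anything lower quantifies the gap.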
The Field Length Analysis determines whether current source systems can map correctly at the field length level of the target model. If not, then transformations will have to be performed in the mapping phase. This may also provide feedback to the data architects to adjust the canonical data model if necessary. This information can be derived from the report shown above in Figure 15.
This process step helps scope the magnitude of any required data migration effort. If the SOA solution creates an ODS or MDM system, there will likely be a data migration effort required to load the initial data into the target database. In other cases, service consumers may also require an initial data population or synchronization effort before the new solution is put into production. This scoping exercise is intended to put some perspective on how much effort may be required, whether you have all the source data that is needed, and what kinds of transformations may be necessary.
For any SOA solution to be a success, its services must deliver accurate data. This article has described how InfoSphere Information Analyzer can help determine through Source System and Target Analysis if planned service implementations achieve this goal of delivering trusted information from the existing data sources.
- IBM InfoSphere Information Analyzer and Data Quality Assessment: This IBM Redbooks® publication discusses how to implement IBM InfoSphere Information Analyzer and related technologies in a typical financial services business scenario.
- In the Information Integration area on developerWorks, get the resources you need to advance your skills on IBM's Information Platform & Solutions portfolio of products.
David McCarty is based at IBM's European Business Solution Center in La Gaude, France and has 20 years experience designing and developing IT systems with IBM customers. He is currently a member of the Information as a Service Competency Center developing techniques and best practices for leveraging data systems in SOA solutions.
Guenter Sauter is an architect in the Information Platform & Solutions segment within IBM's software group. He is driving architectural patterns and usage scenarios across IBM's master data management and information platform technologies. Until recently, he was the head of an architect team developing the architecture approach, patterns and best practices for Information as a Service. He is the technical co-lead for IBM's SOA Scenario on Information as a Service.
Harald Smith is a product manager for data profiling and analysis solutions in IBM's Information Management software group. He has been directly involved in Information Quality work for the last nine years, including work in consulting and product software development and support. With more than 25 years of working with information technology and solutions, Harald has extensive experience in system implementations, solution architecture, project methodology, data auditing and analysis, entity and object modeling, and business process re-engineering. He has worked in software, financial services, healthcare, and education sectors.
Peter joined IBM three years ago after almost 25 years at institutions like the US Dept. of Defense, GE Corporate and Morgan Stanley where he held technical leadership positions and gained valuable experience in Enterprise Architecture and Enterprise Data Integration. He initially joined IBM as a Sr. IT Architect as part of the architect team for Information as a Service. Currently he is a Solutions Marketing Manager for the IPS Global Services organization, specializing in MDM solutions.