IBM InfoSphere Foundation Tools, IBM InfoSphere Information Server, Version 8.5

Data profiling and analysis

IBM® InfoSphere® Information Analyzer provides extensive capabilities for profiling source data. The main data profiling functions are column analysis, primary key analysis, natural key analysis, foreign-key analysis, and cross-domain analysis.

Column analysis

Column analysis generates a full-frequency distribution and examines all values for a column to infer its definition and properties such as domain values, statistical measures, and minimum and maximum values. Each column of every source table is examined in detail. The following properties are observed and recorded:

Count of distinct values or cardinality
Count of empty values, null values, and non-null or empty values
Minimum, maximum, and average numeric values
Basic data types, including different date-time formats
Minimum, maximum, and average length
Precision and scale for numeric values

Figure 1. Column analysis example data view

InfoSphere Information Analyzer column analysis view

InfoSphere Information Analyzer also enables you to drill down on specific columns to define unique quality control measures for each column. Figure 1 shows results for a table named GlobalCo_Ord_Dtl. At the top is a summary analysis of the entire table. Beneath the summary is detail for each column that shows standard data profiling results, including data classification, cardinality, and properties. When you select a column, additional tasks that are relevant to that level of analysis become available.

Another function of column analysis is domain analysis. A domain is a valid set of values for an attribute. Domain analysis determines the data domain values for any data element. By using a frequency distribution, you can facilitate testing by providing a list of all the values in a column and the number of occurrences of each. Domain analysis checks whether a data element corresponds to a value in a database table or file. Figure 2 shows a frequency distribution chart that helps find anomalies in the QTYORD column.

Figure 2. Column analysis example graphical view

InfoSphere Information Analyzer frequency distribution for the Qtyord column

The bar chart shows data values on the y-axis and the frequency of those values on the x-axis. This detail points out default and invalid values based on specific selection, ranges, or reference sources, and aids you in iteratively building quality metrics.

When you validate free-form text, it can be difficult to analyze and understand the extent of the quality issues. InfoSphere Information Analyzer can show each data pattern of the text for a detailed quality investigation. It helps with the following tasks:

Uncovering trends, potential anomalies, metadata discrepancies, and undocumented business practices
Identifying invalid or default formats and their underlying values
Verifying the reliability of fields that are proposed as matching criteria for input to IBM InfoSphere QualityStage® and IBM InfoSphere DataStage®

Primary key analysis

Primary key analysis information allows you to validate the primary keys that are already defined in your data and identify columns that are candidates for primary keys.

Columns that have a high percentage of uniqueness, as well as at least one foreign key candidate, make the best primary key candidates. You can analyze and identify the columns that have the highest uniqueness percentage and assign one of the columns as the primary key. You might have only one primary key per table. For example, if you have a customer table with a customer ID that was auto-generated for identification purposes, and other tables in your data source link to that customer ID, the customer ID is a good candidate for a primary key. It is unique and was auto-generated for the purpose of being a primary unique identifier.

Primary key analysis presents all of the columns and the potential primary key candidates. A duplicate check validates the use of such keys. You select the primary key candidate based on its probability for uniqueness and your business knowledge of the data involved. If you select a multi-data column as the primary key, the system will develop a frequency distribution for the concatenated values.

Natural key analysis

Natural keys are meaningful values that identify records, such as Social Security numbers; that identify specific customers, calendar dates in a time dimension, or SKU numbers in a product dimension. A natural key is a column that has a logical relationship to other pieces of data within the record.

Natural key analysis uses the frequency distribution statistics that are gathered when you analyze columns to profile the uniqueness of distinct values in each column of a table. You can review the natural key analysis results and select a natural key from the list of all natural key candidates.

Identifying natural keys to classify your data is beneficial because it allows you the flexibility to have multiple natural keys that are associated with multiple foreign keys, rather than having one primary key that is associated with multiple foreign keys. After you select a natural key, you can automatically identify foreign keys by running a relationship analysis on the column. You can then view the number of foreign key candidates and the details associated with them.

Foreign-key analysis

Columns that have a high percentage of values identical to those contained in a primary or natural key make the best foreign keys. When viewing the foreign-key candidates for a particular primary or natural key, highlight the columns that have the highest paired-to-base percentage. A percentage of 100% indicates that every value within the paired column exists within the base column (primary key or natural key column).

A column qualifies as a foreign-key candidate if the majority (for example, 98% or higher) of its frequency distribution values match the frequency distribution values of a primary key column. After you select a foreign key, the system performs a bidirectional test (foreign key to primary key, primary key to foreign key) of each foreign key's referential integrity and identifies the number of referential integrity violations and "orphan" values (keys that do not match).

Cross-domain analysis

Cross-domain analysis examines content and relationships across tables. This analysis identifies overlaps in values between columns, and any redundancy of data within or between tables. For example, country or region codes might exist in two different customer tables and you want to maintain a consistent standard for these codes. Cross-domain analysis enables you to directly compare these code values.

InfoSphere Information Analyzer uses the results of column analysis for each set of columns that you want to compare. The existence of a common domain might indicate a relationship between tables or the presence of redundant fields. Cross-domain analysis can compare any number of domains within or across sources.

This topic is also in the IBM InfoSphere Information Server Introduction.

Update timestamp

Last updated: 2012-9-20