Data profiling and analysis

WebSphere^® Information Analyzer provides extensive capabilities for profiling source data. The four main data profiling functions are column analysis, primary key analysis, foreign key analysis, and cross-domain analysis.

Column analysis

Column analysis generates a full frequency distribution and examines all values for a column to infer its definition and properties such as domain values, statistical measures, and minimum and maximum values. Each column of every source table is examined in detail. The following properties are observed and recorded:

Count of distinct values or cardinality
Count of empty values, null values, and non-null or empty values
Minimum, maximum, and average numeric values
Basic data types, including different date-time formats
Minimum, maximum, and average length
Precision and scale for numeric values

Figure 1. Column analysis example data view

WebSphere Information Analyzer column analysis view

WebSphere Information Analyzer also enables you to drill down on specific columns to define unique quality control measures for each column. Figure 1 shows a closer look at results for a table named GlobalCo_Ord_Dtl. At the top is a summary analysis of the entire table. Beneath the summary is detail for each column that shows standard data profiling results, including data classification, cardinality, and properties. When you select a column, additional tasks that are relevant to that level of analysis become available.

Another function of column analysis is domain analysis. A domain is a valid set of values for an attribute. Domain analysis determines the data domain values for any data element. By using a frequency distribution, you can facilitate testing by providing a list of all the values in a column and the number of occurrences of each. Domain analysis checks whether a data element corresponds to a value in a database table or file. Figure 2 shows a frequency distribution chart that helps find anomalies in the Qtyord column.

Figure 2. Column analysis example graphical view

WebSphere Information Analyzer frequency distribution for the Qtyord column

The bar chart shows data values on the y-axis and the frequency of those values on the x-axis. This detail points out default and invalid values based on specific selection, ranges, or reference sources, and aids you in iteratively building quality metrics.

When you are validating free-form text, analyzing and understanding the extent of the quality issues is often very difficult. WebSphere Information Analyzer can show each data pattern of the text for a much more detailed quality investigation. It helps with the following tasks:

Uncovering trends, potential anomalies, metadata discrepancies, and undocumented business practices
Identifying invalid or default formats and their underlying values
Verifying the reliability of fields that are proposed as matching criteria for input to WebSphere QualityStage and WebSphere DataStage™

Primary key analysis

The primary key of a relational table is a unique identifier that a database uses to access a specific row. Primary key analysis identifies all candidate keys for one or more tables and helps you test a column or combination of columns to determine if it is a candidate for becoming the primary key. Figure 3 shows a single-column analysis.

Figure 3. Primary key analysis

Primary key analysis for a single column

The analysis presents all of the columns and the potential primary key candidates. A duplicate check validates the use of such keys. You select the primary key candidate based on its probability for uniqueness and your business knowledge of the data involved. If you select a multi-data column as the primary key, the system will develop a frequency distribution for the concatenated values.

Foreign key analysis

Foreign key analysis examines content and relationships across tables. This analysis helps identify foreign keys, check their integrity, and check the referential integrity between the primary key and foreign keys. For example, in a Bill of Materials structure, the parent-child relationships among assemblies and subassemblies would require you to identify relationships between foreign keys and primary keys and validate their referential integrity.

A column qualifies to be a foreign key candidate if the majority (for example, 98 percent or higher) of its frequency distribution values match the frequency distribution values of a primary key column. As Figure 4 shows, after you select a foreign key, the system performs a bidirectional test (foreign key to primary key, primary key to foreign key) of each foreign key’s referential integrity and identifies the number of referential integrity violations and "orphan" values (keys that do not match).

Figure 4. Foreign key analysis

Foreign key analysis view in WebSphere Information Analyzer

Cross-domain analysis

Cross-domain analysis examines content and relationships across tables. This analysis identifies overlaps in values between columns, and any redundancy of data within or between tables. For example, country or region codes might exist in two different customer tables and you want to maintain a consistent standard for these codes. Cross-domain analysis enables you to directly compare these code values.

WebSphere Information Analyzer uses the results of column analysis for each set of columns that you want to compare. The existence of a common domain might indicate a relationship between tables or the presence of redundant fields. Cross-domain analysis can compare any number of domains within or across sources.