Information icon IBM Information Server, Version 8.1
Feedback

Data profiling and analysis

WebSphere® Information Analyzer provides extensive capabilities for profiling source data. The four main data profiling functions are column analysis, primary key analysis, foreign key analysis, and cross-domain analysis.

Column analysis

Column analysis generates a full frequency distribution and examines all values for a column to infer its definition and properties such as domain values, statistical measures, and minimum and maximum values. Each column of every source table is examined in detail. The following properties are observed and recorded:

Figure 1. Column analysis example data view
WebSphere Information Analyzer column analysis view

WebSphere Information Analyzer also enables you to drill down on specific columns to define unique quality control measures for each column. Figure 1 shows a closer look at results for a table named GlobalCo_Ord_Dtl. At the top is a summary analysis of the entire table. Beneath the summary is detail for each column that shows standard data profiling results, including data classification, cardinality, and properties. When you select a column, additional tasks that are relevant to that level of analysis become available.

Another function of column analysis is domain analysis. A domain is a valid set of values for an attribute. Domain analysis determines the data domain values for any data element. By using a frequency distribution, you can facilitate testing by providing a list of all the values in a column and the number of occurrences of each. Domain analysis checks whether a data element corresponds to a value in a database table or file. Figure 2 shows a frequency distribution chart that helps find anomalies in the Qtyord column.

Figure 2. Column analysis example graphical view
WebSphere Information Analyzer frequency distribution for the Qtyord column

The bar chart shows data values on the y-axis and the frequency of those values on the x-axis. This detail points out default and invalid values based on specific selection, ranges, or reference sources, and aids you in iteratively building quality metrics.

When you are validating free-form text, analyzing and understanding the extent of the quality issues is often very difficult. WebSphere Information Analyzer can show each data pattern of the text for a much more detailed quality investigation. It helps with the following tasks:

Primary key analysis

The primary key of a relational table is a unique identifier that a database uses to access a specific row. Primary key analysis identifies all candidate keys for one or more tables and helps you test a column or combination of columns to determine if it is a candidate for becoming the primary key. Figure 3 shows a single-column analysis.

Figure 3. Primary key analysis
Primary key analysis for a single column

The analysis presents all of the columns and the potential primary key candidates. A duplicate check validates the use of such keys. You select the primary key candidate based on its probability for uniqueness and your business knowledge of the data involved. If you select a multi-data column as the primary key, the system will develop a frequency distribution for the concatenated values.

Foreign key analysis

Foreign key analysis examines content and relationships across tables. This analysis helps identify foreign keys, check their integrity, and check the referential integrity between the primary key and foreign keys. For example, in a Bill of Materials structure, the parent-child relationships among assemblies and subassemblies would require you to identify relationships between foreign keys and primary keys and validate their referential integrity.

A column qualifies to be a foreign key candidate if the majority (for example, 98 percent or higher) of its frequency distribution values match the frequency distribution values of a primary key column. As Figure 4 shows, after you select a foreign key, the system performs a bidirectional test (foreign key to primary key, primary key to foreign key) of each foreign key’s referential integrity and identifies the number of referential integrity violations and "orphan" values (keys that do not match).

Figure 4. Foreign key analysis
Foreign key analysis view in WebSphere Information Analyzer

Cross-domain analysis

Cross-domain analysis examines content and relationships across tables. This analysis identifies overlaps in values between columns, and any redundancy of data within or between tables. For example, country or region codes might exist in two different customer tables and you want to maintain a consistent standard for these codes. Cross-domain analysis enables you to directly compare these code values.

WebSphere Information Analyzer uses the results of column analysis for each set of columns that you want to compare. The existence of a common domain might indicate a relationship between tables or the presence of redundant fields. Cross-domain analysis can compare any number of domains within or across sources.

Related concepts
Data monitoring and trending
Results of the analysis

PDF This topic is also in the IBM Information Server Introduction.

Update icon Last updated: 2008-09-15