IBM InfoSphere Information Analyzer, Version 11.3.1
You use the data profiling process to evaluate the quality of your data. The data profiling process consists of multiple analyses that investigate the structure and content of your data, and make inferences about your data. After an analysis completes, you can review the results and accept or reject the inferences.
The following analyses work together to evaluate your data:
Column analysis is a prerequisite to all other analyses except for cross-domain analysis. During a column analysis job, the data in each column or field of a table or file is evaluated and a frequency distribution is created. A frequency distribution summarizes the results for each column, such as statistics and inferences about the characteristics of your data. You review the frequency distribution to find anomalies in your data. To remove the anomalous values, you can use a data cleansing tool such as IBM® InfoSphere® QualityStage®.
A frequency distribution is also used as the input for subsequent analyses such as primary key analysis and baseline analysis.
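As a rough illustration of what a frequency distribution holds, the following Python sketch counts the distinct values in one column and derives a few summary statistics. It does not use or represent the Information Analyzer API; the function name `frequency_distribution`, the result keys, and the sample column are assumptions made for this example.

```python
from collections import Counter


def frequency_distribution(values):
    """Summarize one column: a count per distinct value plus basic statistics.

    Illustrative only; the layout of the result is an assumption.
    """
    rows = list(values)
    counts = Counter(rows)
    total = len(rows)
    return {
        "total": total,
        "distinct": len(counts),
        "nulls": counts.get(None, 0),
        # Each entry is (value, count, share of all rows), most frequent first.
        "values": [(v, c, c / total) for v, c in counts.most_common()],
    }


# Hypothetical sample column: the rare "XX" and the null are possible anomalies.
state_codes = ["NY", "NY", "CA", "CA", "CA", "XX", None]
summary = frequency_distribution(state_codes)

for value, count, share in summary["values"]:
    print(f"{value!r:>6}  {count}  {share:.0%}")
```

Values that occur rarely, or that do not fit the pattern of the other values, are the ones to look at first when you review the distribution.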
The column analysis process incorporates four analyses:
During a key and cross-domain analysis job, your data is assessed for relationships between tables. The values in your data are evaluated for foreign key candidates and defined foreign keys. A column might be inferred as a candidate for a foreign key when the values in the column match the values of an associated primary or natural key. If a foreign key is incorrect, the relationship that it has with a primary or natural key in another table is lost.
After a key and cross-domain analysis job completes, you can run a referential integrity analysis job on your data. You use referential integrity analysis to identify all violations of the relationships between foreign keys and primary or natural keys. During a referential integrity analysis job, foreign key candidates are investigated at a detailed level to ensure that they match the values of an associated primary key or natural key.
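The idea behind a referential integrity check can be sketched in a few lines of Python: every foreign key value that has no matching primary or natural key value is a violation. This is a minimal sketch, not the product's implementation; the function name and the sample tables are hypothetical, and skipping nulls is an assumption about policy.

```python
def referential_integrity_violations(foreign_key_values, primary_key_values):
    """Return the distinct foreign key values that match no primary key value."""
    known_keys = set(primary_key_values)
    # Assumption: nulls are skipped; whether a null foreign key counts as a
    # violation depends on the rules for your data.
    return sorted(
        {v for v in foreign_key_values if v is not None and v not in known_keys}
    )


# Hypothetical data: CUSTOMER.ID is the primary key, ORDERS.CUSTOMER_ID the
# foreign key candidate.
customer_ids = [1, 2, 3, 4]
order_customer_ids = [1, 1, 2, 5, None, 7]

orphans = referential_integrity_violations(order_customer_ids, customer_ids)
print(orphans)  # [5, 7]
```

An empty result means that every non-null foreign key value has a matching key in the other table.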
A key and cross-domain analysis job also helps you to determine whether multiple columns share a common domain. A common domain exists when multiple columns contain overlapping data. Columns that share a common domain might signal the relationship between a foreign key and a primary key, which you can investigate further during a foreign key analysis job. However, most common domains represent redundancies between columns. If there are redundancies in your data, you might want to use a data cleansing tool to remove them because redundant data can take up memory and slow down the processes that are associated with it.
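One way to picture a common domain test is to measure how many of the distinct values in one column also occur in the other. The sketch below is an assumption-laden illustration: the function name, the 98 percent threshold, and the sample columns are all hypothetical, and the product might measure overlap differently.

```python
def domain_overlap(column_a, column_b):
    """Return the share of each column's distinct values found in the other."""
    a, b = set(column_a), set(column_b)
    common = a & b
    a_in_b = len(common) / len(a) if a else 0.0
    b_in_a = len(common) / len(b) if b else 0.0
    return a_in_b, b_in_a


# Hypothetical cut-off for treating two columns as sharing a common domain.
THRESHOLD = 0.98

ship_country = ["US", "DE", "FR", "US"]
bill_country = ["US", "DE", "FR", "JP"]

a_in_b, b_in_a = domain_overlap(ship_country, bill_country)
shares_domain = a_in_b >= THRESHOLD or b_in_a >= THRESHOLD
print(a_in_b, b_in_a, shares_domain)  # 1.0 0.75 True
```

The overlap is reported in both directions because one column can be fully contained in another without the reverse being true, which is the typical shape of a foreign key to primary key relationship.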
You run a baseline analysis job to compare a prior version of analysis results with the current analysis results for the same data source. If differences between the two versions are found, you can assess the significance of the change, such as whether the quality has improved.
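A baseline comparison can be thought of as a difference between two saved sets of results. The following sketch compares two value-to-count mappings for the same column; the function name, the result keys, and the sample counts are assumptions for illustration and do not reflect how the product stores or compares baselines.

```python
def compare_to_baseline(baseline, current):
    """Compare two value -> count mappings taken from the same column."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        # For values present in both runs, keep (old count, new count) when
        # the count differs.
        "changed": {
            v: (baseline[v], current[v])
            for v in sorted(shared)
            if baseline[v] != current[v]
        },
    }


# Hypothetical results from an earlier run and from the current run.
baseline_counts = {"NY": 2, "CA": 3, "XX": 1}
current_counts = {"NY": 2, "CA": 4, "TX": 1}

diff = compare_to_baseline(baseline_counts, current_counts)
print(diff)
```

In this sample the invalid value "XX" no longer appears in the current results, which is the kind of change you might read as an improvement in quality.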
The following figure shows how data profiling analyses work together:
