Data profiling process

You use the data profiling process to evaluate the quality of your data. The data profiling process consists of multiple analyses that investigate the structure and content of your data, and make inferences about your data. After an analysis completes, you can review the results and accept or reject the inferences.

The data profiling process consists of multiple analyses that work together to evaluate your data:

Column analysis

Column analysis is a prerequisite to all other analyses except for cross-domain analysis. During a column analysis job, the column or field data is evaluated in a table or file and a frequency distribution is created. A frequency distribution summarizes the results for each column such as statistics and inferences about the characteristics of your data. You review the frequency distribution to find anomalies in your data. When you want to remove the anomalies in your data, you can use a data cleansing tool such as IBM® InfoSphere® QualityStage® to remove the values.

A frequency distribution is also used as the input for subsequent analyses such as primary key analysis and baseline analysis.

The column analysis process incorporates four analyses:

Domain analysis: Identifies invalid and incomplete data values. Invalid and incomplete values impact the quality of your data by making the data difficult to access and use. You can use the results from domain analysis when you want to use a data cleansing tool to remove the anomalies.

Data classification analysis: Infers a data class for each column in your data. Data classes categorize data. If your data is classified incorrectly, it is difficult to compare the data with other domains of data. You compare domains of data when you want to find data that contains similar values.

Format analysis: Creates a format expression for the values in your data. A format expression is a pattern that contains a character symbol for each distinct character in a column. For example, each alphabetic character might have a character symbol of A and numeric characters might have a character symbol of 9. Accurate formats ensure that your data is consistent with defined standards.

Data properties analysis: Data properties analysis compares the accuracy of defined properties about your data before analysis to the system-inferred properties that are made during analysis. Data properties define the characteristics of data such as field length or data type. Consistent and well-defined properties help to ensure that data is used efficiently.

Key and cross-domain analysis

During a key and cross-domain analysis job, your data is assessed for relationships between tables. The values in your data are evaluated for foreign key candidates, and defined foreign keys. A column might be inferred as a candidate for a foreign key when the values in the column match the values of an associated primary or natural key. If a foreign key is incorrect, the relationship that it has with a primary or natural key in another table is lost.

After a key and cross-domain analysis job completes, you can run a referential integrity analysis job on your data. Referential integrity analysis is an analysis that you use to fully identify violations between foreign key and primary or natural key relationships. During a referential integrity analysis job, foreign key candidates are investigated at a concise level to ensure that they match the values of an associated primary key or natural key.

A key and cross-domain analysis job will also help you to determine whether multiple columns share a common domain. A common domain exists when multiple columns contain overlapping data. Columns that share a common domain might signal the relationship between a foreign key and a primary key, which you can investigate further during a foreign key analysis job. However, most common domains represent redundancies between columns. If there are redundancies in your data, you might want to use a data cleansing tool to remove them because redundant data can take up memory and slow down the processes that are associated with them.

Baseline analysis

You run a baseline analysis job to compare a prior version of analysis results with the current analysis results for the same data source. If differences between both versions are found, you can assess the significance of the change, such as whether the quality has improved.

The following figure shows how data profiling analyses work together:

Figure 1. Data profiling analyses

Note: You do not need to run a column analysis job before you run a cross-domain analysis job.