Commonality analysis

The commonality analysis function identifies pairs of columns that have a significant number of common domain values. The columns might or might not be in the same data source, and might or might not have the same column name. The function is used to find like columns and redundant columns, and is used within the foreign key analysis function.

Analysis technique

The commonality analysis approach is that the system compares the frequency distribution data values for pairs of columns to determine the percentage of common data values one column has that are also in the other column.

The user defines the tables or columns to participate in the analysis. The system generates all the possible column pairings. (For example, pair A-B is included, but pair B-A is not because it is redundant with A-B.) The total list of generated column pairs is subjected to a column pair compatibility test which eliminates any pair from the list whose columns can be predetermined not to be capable of having common data values. The system uses several column properties (for example, data type, length, general formats, and so forth). Those column pairs that pass the column pair compatibility test continue into the next analysis step.

For each compatible column pair, the frequency distribution data values from the first column are compared to the frequency distribution data values of the second column. The process is repeated again with the second column compared to the first column. The analysis is completed with the recording of the A to B and the B to A commonality percentage. The computed percentages are then compared to the applicable common domain threshold percentage in the analysis options. If the percentage is equal or greater than the common domain threshold percentage, the column pair (for example, A-B or B-A) is flagged as being in common.

System capability

The commonality analysis is initiated by the user selecting the tables or columns to participate in the analysis. The system uses these selections to generate the list of every possible column pair combination. The system then proceeds to perform the column pair compatibility test on the full list of column pairs. When that step is completed, the system displays the intermediate results including summary statistics of how many column pairs were generated and how many of those passed or failed the compatibility test. The system also displays a detailed list of the column pairs that passed and would be used in the next analysis step. Review this information and decide whether to continue the process or to return to the original data selection step for modifications.

If you choose to continue, the system begins the process of comparing frequency distribution data values for each column pair in turn until all of the column pairs have been analyzed. The process captures the commonality percentages for each pair and any commonality flags that were set. One feature of the analysis is that if the two columns have had their domain values previously compared, it was recorded in a history file with the date the comparison was performed. The system will use the percentages from the history file unless one of the column's frequency distributions has been updated since that date.

When you initiate the commonality review, the system presents the results with a tab for each table used in the process. Each tab displays a table's columns and any paired columns (for example, any other column from any other table) that was flagged as having common domain values. If you choose, the system can also display the non-flagged pairs and can flag a column pair as being redundant. The following figure provides an example.

Figure 1. An example of a table's columns and any paired columns that was flagged as having common domain values

Shows a table's columns and any paired columns that was flagged as having common domain values.

User responsibility

You initiate the commonality analysis process by selecting tables or columns to be analyzed.

After generating the column pairs and performing the column pair compatibility tests, the system displays the interim results to the user. After reviewing those results, decide whether to continue the process or return to the data selection step.

When the system completes the commonality analysis, view the results. Each table and its columns are displayed on a tab that shows all other columns that have commonality with the columns in that table. If needed, you can flag any of the column pairs as containing a redundant column.

Interpreting results

When viewing the interim results, focus on two areas.

First, evaluate the total number of column pairs to be processed in the next step. The number generally should be 20% or less of the total number of column pairs generated. It should also be a reasonable workload for the computing resources being used.
Second, verify that any important column pairs have successfully passed the compatibility test.

When the actual commonality results are reviewed, keep in mind that some of the column pairs flagged as common are to be expected. Every foreign key column should result in commonality with its corresponding primary key column. Those column pairs should not be flagged as redundant. For column pairs that do not have a key relationship, your judgment is needed to identify columns that are truly redundant.

Decisions and actions

You have several key decisions to make during commonality analysis.

Selecting tables and columns to be analyzed
Proceeding with analysis based on interim results
Identifying columns that are truly redundant

Performance considerations

The most significant system performance consideration for the commonality analysis function is the total number of column pairs that can be reasonably processed as one job.