Domain analysis

Domain analysis is used to identify records that have invalid data values in the column. It is important for you to know what percentage of a column has invalid data.

Function

Only complete data values are considered when determining invalid data values. Data values that are judged to be incomplete take precedence over data values that are invalid.

Domain analysis technique

The approach to domain analysis is that you, or the system, review the column's distinct data values to mark any data value considered to be “invalid”. The system has multiple types of domain analysis (for example, techniques) that can be used to perform the function. The criteria differ for determining valid from invalid data values, but all result in identifying and marking the column's data values as invalid when appropriate. The system assumes which type to use based on the data class of the column in question. However, you can choose to use any of the domain analysis types for any column regardless of its data class. The three domain analysis types are:

Value, where you manually review all data values
Range, where you set minimum and maximum valid data values
Reference File, where the system uses an external validity file

System capability

During column analysis processing, the system constructs each column's frequency distribution. During your review of the results, the system presents the domain analysis results in one of two ways: Value or Range, based on the data class of the column. The user interface contains a domain type selection box that allows you to change to a different domain analysis type if desired.

For columns that are not data class QUANTITY or DATE, the system assumes the Value domain analysis type will be used. The distinct data values from the frequency distribution are displayed in a grid that includes a validity flag for each distinct data value. All of the validity flags in the frequency distribution were set to valid when the frequency distribution was created.

Note: Some data values might have already been set to default if completeness analysis was previously performed.

You can examine the data values and change any data value's validity flag from valid to invalid. As you set validity flags to invalid for distinct data values, the system keeps a running list of the invalid data values and a total of the record count and record percentage, as well as the distinct value count and percentage, for invalid data on the screen. These statistical results and flag settings can be saved in the system at any time and ultimately can be recorded in the repository as the column's domain results when all of the data values have been inspected and the domain and completeness analysis has been marked as reviewed, as shown in the following figure.

Figure 1. An example of the statistical results and flag settings

Shows the statistical results and flag settings that can be saved in the system and ultimately can be recorded in the repository as the column's domain results.

For columns that are data class QUANTITY or DATE, the system assumes the Range domain analysis type will be used. The frequency distribution's outliers (a number of the lowest distinct data values and the same number of the highest distinct data values from the sorted frequency distribution) are displayed in a grid that includes a validity flag for each distinct data value. All of the validity flags in the frequency distribution will have been set to valid when the frequency distribution was created.

Note: Some data values might have already been set to “default” if completeness analysis was performed prior to domain analysis.

You visually inspect the displayed low and high values and can change the validity flag on any of those data values to minimum (for example, a low value) or to maximum (for example, a high value) to establish the range of valid values. If the required minimum or maximum data value is not in the display, you can increase the number of outliers being displayed by the system. Also, if the minimum or maximum data value required is not in the frequency distribution, you can manually enter the appropriate data values into a text field.

The system can also provide additional domain analysis for Range if you need it. You can request that the system perform a quintile analysis of the column's frequency distribution. In a quintile analysis, the system divides the sorted (for example, numeric sort for numeric columns, chronological sort for date and time columns, and alpha sort for alphanumeric columns) frequency distribution into five segments with an equal number of distinct data values.

Note: The fifth segment might not be equal if the cardinality is not perfectly divisible by five.

After the system analysis process is completed, the system displays a graph that shows the distribution of the records across all five segments as well as the low and high distinct data values of each segment, and so on. This holistic view of the column's data might help you to determine where the minimum and the maximum values for the column should be set.

As you set validity flags to minimum or maximum, the system automatically sets the validity flags for data values lower than the minimum value or higher than the maximum value to invalid unless the flag is already set to default. Both the minimum value and the maximum value are considered to be valid values. As this system operation happens, the system keeps a running list of the invalid data values and a total of the record count and record percentage, as well as the distinct value count and percentage, for invalid data on the screen. These statistical results and flag settings can be saved in the system at any time and ultimately can be recorded in the repository as the domain results when all of the data values have been inspected and the domain and completeness analysis flag has been marked as reviewed, as shown in the following figure.

Figure 2. An example of the statistical results and flag settings

Shows the statistical results and flag settings that can be saved in the system and can be recorded in the repository as the domain results.

The third domain analysis type is known as the Reference File type. If you have an external file known to contain all of the valid data values permissible in the column (for example, a zip code file), it can be used by the system to automatically detect invalid values in the column's frequency distribution.

The system will take each distinct data value from the frequency distribution and search the external file to see if that value is in the file. Any distinct data value not found on the external file will have its validity flag automatically set to “invalid” by the system.

As you or the system sets validity flags to invalid for distinct data values, the system keeps a running list of the invalid data values and a total of the record count and record percentage, as well as the distinct value count and percentage for invalid data on the user interface screen. These statistical results and flag settings can be saved in the system at any time and ultimately can be recorded in the repository as the domain results when all of the data values have been evaluated and the domain and completeness analysis flag has been marked as reviewed.

User responsibility

You view the domain analysis when the Column Analysis review of columns is viewed. At the detailed column view, domain and completeness analysis has its own tab for viewing results. The system will present either the Value Domain Type window or the Range type window based on the column's data class. You can either accept that domain analysis type and proceed, or choose a more preferred domain analysis type to be used for the column.

Once the appropriate domain analysis type has been selected, you focus on different aspects of the process depending upon which domain analysis type is chosen.

For Value Domain Type, the key task is the to examine the data values. This should be done carefully for columns that have a relatively small cardinality (for example, CODE or INDICATOR columns). If the column has a large cardinality (for example, TEXT or IDENTIFIER columns), it might be too time-consuming to inspect each value. In that case, two approaches that can produce the right results in less time include looking at the high end and low end of the frequency distribution first by data value and then by frequency count.

If the column has data values that can be logically sorted (for example, DATE, QUANTITY, or sequentially assigned IDENTIFIER columns), Range type is the most efficient means to perform domain analysis. An appropriate minimum and maximum value should be carefully chosen to define the valid range.

If a reliable external file exists, Reference File is the preferred domain analysis type. Some caution should be used in using reference files created internally in the organization as opposed to reference files made available by external organizations. An internal reference file can sometimes be inaccurate and the root cause of invalid data found in the data sources.

For Range and Reference File, the domain analysis process is completed after you are satisfied with the results. For the Value Domain type, the process is completed after the visual inspection of all relevant data values has been performed.

Finally, you flag the domain and completeness status as completed for that column.

Interpreting results

Typically, most of the invalid data values for a column will be obvious to you upon examination. The causes of this invalid data can vary from inaccurate data capture, system processing defects or inaccurate data conversions.

The significance of the amount of invalid data in a column is dependent upon the nature of the column and its impact on business processes and system operations. If the analysis of data is driven by the need to use the data as source for new data integration, it is critical that the data either be corrected or that the integration system analysts be aware of the invalid data and how it should be treated.

To improve and maintain the quality of domain values, it is also important to trace problems to their root causes so that you can take corrective action and prevent further proliferation of the problems.

Decisions and actions

You have multiple decisions to make for each column for domain analysis. You can choose the appropriate domain analysis type to be used for the column. Then you must decide the validity of individual data values, determine the minimum and maximum valid data values, or determine the availability of an acceptable external reference file.

After all the decisions for a column have been made, you can continue to review the column's domain values for validity or can mark the column domain and completeness review as complete.

At this point, you can also request the creation of a validity table that will create a single-column reference file that contains all of the valid distinct data values from the column's frequency distribution. This validity table can be exported to other products or systems for application to the data source, as shown in the following figure.

Figure 3. An example of the window where you select the reference table type

Also, if you wish to initiate corrective action on the invalid data, you can begin the process for any distinct data value flagged as invalid by entering a new replacement value into a transformation value field in the frequency distribution grid. You can then request the creation of a mapping table that will create a two-column reference file with the first column containing the existing data value, and the second column containing the entered replacement value. This mapping table can be exported to other products or systems for application to the data source.

Performance considerations

There are two important performance considerations when conducting domain analysis.

When using value domain analysis type, you should be aware of the significant time required to inspect every data value for columns with high cardinality in their frequency distribution.
Also, when using range domain analysis type, requesting a quintile analysis can create a lengthy processing job for columns with high cardinality in their frequency distribution.