Data classification analysis

The data classification analysis function is the process of assigning columns into meaningful categories that can be used to organize and focus subsequent analysis work.

Function

The following attributes in IBM® InfoSphere® Information Analyzer can be used for data classification.

Data Class (system-inferred) – a system-defined semantic business use categories for the column
Data Sub-Class (optional) – a user-defined semantic business use category within a data class
User Class (optional) – a user-defined category independent of data class

For the system-inferred data class, a column is categorized into one of the following system-defined data classification designations:

IDENTIFIER: Columns that contain generally non-intelligent data values that reference a specific entity type (for example, a customer number).

CODE: Columns that contain finite data values from a specific domain set, each of which has a specific meaning (for example, a product status code).

INDICATOR: Similar to a code except there are only two permissible binary values in the domain set (for example, a yes/no indicator).

DATE: Columns that contain data values that are a specific date, time, or duration (for example, a product order date).

QUANTITY: Columns that contain numerical data that could be used in a computation (for example, a product price).

TEXT: Columns that contain freeform alphanumeric data values from an unlimited domain set (for example, a product description).

LARGE OBJECT: Columns that contain large-object data (for example, a product image).

UNKNOWN: Columns that cannot be classified into one of the above classes by the system algorithm.

There are multiple system reports that can display the columns and their data class designation.

Technique

The initial data classification designation for a column is inferred by the system during column analysis processing after the column's frequency distribution has been created. The system uses an algorithm that factors in the cardinality, data type, uniqueness, and length of the column's data values to infer its likely data class. Under certain conditions, the algorithm might not be able to produce a specific data class designation, in which case the system assigns a designation of UNKNOWN.

During column analysis review, the system-inferred data classification of the column for review is presented. You can accept that inference as true or override that inference based on their knowledge or research of the column and its data.

System capability

The system automatically applies the data classification algorithm to each column whenever it performs column analysis processing. Using a system algorithm, InfoSphere Information Analyzer analyzes key information about the column to derive the most probable data class that the column belongs to. This system-inferred data classification designation is then recorded in the repository as the inferred selection and by default becomes the chosen selection, as shown in the following example.

Figure 1. An example of inferred data classification designation

Shows an example of inferred data classification designation

User responsibility

You views the system-inferred data classification designation during the column analysis review of columns. At the detailed column view, data classification has its own tab for viewing results. On this panel, you can accept the system-inferred data class or override the system's inference by selecting another data classification designation. If you override the system-inferred data class selection, the new selection is recorded in the repository as the chosen selection. The process is completed when you mark the data classification function for the column as reviewed.

Interpreting results

Typically, you might know each column by its name or alias so the decision to accept or override the system-inferred data classification is straightforward. However, when there is little or no familiarity with the column, a close inspection of the frequency distribution values and the defined data type will help you either confirm the system-inferred data class or choose another more appropriate data class.

For example, a common situation occurs when the column is a numeric data type and the choice is between CODE or QUANTITY data class. In general, CODE tends to have a lower cardinality with a higher frequency distribution count per data value than QUANTITY does, but there can be exceptions (for example, a Quantity Ordered field).

Frequently a column that is obviously an INDICATOR (the column name often includes the word/string ‘Flag') is inferred as a CODE because the presence of a third or fourth value in the frequency distribution (for example, “Y”, “N”, and null). In those cases, it is recommended that the column's data class be set to INDICATOR and that the extra values be marked as Incomplete or Invalid from the Domain Analysis screen or even corrected or eliminated in the source (for example, convert the nulls to “N”).

Also, sometimes choosing between CODE and IDENTIFIER can be a challenge. In general, if the column has a high percentage of unique values (for example, frequency distribution count equal to 1), it is more likely to be an IDENTIFIER data class.

Another data classification issue is the need to maintain the consistency of data classification designations across columns in the data environment. A common problem is a column (for example, customer number) that is the primary key of a table and thus has all the characteristics of an IDENTIFIER. However, the same column might also appear in another table as a foreign key and have all the characteristics of a CODE, even though it is in essence still an IDENTIFIER column. This is particularly true where the set of valid codes exist in a Reference Table and the codes are simultaneously the values and the identifiers.

With regard to data marked with an UNKNOWN data classification, the most common condition is that the values are null, blank, or spaces, and no other inference could be made. In this situation, the column is most likely not used and the field should be marked accordingly with a note or with a user-defined classification (such as EXCLUDED or NOT USED). In some scenarios, this might represent an opportunity to remove extraneous columns from the database; while in data integration efforts, these are fields to ignore when loading target systems.

Decisions and actions

You have only one decision to make for each column in data classification. You either accept the system-inferred data class or override the inferred data class by selecting another.

After that decision has been made, you can mark the column reviewed for data classification; or, you can add the optional data subclass and user class designations prior to marking the review status complete.

However, data classification provides a natural organizing schema for subsequent analysis, particularly for data quality controls analysis. The following table indicates common evaluation and analysis necessary based on each system-inferred data class. You can establish particular criteria for analysis based on user-defined data classes.

Table 1. Data classification analysis considerations and actions
Data class	Properties analysis	Domain analysis	Validity and format analysis
Identifier	Evaluate for Data Type, Length, Nulls, Unique Cardinality	Confirm maximum and minimum values Validate format (if text)	Look for out of range conditions (if applicable) Inconsistent or Invalid formats (if text/if applicable)
Indicator	Evaluate for Length, Nulls, Constant Cardinality	Confirm Valid values Assess Skewing and default values	Mark Invalid values
Code	Evaluate for Length, Nulls, Constant Cardinality	Confirm Valid values Assess Skewing and default values	Mark Invalid values
Date/Time	Evaluate for Data Type, Nulls, Constant Cardinality	Confirm valid values Assess Skewing and default values	Look for Out of range (if applicable) Mark Inconsistent or Invalid formats (if text or number)
Quantity	Evaluate for Data Type, Precision, Scale	Confirm Valid values Assess Skewing and default values	Look for Out of range (if applicable) Mark Inconsistent or Invalid formats (if text or number)
Text	Evaluate for Data Type, Length, Nulls, Unique, or Constant Cardinality	Evaluate for Default values, format requirements, and special characters	Mark Invalid special characters Mark Invalid format (if applicable)

Performance considerations

There are no system performance considerations for the data classification function.