Handling Outliers and Missing Values

The Quality tab in the audit report displays information about outliers, extremes, and missing values.

Figure 1. Data Audit browser, Quality tab

You can also specify methods for handling these values and generate SuperNodes that apply the transformations automatically. For example, you can select one or more fields and choose to impute or replace their missing values using one of several methods, including the C&RT algorithm.

Figure 2. Choosing an impute method

After you specify an impute method for one or more fields, generate a Missing Values SuperNode. From the menus choose:

Generate > Missing Values SuperNode

Figure 3. Generating the SuperNode

The generated SuperNode is added to the stream canvas, where you can attach it to the stream to apply the transformations.

Figure 4. Stream with Missing Values SuperNode

The SuperNode contains a series of nodes that perform the requested transformations. To see how it works, edit the SuperNode and click Zoom In.

Figure 5. Zooming in on the SuperNode

For example, each field imputed using the algorithm method has its own C&RT model, along with a Filler node that replaces blanks and nulls with the value predicted by that model. You can add, edit, or remove specific nodes within the SuperNode to further customize its behavior.
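The model-plus-Filler pattern described above can be sketched outside the product. The following Python is an illustration only, not the code the SuperNode generates: the function names, the field names, and the sample records are hypothetical, only None is treated as missing, and a simple group-mean predictor stands in for the C&RT model.

```python
from statistics import mean

def fit_group_mean_predictor(records, target, by):
    """Stand-in for a trained model (NOT C&RT): predicts `target` as the
    mean of the complete records that share the same `by` value."""
    groups = {}
    for rec in records:
        if rec.get(target) is not None:
            groups.setdefault(rec[by], []).append(rec[target])
    overall = mean(v for vals in groups.values() for v in vals)
    means = {key: mean(vals) for key, vals in groups.items()}
    # Fall back to the overall mean for groups with no complete records.
    return lambda rec: means.get(rec[by], overall)

def fill_missing(records, target, predict):
    """Filler-style step: replace None in `target` with the prediction.
    Returns new records and leaves the input untouched."""
    return [
        {**rec, target: predict(rec)} if rec.get(target) is None else dict(rec)
        for rec in records
    ]

# Hypothetical sample data for illustration.
records = [
    {"region": "north", "income": 40.0},
    {"region": "north", "income": 60.0},
    {"region": "south", "income": 30.0},
    {"region": "north", "income": None},
    {"region": "south", "income": None},
]

predict_income = fit_group_mean_predictor(records, "income", "region")
filled = fill_missing(records, "income", predict_income)
```

One predictor is fitted per imputed field, which mirrors the one-model-per-field layout inside the SuperNode.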

Alternatively, you can generate a Select or Filter node to remove records or fields with missing values. For example, you can filter out any fields whose quality percentage falls below a specified threshold.
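As a rough sketch of the threshold rule, the Python below drops fields whose share of complete values falls below a cutoff. It is an illustration under stated assumptions, not the logic of the generated Filter node: the helper names and sample data are hypothetical, the quality percentage is taken to be the percentage of records with a non-missing value, and only None and the empty string count as missing.

```python
def percent_complete(records, field):
    """Percentage of records in which `field` holds a non-missing value."""
    if not records:
        return 0.0
    complete = sum(1 for rec in records if rec.get(field) not in (None, ""))
    return 100.0 * complete / len(records)

def filter_fields(records, fields, threshold):
    """Keep only the fields whose completeness is at least `threshold` percent."""
    keep = [f for f in fields if percent_complete(records, f) >= threshold]
    return keep, [{f: rec.get(f) for f in keep} for rec in records]

# Hypothetical sample data for illustration.
records = [
    {"age": 34, "income": 52000, "city": "Leeds"},
    {"age": 41, "income": None, "city": "York"},
    {"age": 29, "income": None, "city": ""},
    {"age": 55, "income": 61000, "city": "Hull"},
]

kept, filtered = filter_fields(records, ["age", "income", "city"], threshold=70.0)
```

In this sample, income is complete in half of the records, so it is the only field removed at a 70 percent threshold.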

Figure 6. Generating a Filter node

Outliers and extreme values can be handled in a similar manner. Specify the action you want to take for each field (coerce, discard, or nullify), and then generate a SuperNode to apply the transformations.
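The three actions can be illustrated with a short Python sketch. It is a simplified illustration, not the generated SuperNode: the function names and sample values are hypothetical, the cutoff of three standard deviations from the mean is an assumption made for the example, and discard here drops a value from a list where the product would drop the whole record.

```python
from statistics import mean, stdev

def sd_limits(values, n_sd=3.0):
    """Lower and upper limits at `n_sd` sample standard deviations from the mean.
    The default of 3.0 is an assumption for this example."""
    present = [v for v in values if v is not None]
    centre, spread = mean(present), stdev(present)
    return centre - n_sd * spread, centre + n_sd * spread

def handle_outliers(values, lower, upper, action):
    """Apply one action to every value outside [lower, upper]."""
    result = []
    for v in values:
        if v is None or lower <= v <= upper:
            result.append(v)                        # in range: keep as is
        elif action == "coerce":
            result.append(min(max(v, lower), upper))  # clamp to the nearest limit
        elif action == "nullify":
            result.append(None)                     # replace with a null value
        elif action == "discard":
            continue                                # drop the value entirely
        else:
            raise ValueError(f"unknown action: {action}")
    return result

# Hypothetical field: twenty typical values and one far from the rest.
values = [10.0] * 20 + [100.0]
lower, upper = sd_limits(values)

coerced = handle_outliers(values, lower, upper, "coerce")
nullified = handle_outliers(values, lower, upper, "nullify")
discarded = handle_outliers(values, lower, upper, "discard")
```

Coercing keeps the record count intact and caps the value at the limit, nullifying turns the value into a missing value that can be imputed later, and discarding shrinks the data set.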

Figure 7. Generating a Filter node

After completing the audit and adding the generated nodes to the stream, you can proceed with your analysis. You can also screen your data further using Anomaly Detection, Feature Selection, or other methods.

Figure 8. Stream with Missing Values SuperNode