Imputing Missing Values

The audit report lists the percentage of complete records for each field, along with the number of valid, null, and blank values. You can choose to impute missing values for specific fields as appropriate, and then generate a SuperNode to apply these transformations.

  1. In the Impute Missing column, specify the type of values you want to impute, if any. You can impute blanks, nulls, or both, or you can specify a custom condition or expression that selects the values to impute.

    There are several types of missing values recognized by IBM® SPSS® Modeler:

    • Null or system-missing values. These are nonstring values that have been left blank in the database or source file and have not been specifically defined as "missing" in a source or Type node. System-missing values are displayed as $null$. Note that empty strings are not considered nulls in IBM SPSS Modeler, although they may be treated as nulls by certain databases.
    • Empty strings and white space. Empty string values and white space (strings with no visible characters) are treated as distinct from null values. Empty strings are treated as equivalent to white space for most purposes. For example, if you select the option to treat white space as blanks in a source or Type node, this setting applies to empty strings as well.
    • Blank or user-defined missing values. These are values such as unknown, 99, or –1 that are explicitly defined in a source node or Type node as missing. Optionally, you can also choose to treat nulls and white space as blanks, which allows them to be flagged for special treatment and to be excluded from most calculations. For example, you can use the @BLANK function to treat these values, along with other types of missing values, as blanks.
  2. In the Method column, specify the method you want to use.

    The following methods are available for imputing missing values:

    Fixed. Substitutes a fixed value (either the field mean, midpoint of the range, or a constant that you specify).

    Random. Substitutes a random value based on a normal or uniform distribution.

    Expression. Allows you to specify a custom expression. For example, you could replace values with a global variable created by the Set Globals node.

    Algorithm. Substitutes a value predicted by a model based on the C&RT algorithm. A separate C&RT model is built for each field imputed using this method, along with a Filler node that replaces blanks and nulls with the value predicted by the model. A Filter node is then used to remove the prediction fields generated by the model.

  3. To generate a Missing Values SuperNode, from the menus choose:

    Generate > Missing Values SuperNode

    The Missing Values SuperNode dialog box is displayed.

  4. Select All fields or Selected fields only, and specify a sample size if desired. (The specified sample is a percentage; by default, 10% of all records are sampled.)
  5. Click OK to add the generated SuperNode to the stream canvas.
  6. Attach the SuperNode to the stream to apply the transformations.
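
The missing-value types and the Fixed and Random methods described in the steps above can be illustrated outside the product with a short Python sketch. This is not SPSS Modeler code or its scripting API: the function names (`classify`, `impute_fixed`, `impute_random`) and the `user_blanks` parameter are hypothetical. The sketch imputes every kind of missing value for simplicity, and it assumes the random draw is parameterized from the valid values of the field, which the text above does not specify.

```python
import random
import statistics


def classify(value, user_blanks=()):
    """Label one value as 'null', 'white_space', 'blank', or 'valid'."""
    if value is None:  # None stands in for $null$
        return "null"
    if isinstance(value, str) and value.strip() == "":
        return "white_space"  # empty strings are grouped with white space
    if value in user_blanks:  # user-defined missing values such as 99 or -1
        return "blank"
    return "valid"


def _valid(values, user_blanks):
    """Return only the values that are not missing in any sense."""
    return [v for v in values if classify(v, user_blanks) == "valid"]


def impute_fixed(values, user_blanks=(), how="mean", constant=None):
    """Fixed method: substitute the mean, the range midpoint, or a constant."""
    valid = _valid(values, user_blanks)
    if how == "mean":
        fill = statistics.mean(valid)
    elif how == "midrange":
        fill = (min(valid) + max(valid)) / 2
    elif how == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown method: {how}")
    return [v if classify(v, user_blanks) == "valid" else fill for v in values]


def impute_random(values, user_blanks=(), distribution="uniform", seed=None):
    """Random method: substitute a draw from a uniform or normal distribution.

    Assumption: the distribution parameters come from the valid values.
    """
    valid = _valid(values, user_blanks)
    rng = random.Random(seed)
    if distribution == "uniform":
        low, high = min(valid), max(valid)
        draw = lambda: rng.uniform(low, high)
    elif distribution == "normal":
        mu, sigma = statistics.mean(valid), statistics.stdev(valid)
        draw = lambda: rng.gauss(mu, sigma)
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    return [v if classify(v, user_blanks) == "valid" else draw() for v in values]


if __name__ == "__main__":
    ages = [30, None, 99, 50, 40]  # 99 is declared as a user-defined blank
    print([classify(v, user_blanks=(99,)) for v in ages])
    print(impute_fixed(ages, user_blanks=(99,), how="mean"))
    print(impute_random(ages, user_blanks=(99,), seed=1))
```

Keeping `classify` separate from the imputation functions mirrors the split between the Impute Missing column, which selects what counts as missing, and the Method column, which decides what to substitute.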

Within the SuperNode, a combination of model nugget, Filler, and Filter nodes is used as appropriate. To see how it works, edit the SuperNode and click Zoom In. You can then add, edit, or remove individual nodes within the SuperNode to fine-tune its behavior.
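
The predict, fill, and filter sequence inside the SuperNode can be pictured with the following Python sketch. It is an illustration only, not SPSS Modeler code. A per-group mean stands in for the C&RT model, nulls are the only missing values handled, and the `pred_` field prefix and all function names are hypothetical.

```python
from statistics import mean


def fit_group_mean(records, target, by):
    """Stand-in for the model: mean of `target` per value of `by`.

    This is not C&RT; it only plays the role of a fitted predictor.
    """
    groups = {}
    for rec in records:
        if rec[target] is not None:
            groups.setdefault(rec[by], []).append(rec[target])
    return {key: mean(vals) for key, vals in groups.items()}


def score(records, model, target, by):
    """Model nugget step: append a prediction field to every record."""
    pred = f"pred_{target}"
    return [{**rec, pred: model.get(rec[by])} for rec in records]


def filler(records, target):
    """Filler step: replace nulls in `target` with the predicted value."""
    pred = f"pred_{target}"
    return [
        {**rec, target: rec[pred] if rec[target] is None else rec[target]}
        for rec in records
    ]


def filter_fields(records, drop):
    """Filter step: remove the prediction fields once they have been used."""
    return [{k: v for k, v in rec.items() if k not in drop} for rec in records]


if __name__ == "__main__":
    data = [
        {"region": "north", "income": 100},
        {"region": "north", "income": None},
        {"region": "south", "income": 60},
        {"region": "north", "income": 200},
    ]
    model = fit_group_mean(data, target="income", by="region")
    scored = score(data, model, target="income", by="region")
    filled = filler(scored, target="income")
    print(filter_fields(filled, drop={"pred_income"}))
```

Each function corresponds to one node type, so replacing or removing a step in the sketch is analogous to editing an individual node after zooming in.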