Scoring Wizard: Matching model fields to dataset fields

To score the active dataset, the dataset must contain fields (variables) that correspond to all the predictors in the model. If the model also contains split fields, the dataset must also contain fields that correspond to all the split fields in the model.

  • By default, any fields in the active dataset that have the same name and type as fields in the model are automatically matched.
  • Use the drop-down list to match dataset fields to model fields. The data type for each field must be the same in both the model and the dataset in order to match fields.
  • You cannot continue with the wizard or score the active dataset unless all predictors (and split fields if present) in the model are matched with fields in the active dataset.
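The default matching rule and the "all fields matched" requirement above can be sketched in a few lines of Python. This is only an illustration, not the wizard's implementation: the function names `auto_match` and `can_continue`, the dictionary layout, and the exact (case-sensitive) name comparison are all assumptions made for the example.

```python
from typing import Dict, Optional


def auto_match(model_fields: Dict[str, str],
               dataset_fields: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each model field to a dataset field with the same name and type.

    Both arguments map field name -> data type (hypothetical layout).
    A model field with no same-name, same-type dataset field maps to None.
    """
    matches: Dict[str, Optional[str]] = {}
    for name, model_type in model_fields.items():
        # Assumption: names are compared exactly, including case.
        if dataset_fields.get(name) == model_type:
            matches[name] = name
        else:
            matches[name] = None
    return matches


def can_continue(matches: Dict[str, Optional[str]]) -> bool:
    """Scoring can proceed only when every model field has a match."""
    return all(target is not None for target in matches.values())


model = {"age": "numeric", "region": "string"}
dataset = {"age": "numeric", "region": "numeric", "income": "numeric"}
matches = auto_match(model, dataset)
```

In this example `region` exists in the dataset but with a different type, so it stays unmatched and would have to be matched manually to a field of the right type before scoring could continue.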

Dataset Fields. The drop-down list contains the names of all the fields in the active dataset. Fields that do not match the data type of the corresponding model field cannot be selected.

Model Fields. The fields used in the model.

Role. The displayed role can be one of the following:

  • Predictor. The field is used as a predictor in the model. That is, values of the predictors are used to "predict" values of the target outcome of interest.
  • Split. The values of the split fields are used to define subgroups, which are each scored separately. There is a separate subgroup for each unique combination of split field values. (Note: splits are only available for some models.)
  • Record ID. Record (case) identifier.
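To illustrate how split fields define subgroups, the following Python sketch groups records by each unique combination of split field values. The function name `split_subgroups` and the list-of-dictionaries record layout are assumptions for the example only.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple


def split_subgroups(records: List[Dict[str, Any]],
                    split_fields: Sequence[str]) -> Dict[Tuple[Any, ...], List[Dict[str, Any]]]:
    """Group records by the combination of their split field values."""
    groups: Dict[Tuple[Any, ...], List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        # One subgroup per unique combination of split field values.
        key = tuple(record[field] for field in split_fields)
        groups[key].append(record)
    return dict(groups)


records = [
    {"region": "North", "gender": "F", "age": 34},
    {"region": "North", "gender": "M", "age": 51},
    {"region": "North", "gender": "F", "age": 29},
]
subgroups = split_subgroups(records, ["region", "gender"])
```

Here two split fields produce two subgroups, ("North", "F") and ("North", "M"), and each subgroup would be scored separately.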

Measure. Measurement level for the field as defined in the model. For models in which measurement level can affect the scores, the measurement level as defined in the model is used, not the measurement level as defined in the active dataset. For more information on measurement level, see Variable measurement level.

Type. Data type as defined in the model. The data type in the active dataset must match the data type in the model. Data type can be one of the following:

  • String. Fields with a data type of string in the active dataset match the data type of string in the model.
  • Numeric. Numeric fields with display formats other than date or time formats in the active dataset match the numeric data type in the model. This includes F (numeric), Dollar, Dot, Comma, E (scientific notation), and custom currency formats. Fields with Wkday (day of week) and Month (month of year) formats are also considered numeric, not dates. For some model types, date and time fields in the active dataset are also considered a match for the numeric data type in the model.
  • Date. Numeric fields with display formats that include the date but not the time in the active dataset match the date type in the model. This includes Date (dd-mm-yyyy), Adate (mm/dd/yyyy), Edate (dd.mm.yyyy), Sdate (yyyy/mm/dd), and Jdate (dddyyyy).
  • Time. Numeric fields with display formats that include the time but not the date in the active dataset match the time data type in the model. This includes Time (hh:mm:ss) and Dtime (dd hh:mm:ss).
  • Timestamp. Numeric fields with a display format that includes both the date and the time in the active dataset match the timestamp data type in the model. This corresponds to the Datetime format (dd-mm-yyyy hh:mm:ss) in the active dataset.
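The type rules above amount to a lookup from a field's display format to a model data type. The Python sketch below shows one way to express that lookup. It is an illustration under assumptions: the function name `model_type_for` is hypothetical, formats are passed as bare names without width or decimals, the custom currency formats are assumed to be named CCA through CCE, and the model-specific exception for date and time fields matching numeric is not modeled.

```python
from typing import Optional

# Format names taken from the list above (upper-cased for comparison).
NUMERIC_FORMATS = {"F", "DOLLAR", "DOT", "COMMA", "E", "WKDAY", "MONTH",
                   "CCA", "CCB", "CCC", "CCD", "CCE"}  # CCA-CCE: assumed custom currency names
DATE_FORMATS = {"DATE", "ADATE", "EDATE", "SDATE", "JDATE"}
TIME_FORMATS = {"TIME", "DTIME"}
TIMESTAMP_FORMATS = {"DATETIME"}


def model_type_for(is_string: bool, display_format: str = "") -> Optional[str]:
    """Return the model data type a dataset field would match, or None if unknown."""
    if is_string:
        return "string"
    fmt = display_format.strip().upper()
    if fmt in NUMERIC_FORMATS:
        return "numeric"
    if fmt in DATE_FORMATS:
        return "date"
    if fmt in TIME_FORMATS:
        return "time"
    if fmt in TIMESTAMP_FORMATS:
        return "timestamp"
    return None  # format not covered by the list above
```

For example, a field with the Wkday format maps to "numeric" rather than "date", which mirrors the rule that day-of-week and month-of-year fields are treated as numeric.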

Note: In addition to field name and type, you should make sure that the actual data values in the dataset being scored are recorded in the same fashion as the data values in the dataset used to build the model. For example, if the model was built with an Income field that has income divided into four categories, and IncomeCategory in the active dataset has income divided into six categories or four different categories, those fields don't really match each other and the resulting scores will not be reliable.

Missing Values

This group of options controls how missing values in the model's predictor variables are treated when they are encountered during scoring. In the context of scoring, a missing value is one of the following:

  • A predictor contains no value. For numeric fields (variables), this means the system-missing value. For string fields, this means a null string.
  • The value has been defined as user-missing, in the model, for the given predictor. Values defined as user-missing in the active dataset, but not in the model, are not treated as missing values in the scoring process.
  • The predictor is categorical and the value is not one of the categories defined in the model.
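The three conditions above can be combined into a single check. The following Python sketch is illustrative only: the function name `is_scoring_missing` and its parameters are hypothetical, the system-missing value is represented as `None` or NaN, and a null string is represented as the empty string.

```python
import math
from typing import Any, Collection, Optional


def is_scoring_missing(value: Any,
                       model_user_missing: Collection[Any] = (),
                       model_categories: Optional[Collection[Any]] = None) -> bool:
    """Return True if `value` counts as missing for scoring purposes."""
    # 1. No value: system-missing (numeric) or a null string.
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value == "":
        return True
    # 2. Defined as user-missing in the model (dataset definitions are ignored).
    if value in model_user_missing:
        return True
    # 3. Categorical predictor with a category the model does not know.
    if model_categories is not None and value not in model_categories:
        return True
    return False
```

Note that only the model's user-missing definitions are passed in. A value flagged as user-missing in the active dataset but not in the model would be treated as a valid value by this check, as described above.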

Use Value Substitution. Attempt to use value substitution when scoring cases with missing values. The method for determining a value to substitute for a missing value depends on the type of predictive model.

  • Linear Regression and Discriminant models. For independent variables in linear regression and discriminant models, if mean value substitution for missing values was specified when building and saving the model, then this mean value is used in place of the missing value in the scoring computation, and scoring proceeds. If the mean value is not available, then the system-missing value is returned.
  • Decision Tree models. For the CHAID and Exhaustive CHAID models, the biggest child node is selected for a missing split variable. The biggest child node is the one with the largest population among the child nodes using learning sample cases. For C&RT and QUEST models, surrogate split variables (if any) are used first. (Surrogate splits are splits that attempt to match the original split as closely as possible using alternate predictors.) If no surrogate splits are specified or all surrogate split variables are missing, the biggest child node is used.
  • Logistic Regression models. For covariates in logistic regression models, if a mean value of the predictor was included as part of the saved model, then this mean value is used in place of the missing value in the scoring computation, and scoring proceeds. If the predictor is categorical (for example, a factor in a logistic regression model), or if the mean value is not available, then the system-missing value is returned.
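The substitution rules above can be sketched as follows. This is a simplified illustration under assumptions, not the scoring engine's code: all function names are hypothetical, `None` stands for the system-missing value, surrogate splits are represented as a list of already-evaluated branch choices (with `None` where the surrogate variable is itself missing), and tie handling between equally large child nodes is not specified in the text.

```python
from typing import Dict, Optional, Sequence


def substitute_covariate(value: Optional[float],
                         saved_mean: Optional[float],
                         is_categorical: bool = False) -> Optional[float]:
    """Regression-style rule: use the saved mean for a missing covariate."""
    if value is not None:
        return value
    if is_categorical or saved_mean is None:
        return None  # system-missing is returned
    return saved_mean


def biggest_child(child_counts: Dict[str, int]) -> str:
    """CHAID-style rule: pick the child node with the most learning-sample cases."""
    # Ties resolve to the first node encountered (an assumption of this sketch).
    return max(child_counts, key=child_counts.get)


def route_with_surrogates(surrogate_branches: Sequence[Optional[str]],
                          child_counts: Dict[str, int]) -> str:
    """C&RT/QUEST-style rule: try surrogates in order, then the biggest child."""
    for branch in surrogate_branches:
        if branch is not None:
            return branch
    return biggest_child(child_counts)


children = {"node_2": 120, "node_3": 340, "node_4": 75}
```

With these child counts, a case whose split variable is missing would go to `node_3` under the biggest-child rule, unless a usable surrogate split routes it elsewhere first.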

Use System-Missing. Return the system-missing value when scoring a case with a missing value.