Partitions (Nearest Neighbor Analysis)

The Partitions tab allows you to divide the dataset into training and holdout sets and, when applicable, assign cases to cross-validation folds.

Training and Holdout Partitions. This group specifies the method of partitioning the active dataset into training and holdout samples. The training sample comprises the data records used to train the nearest neighbor model; some percentage of cases in the dataset must be assigned to the training sample in order to obtain a model. The holdout sample is an independent set of data records used to assess the final model; the error for the holdout sample gives an "honest" estimate of the predictive ability of the model because the holdout cases were not used to build the model.

  • Randomly assign cases to partitions. Specify the percentage of cases to assign to the training sample. The rest are assigned to the holdout sample.
  • Use variable to assign cases. Specify a numeric variable that assigns each case in the active dataset to the training or holdout sample. Cases with a positive value on the variable are assigned to the training sample; cases with a value of 0 or a negative value are assigned to the holdout sample. Cases with a system-missing value are excluded from the analysis. Any user-missing values for the partition variable are always treated as valid.
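
The assignment rule for a partition variable can be illustrated with a short Python sketch. This is not the procedure's own code: the function name assign_partition is hypothetical, and representing a system-missing value as None or NaN is an assumption made for the illustration.

```python
import math

def assign_partition(value):
    """Map one partition-variable value to a sample.

    Hypothetical helper: system-missing is modeled here as None or NaN.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None  # system-missing: case is excluded from the analysis
    # Positive values go to training; 0 and negative values go to holdout.
    return "training" if value > 0 else "holdout"

values = [1, 0, -2, None, 3.5]
labels = [assign_partition(v) for v in values]
```

In this sketch the fourth case is dropped, the first and last go to the training sample, and the remaining two go to the holdout sample.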

Cross-Validation Folds. V-fold cross-validation is used to determine the "best" number of neighbors. For performance reasons, it is not available in conjunction with feature selection.

Cross-validation divides the sample into a number of subsamples, or folds. Nearest neighbor models are then generated, excluding the data from each subsample in turn. The first model is based on all of the cases except those in the first sample fold, the second model is based on all of the cases except those in the second sample fold, and so on. For each model, the error is estimated by applying the model to the subsample excluded in generating it. The "best" number of nearest neighbors is the one which produces the lowest error across folds.

  • Randomly assign cases to folds. Specify the number of folds that should be used for cross-validation. The procedure randomly assigns cases to folds, numbered from 1 to V, the number of folds.
  • Use variable to assign cases. Specify a numeric variable that assigns each case in the active dataset to a fold. The variable must take values from 1 to V. If any value in this range is missing, either overall or within any split when split files are in effect, an error results. See the topic Split file for more information.
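
The cross-validation scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure's implementation: the function names are hypothetical, the classifier is a plain majority-vote nearest neighbor model with squared Euclidean distance, and the error "across folds" is taken to be the mean of the per-fold misclassification rates.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training rows nearest to x."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def check_fold_variable(folds, v):
    """Raise an error unless every value from 1 to V occurs and nothing else does."""
    missing = set(range(1, v + 1)) - set(folds)
    if missing or any(f not in range(1, v + 1) for f in folds):
        raise ValueError(f"fold variable must cover 1..{v}; missing {sorted(missing)}")

def cv_error(data, folds, v, k):
    """Mean error over folds; each fold is scored by a model built without it."""
    check_fold_variable(folds, v)
    fold_errors = []
    for fold in range(1, v + 1):
        train = [row for row, f in zip(data, folds) if f != fold]
        test = [row for row, f in zip(data, folds) if f == fold]
        wrong = sum(knn_predict(train, x, k) != y for x, y in test)
        fold_errors.append(wrong / len(test))
    return sum(fold_errors) / v

def best_k(data, folds, v, candidates):
    """Candidate number of neighbors with the lowest cross-validated error."""
    return min(candidates, key=lambda k: cv_error(data, folds, v, k))

# Toy data: (features, label) pairs with an explicit fold variable, V = 3.
data = [((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"),
        ((1.0,), "b"), ((1.1,), "b"), ((1.2,), "b")]
folds = [1, 2, 3, 1, 2, 3]
error_k1 = cv_error(data, folds, 3, 1)
chosen = best_k(data, folds, 3, [1, 3])
```

When several candidates tie on error, this sketch returns the first one listed; how the actual procedure breaks ties is not stated here.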

Set seed for Mersenne Twister. Setting a seed allows you to replicate analyses. Using this control is similar to setting the Mersenne Twister as the active generator and specifying a fixed starting point on the Random Number Generators dialog, with the important difference that setting the seed in this dialog will preserve the current state of the random number generator and restore that state after the analysis is complete. See the topic Random Number Generators for more information.
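
The save-and-restore behavior can be illustrated by analogy with Python's standard random module, which also uses a Mersenne Twister generator. The context manager temporary_seed below is a hypothetical name, and the sketch only mirrors the idea described above; it makes no claim about how the dialog is implemented.

```python
import random
from contextlib import contextmanager

@contextmanager
def temporary_seed(seed):
    """Seed the generator for one analysis, then restore the prior state."""
    saved = random.getstate()      # preserve the current generator state
    random.seed(seed)              # fixed starting point for this analysis
    try:
        yield
    finally:
        random.setstate(saved)     # restore the state after the analysis

random.seed(123)
random.random()                    # the session's generator is mid-stream
state_before = random.getstate()

with temporary_seed(2000000):
    first = [random.random() for _ in range(3)]
with temporary_seed(2000000):
    second = [random.random() for _ in range(3)]

state_after = random.getstate()
```

Both runs draw the same numbers because they start from the same seed, and the surrounding session's random stream continues as if neither run had happened.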