CROSSVALIDATION Subcommand (KNN command)

The CROSSVALIDATION subcommand specifies settings for performing V-fold cross-validation to determine the “best” number of neighbors.

  • V-fold cross-validation divides the data into V folds. For a fixed k, it applies nearest neighbor analysis to make predictions on the vth fold, using the other V−1 folds as the training sample, and evaluates the error. This is repeated for each fold v = 1, ..., V, and the V computed errors are averaged. The whole procedure is then repeated for each candidate value of k, and the value with the lowest average error is selected as the optimal k.
  • If several values of k tie for the lowest average error, the smallest of the tied values is selected.
  • Cross-validation is not used when /MODEL NEIGHBORS=FIXED or /MODEL FEATURES=AUTO is in effect.
  • It is invalid to specify both the FOLDS and VARIABLE keywords on the CROSSVALIDATION subcommand.
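
The selection procedure described above can be sketched outside the KNN command in a few lines of Python. This is only an illustration, not the procedure's actual implementation: the function names knn_predict and select_k are hypothetical, the error measure is assumed to be the misclassification rate for a categorical target, and distances are assumed to be Euclidean.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training cases nearest to query.

    train is a list of (features, label) pairs. Squared Euclidean
    distance is an assumption of this sketch.
    """
    nearest = sorted(
        train,
        key=lambda case: sum((a - b) ** 2 for a, b in zip(case[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def select_k(cases, folds, k_values):
    """Return (best_k, average_error_by_k) using V-fold cross-validation.

    cases: list of (features, label) pairs.
    folds: fold number (1..V) for each case, in the same order.
    """
    fold_ids = sorted(set(folds))
    avg_error = {}
    for k in k_values:
        errors = []
        for v in fold_ids:
            # Fold v is held out; the other V-1 folds form the training sample.
            train = [c for c, f in zip(cases, folds) if f != v]
            held_out = [c for c, f in zip(cases, folds) if f == v]
            wrong = sum(knn_predict(train, x, k) != y for x, y in held_out)
            errors.append(wrong / len(held_out))
        avg_error[k] = sum(errors) / len(errors)
    lowest = min(avg_error.values())
    # Ties on the lowest average error go to the smallest k.
    best_k = min(k for k, err in avg_error.items() if err == lowest)
    return best_k, avg_error
```

Taking the minimum over the tied candidates, rather than the first one encountered, keeps the result independent of the order in which the values of k are tried.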

FOLDS Keyword

The FOLDS keyword specifies the number of folds that should be used for cross-validation. The procedure randomly assigns cases to folds, numbered from 1 to the number of folds.

  • Specify an integer greater than 1. The default is 10.
  • The number of folds cannot exceed the number of cases in the training partition. If the value of FOLDS is greater than that number, the number of folds is set to the number of cases in the training partition. If SPLIT FILE is in effect, this limit is applied separately within each split.
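
As a rough illustration of the two rules above, the following Python sketch caps the requested number of folds at the number of training cases and then assigns cases to folds at random. The function assign_folds and its seed parameter are hypothetical, and the balanced (near-equal fold size) assignment is an assumption of the sketch; the text above states only that the assignment is random.

```python
import random


def assign_folds(n_cases, folds=10, seed=None):
    """Return a list of fold numbers (1..V), one per training case.

    folds defaults to 10 and must be an integer greater than 1.
    V is capped at n_cases so that no fold is empty.
    """
    if folds <= 1:
        raise ValueError("FOLDS must be an integer greater than 1")
    v = min(folds, n_cases)
    # Assumption: fold sizes are kept as equal as possible before shuffling.
    fold_ids = [(i % v) + 1 for i in range(n_cases)]
    random.Random(seed).shuffle(fold_ids)
    return fold_ids
```

With SPLIT FILE in effect, the same function would be called once per split, using the number of training cases in that split.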

VARIABLE Keyword

The VARIABLE keyword specifies a variable that assigns each case in the active dataset to a fold from 1 to V.

The variable may not be a dependent variable or any variable specified in the factor or covariate lists on the command line. The variable must be numeric and take values from 1 to V. An error results if any value in this range does not occur in the data or, if SPLIT FILE is in effect, does not occur in every split.
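
The requirement on the fold variable can be checked before running the command with a sketch like the one below. The function check_fold_variable is hypothetical, and the sketch assumes that V is the largest value the variable takes in the active dataset; the text above does not say how V is determined.

```python
def check_fold_variable(fold_values, split_values=None):
    """Return V if every fold number 1..V occurs in every split.

    fold_values: the fold variable, one numeric value per case.
    split_values: optional split identifier per case (SPLIT FILE groups).
    Raises ValueError on a non-integer or non-positive fold number, or
    when some value in 1..V is absent from a split.
    """
    if split_values is None:
        split_values = [None] * len(fold_values)
    # Assumption: V is taken to be the largest fold number observed.
    v = int(max(fold_values))
    required = set(range(1, v + 1))
    present_by_split = {}
    for fold, split in zip(fold_values, split_values):
        if fold != int(fold) or fold < 1:
            raise ValueError(f"invalid fold number: {fold!r}")
        present_by_split.setdefault(split, set()).add(int(fold))
    for split, present in present_by_split.items():
        missing = required - present
        if missing:
            raise ValueError(
                f"fold values {sorted(missing)} do not occur in split {split!r}"
            )
    return v
```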