CROSSVALIDATION Subcommand (KNN command)
The CROSSVALIDATION subcommand specifies settings for performing V-fold cross-validation to determine the “best” number of neighbors.
- V-fold cross-validation divides the data into V folds. Then, for a fixed k, it applies nearest neighbor analysis to make predictions on the vth fold (using the other V−1 folds as the training sample) and evaluates the error. This process is applied in turn to each fold v = 1, ..., V. After all V folds have been processed, the computed errors are averaged. These steps are repeated for various values of k. The value achieving the lowest average error is selected as the optimal value for k.
- If multiple values of k are tied on the lowest average error, then the smallest k among those that are tied is selected.
- Cross-validation is not used when /MODEL NEIGHBORS=FIXED or when /MODEL FEATURES=AUTO.
- It is invalid to specify both the FOLDS and VARIABLE keywords on the CROSSVALIDATION subcommand.
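The selection procedure described above can be sketched in Python. This is an illustrative sketch only, not the procedure's actual implementation: it assumes a categorical target, Euclidean distance, and the misclassification rate as the error measure, and the names knn_predict and select_k are hypothetical.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training cases nearest to query (Euclidean)."""
    nearest = sorted(
        train,
        key=lambda case: sum((a - b) ** 2 for a, b in zip(case[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def select_k(data, folds, k_values):
    """Pick k by V-fold cross-validation.

    data     -- list of (features, label) cases
    folds    -- fold number (1..V) for each case, same order as data
    k_values -- candidate numbers of neighbors
    """
    fold_ids = sorted(set(folds))
    avg_error = {}
    for k in k_values:
        errors = []
        for v in fold_ids:
            # Fold v is held out; the other V-1 folds form the training sample.
            train = [c for c, f in zip(data, folds) if f != v]
            test = [c for c, f in zip(data, folds) if f == v]
            wrong = sum(knn_predict(train, x, k) != y for x, y in test)
            errors.append(wrong / len(test))  # assumed error: misclassification rate
        avg_error[k] = sum(errors) / len(errors)
    lowest = min(avg_error.values())
    # Ties on the lowest average error are resolved in favor of the smallest k.
    best_k = min(k for k, e in avg_error.items() if e == lowest)
    return best_k, avg_error


# Two well-separated groups: every candidate k predicts perfectly, so k=1 wins the tie.
data = [((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"), ((0.3,), "a"),
        ((5.0,), "b"), ((5.1,), "b"), ((5.2,), "b"), ((5.3,), "b")]
folds = [1, 2, 1, 2, 1, 2, 1, 2]
best_k, avg_error = select_k(data, folds, [3, 1])
```

In this toy example both candidates reach an average error of zero, so the tie rule returns the smaller value of k.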
FOLDS Keyword
The FOLDS keyword specifies the number of folds that should be used for cross-validation. The procedure randomly assigns cases to folds, numbered from 1 to the number of folds.
- Specify an integer greater than 1. The default is 10.
- For a given training set, the upper limit to the number of folds is the number of cases. If the value of FOLDS is greater than the number of cases in the training partition (and for any split, if SPLIT FILE is in effect), then the number of folds is set to the number of cases in the training partition (for that split).
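The capping rule and the random assignment can be illustrated with a short Python sketch. The function name assign_folds and the seed parameter are hypothetical, and the balanced shuffle used here is an assumption: the documentation states only that cases are assigned to folds at random.

```python
import random


def assign_folds(n_cases, folds=10, seed=None):
    """Return a fold number (1..V) for each of n_cases training cases."""
    if folds < 2:
        raise ValueError("FOLDS must be an integer greater than 1")
    # The number of folds cannot exceed the number of training cases.
    v = min(folds, n_cases)
    # Assumed scheme: cycle through 1..V so folds are nearly equal in size,
    # then shuffle so that the assignment of cases to folds is random.
    fold_ids = [(i % v) + 1 for i in range(n_cases)]
    random.Random(seed).shuffle(fold_ids)
    return fold_ids


# With only 4 training cases, the default of 10 folds is reduced to 4.
small = assign_folds(4, folds=10, seed=0)
```

When SPLIT FILE is in effect, the same logic would be applied separately to the training partition of each split.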
VARIABLE Keyword
The VARIABLE keyword specifies a variable that assigns each case in the active dataset to a fold from 1 to V.
The variable may not be a dependent variable or any variable specified on the factor or covariate lists of the command line. The variable must be numeric and take values from 1 to V. If any value in this range is missing in any split (if SPLIT FILE is in effect), an error results.
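The requirements on the fold variable can be checked before running the command. The following Python sketch is illustrative only: check_fold_variable is a hypothetical helper, and taking V to be the largest observed value is an assumption, since the documentation does not say how V is determined.

```python
def check_fold_variable(values, splits=None):
    """Validate a fold variable and return V.

    values -- fold value for each case
    splits -- optional split identifier for each case (SPLIT FILE in effect)
    """
    if splits is None:
        splits = [0] * len(values)  # treat the whole dataset as a single split
    for val in values:
        # The variable must be numeric; fold numbers must be whole and >= 1.
        if isinstance(val, bool) or not isinstance(val, (int, float)):
            raise ValueError(f"non-numeric fold value: {val!r}")
        if not float(val).is_integer() or val < 1:
            raise ValueError(f"fold value out of range: {val!r}")
    v = int(max(values))  # assumption: V is the largest fold number observed
    seen_by_split = {}
    for val, split in zip(values, splits):
        seen_by_split.setdefault(split, set()).add(int(val))
    expected = set(range(1, v + 1))
    for split, seen in seen_by_split.items():
        missing = expected - seen
        if missing:
            # Every fold from 1 to V must occur within every split.
            raise ValueError(f"split {split!r} has no cases in folds {sorted(missing)}")
    return v


n_folds = check_fold_variable([1, 2, 3, 3, 2, 1])
```

Here all three folds are present, so the check passes and returns 3. A split that contains, for example, only folds 1 and 3 would raise an error.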