CROSSVALIDATION Subcommand (KNN command)
The CROSSVALIDATION subcommand specifies settings for performing V-fold cross-validation to determine the “best” number of neighbors.
- V-fold cross-validation divides the data into V folds. Then, for a fixed k, it applies nearest neighbor analysis to make predictions on the vth fold (using the other V−1 folds as the training sample) and evaluates the error. This process is applied in turn to every fold v = 1, ..., V. After all V folds have been processed, the computed errors are averaged. These steps are repeated for various values of k. The value achieving the lowest average error is selected as the optimal value for k.
- If multiple values of k are tied for the lowest average error, the smallest of the tied values is selected.
- Cross-validation is not used when /MODEL NEIGHBORS=FIXED or when /MODEL FEATURES=AUTO.
- It is invalid to specify both the FOLDS and VARIABLE keywords on the CROSSVALIDATION subcommand.
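The selection procedure described in the list above can be sketched in Python. This is an illustration only, not the procedure's actual implementation: the function names `knn_predict` and `select_k` are hypothetical, the error measure is assumed to be the misclassification rate for a categorical target, and distances are assumed to be Euclidean.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Predict a label by majority vote among the k nearest training cases.

    train is a list of (features, label) pairs; Euclidean distance is assumed.
    """
    nearest = sorted(
        train,
        key=lambda case: sum((a - b) ** 2 for a, b in zip(case[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def select_k(cases, folds, k_values):
    """Return (best_k, average_error_by_k) using V-fold cross-validation.

    cases: list of (features, label) pairs.
    folds: fold number (1..V) for each case, in the same order as cases.
    k_values: candidate numbers of neighbors.
    """
    fold_ids = sorted(set(folds))
    average_error = {}
    for k in k_values:
        fold_errors = []
        for v in fold_ids:
            # Fold v is held out; the other V-1 folds form the training sample.
            train = [c for c, f in zip(cases, folds) if f != v]
            test = [c for c, f in zip(cases, folds) if f == v]
            wrong = sum(knn_predict(train, x, k) != y for x, y in test)
            # Assumed error measure: misclassification rate on the held-out fold.
            fold_errors.append(wrong / len(test))
        average_error[k] = sum(fold_errors) / len(fold_errors)

    lowest = min(average_error.values())
    # Ties on the lowest average error are resolved in favor of the smallest k.
    best_k = min(k for k, err in average_error.items() if err == lowest)
    return best_k, average_error
```

Calling `select_k(cases, folds, [3, 4, 5])` returns the chosen k together with the average error for each candidate, so that ties can be inspected.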
FOLDS Keyword
The FOLDS keyword specifies the number of folds that should be used for cross-validation. The procedure randomly assigns cases to folds, numbered from 1 to the number of folds.
- Specify an integer greater than 1. The default is 10.
- For a given training set, the upper limit to the number of folds is the number of cases. If the value of FOLDS is greater than the number of cases in the training partition (and for any split, if SPLIT FILE is in effect), then the number of folds is set to the number of cases in the training partition (for that split).
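The cap on the number of folds can be illustrated with a short Python sketch. The function names `effective_folds` and `assign_folds` are hypothetical, and the balanced shuffle used for assignment is an assumption: the text states only that cases are assigned to folds at random.

```python
import random


def effective_folds(requested_folds, n_training_cases):
    """Return the number of folds actually used for one training partition."""
    if requested_folds < 2:
        raise ValueError("FOLDS must be an integer greater than 1")
    # The number of folds cannot exceed the number of training cases.
    return min(requested_folds, n_training_cases)


def assign_folds(n_training_cases, requested_folds=10, seed=None):
    """Randomly assign each training case a fold number from 1 to V.

    Assumption: folds are kept as equal in size as possible by cycling
    through 1..V and then shuffling the resulting list.
    """
    v = effective_folds(requested_folds, n_training_cases)
    fold_ids = [(i % v) + 1 for i in range(n_training_cases)]
    random.Random(seed).shuffle(fold_ids)
    return fold_ids
```

With the default of 10 folds and only 6 training cases, `effective_folds(10, 6)` yields 6, so each case ends up in its own fold. When SPLIT FILE is in effect, the same cap would be applied separately within each split.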
VARIABLE Keyword
The VARIABLE keyword specifies a variable that assigns each case in the active dataset to a fold from 1 to V. The variable may not be a dependent variable or any variable specified on the command's factor or covariate lists. The variable must be numeric and take values from 1 to V. If any value in this range is missing on any split (if SPLIT FILE is in effect), an error occurs.
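The requirements on the fold variable can be expressed as a small validation routine. The following Python sketch is hypothetical (`check_fold_variable` is not part of any product API) and rests on two assumptions: V is taken to be the largest value of the variable, and fold values are expected to be whole numbers.

```python
from collections import defaultdict


def check_fold_variable(fold_values, split_values=None):
    """Return V if the fold variable is usable; raise ValueError otherwise.

    fold_values: the fold variable's value for each case.
    split_values: optional split-group identifier for each case
                  (stands in for SPLIT FILE being in effect).
    """
    if not fold_values:
        raise ValueError("no cases to assign to folds")
    if split_values is None:
        split_values = [None] * len(fold_values)

    for value in fold_values:
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not float(value).is_integer() or value < 1:
            raise ValueError(f"invalid fold value: {value!r}")

    # Assumption: V is the largest fold number present in the data.
    v = int(max(fold_values))
    expected = set(range(1, v + 1))

    present_by_split = defaultdict(set)
    for fold, split in zip(fold_values, split_values):
        present_by_split[split].add(int(fold))

    # Every value from 1 to V must occur within every split.
    for split, present in present_by_split.items():
        missing = sorted(expected - present)
        if missing:
            raise ValueError(f"split {split!r} has no cases in folds {missing}")
    return v
```

For example, fold values 1, 2, 3 within one split and only 1, 2 within another would be rejected, because fold 3 has no cases in the second split.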