Overview (KNN command)
Nearest Neighbor Analysis is a method for classifying cases based on their similarity to other cases. In machine learning, it was developed as a way to recognize patterns of data without requiring an exact match to any stored patterns, or cases. Similar cases are near each other and dissimilar cases are distant from each other. Thus, the distance between two cases is a measure of their dissimilarity.
Cases that are near each other are said to be “neighbors.” When a new case (holdout) is presented, its distance from each of the cases in the model is computed. The classifications of the most similar cases – the nearest neighbors – are tallied and the new case is placed into the category that contains the greatest number of nearest neighbors.
You can specify the number of nearest neighbors to examine; this value is called k. To see how the choice of k affects the result, consider a new case classified using two different values of k: when k = 5, the new case is placed in category 1 because a majority of its five nearest neighbors belong to category 1; however, when k = 9, the new case is placed in category 0 because a majority of its nine nearest neighbors belong to category 0.
Nearest neighbor analysis can also be used to compute values for a continuous target. In this situation, the average or median target value of the nearest neighbors is used to obtain the predicted value for the new case.
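For example, if k = 3 and the three nearest neighbors of a new case have target values 10, 12, and 14, the mean-based prediction for the new case is (10 + 12 + 14)/3 = 12.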
Options
Prediction or classification. The dependent variable may have scale or categorical measurement level. If the dependent variable has scale measurement level, then the model predicts continuous values that approximate the “true” value of some continuous function of the input data. If the dependent variable is categorical, then the model is used to classify cases into the “best” category based on the input predictors.
Rescaling. KNN optionally rescales covariates (that is, predictors with scale measurement level) before training the model. Adjusted normalization is the rescaling method.
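Adjusted normalization rescales each covariate value x into the interval [−1, 1] as [2(x−min)/(max−min)]−1, where min and max are the observed minimum and maximum of the covariate. For example, a value of 30 on a covariate ranging from 20 to 60 rescales to [2(30−20)/(60−20)]−1 = −0.5.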
Training and holdout partitions. KNN optionally divides the data set into training and holdout partitions. The model is trained using the training partition. The holdout partition is completely excluded from the training process and is used for independent assessment of the final model.
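For example, the following sketch (the variable names are hypothetical, and it assumes the PARTITION subcommand's TRAINING and HOLDOUT keywords take relative weights) reserves roughly 30% of cases for holdout:

KNN outcome BY region WITH age income
  /PARTITION TRAINING=70 HOLDOUT=30.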
Missing Values. The KNN procedure has an option for treating user-missing values of categorical variables as valid. User-missing values of scale variables are always treated as invalid. The procedure uses listwise deletion; that is, cases with invalid values for any variable are excluded from the model.
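For example, this sketch (hypothetical variable names; the USERMISSING keyword of the MISSING subcommand is assumed from the syntax chart) requests that user-missing values of categorical variables be treated as valid:

KNN outcome BY region WITH age
  /MISSING USERMISSING=INCLUDE.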
Output. KNN displays a case processing summary as pivot table output, and an interactive model view of other output. Tables in the model view include k nearest neighbors and distances for focal cases, classification of categorical response variables, and an error summary. Graphical output in the model view includes an automatic selection error log, feature importance chart, feature space chart, peers chart, and quadrant map. The procedure also optionally saves predicted values in the active dataset, PMML to an external file, and distances to focal cases to a new dataset or external file.
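For instance, a minimal sketch (hypothetical variable names; the PREDVAL keyword of the SAVE subcommand is assumed) that saves predicted values as a new variable in the active dataset:

KNN outcome BY region WITH age
  /SAVE PREDVAL.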
Basic Specification
The basic specification is the KNN command followed by zero or one dependent variable, the BY keyword and one or more factors, and the WITH keyword and one or more covariates.
By default, the KNN procedure normalizes covariates and selects a training sample before training the model. The model uses Euclidean distance to select the three nearest neighbors. User-missing values are excluded and default output is displayed.
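For example, the following command (the variable names are hypothetical) classifies outcome from one factor and two covariates using all of the defaults just described:

KNN outcome BY region WITH age income.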
If no dependent variable is specified, then the procedure finds the k nearest neighbors only – no classification or prediction is done.
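For instance, omitting the dependent variable, as in

KNN WITH age income.

finds each case's k nearest neighbors in the covariate space without classifying or predicting anything (again, the variable names are hypothetical).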
Syntax Rules
- All subcommands are optional.
- Subcommands may be specified in any order.
- Only a single instance of each subcommand is allowed.
- An error occurs if a keyword is specified more than once within a subcommand.
- Parentheses, equals signs, and slashes shown in the syntax chart are required.
- The command name, subcommand names, and keywords must be spelled in full.
- Empty subcommands are not allowed.
- Any split variable defined on the SPLIT FILE command may not be used as a dependent variable, factor, covariate, or partition variable.
Limitations
Frequency weights specified on the WEIGHT command are ignored with a warning by the KNN procedure.
Categorical Variables
Although the KNN procedure accepts categorical variables as predictors or dependent variables, the user should be cautious when using a categorical variable with a very large number of categories.
The KNN procedure temporarily recodes categorical predictors using one-of-c coding for the duration of the procedure. If there are c categories of a variable, then the variable is stored as c vectors, with the first category denoted (1,0,...,0), the next category (0,1,0,...,0), ..., and the final category (0,0,...,0,1).
This coding scheme increases the dimensionality of the feature space. In particular, the total number of dimensions is the number of scale predictors plus the number of categories across all categorical predictors. As a result, this coding scheme can lead to slower training. If your nearest neighbors training is proceeding very slowly, you might try reducing the number of categories in your categorical predictors by combining similar categories or dropping cases that have extremely rare categories before running the KNN procedure.
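As a worked example, suppose a model has two scale predictors plus two categorical predictors with 3 and 12 categories: one-of-c coding produces a feature space with 2 + 3 + 12 = 17 dimensions.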
All one-of-c coding is based on the training data, even if a holdout sample is defined (see PARTITION Subcommand (KNN command)). Thus, if the holdout sample contains cases with predictor categories that are not present in the training data, then those cases are not scored. If the holdout sample contains cases with dependent variable categories that are not present in the training data, then those cases are scored.
Replicating Results
The KNN procedure uses random number generation during random assignment of partitions and cross-validation folds. To reproduce the same randomized results in the future, use the SET command to set the initialization value for the random number generator before each run of the KNN procedure, or use variables to define partitions and cross-validation folds.
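For example (hypothetical variable names and seed value; SET SEED sets the initialization value of the default random number generator):

SET SEED=20070525.
KNN outcome BY region WITH age income
  /PARTITION TRAINING=70 HOLDOUT=30.

Rerunning this block with the same seed reproduces the same random assignment of cases to partitions.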