Overview (KNN command)

Nearest Neighbor Analysis is a method for classifying cases based on their similarity to other cases. In machine learning, it was developed as a way to recognize patterns in data without requiring an exact match to any stored patterns, or cases. Similar cases are near each other and dissimilar cases are distant from each other. Thus, the distance between two cases is a measure of their dissimilarity.

Cases that are near each other are said to be “neighbors.” When a new case is presented, its distance from each of the cases in the model is computed. The classifications of the most similar cases, the nearest neighbors, are tallied, and the new case is placed into the category that contains the greatest number of nearest neighbors.

You can specify the number of nearest neighbors to examine; this value is called k. The choice of k can change the classification. For example, suppose that with k = 5 a new case is placed in category 1 because a majority of its five nearest neighbors belong to category 1, but with k = 9 the same case is placed in category 0 because a majority of its nine nearest neighbors belong to category 0.
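
In command syntax, k is set with the NEIGHBORS keyword on the MODEL subcommand. A minimal sketch, assuming a nominal dependent variable named category and two hypothetical covariates x1 and x2:

KNN category (MLEVEL=N) WITH x1 x2
  /MODEL NEIGHBORS=FIXED(5).

NEIGHBORS=AUTO instead lets the procedure choose k within a range by cross-validation.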

Nearest neighbor analysis can also be used to compute values for a continuous target. In this situation, the average or median target value of the nearest neighbors is used to obtain the predicted value for the new case.

Options

Prediction or classification. The dependent variable may have scale or categorical measurement level. If the dependent variable has scale measurement level, then the model predicts continuous values that approximate the “true” value of some continuous function of the input data. If the dependent variable is categorical, then the model is used to classify cases into the “best” category based on the input predictors.
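
Which of the two is done follows from the dependent variable's measurement level, which can be overridden on the command with MLEVEL. A sketch, assuming hypothetical variables price (scale), category (nominal), a factor region, and covariates x1 and x2:

* Prediction of a scale target.
KNN price (MLEVEL=S) WITH x1 x2.

* Classification of a categorical target.
KNN category (MLEVEL=N) BY region WITH x1 x2.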

Rescaling. KNN optionally rescales covariates (that is, predictors with scale measurement level) before training the model. The rescaling method is adjusted normalization, [2*(x-min)/(max-min)]-1, which maps each covariate into the interval [-1, 1].
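
Rescaling is controlled by the RESCALE subcommand; adjusted normalization is the default, and COVARIATE=NONE turns rescaling off. A sketch with hypothetical variable names:

KNN category (MLEVEL=N) WITH x1 x2
  /RESCALE COVARIATE=NONE.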

Training and holdout partitions. KNN optionally divides the data set into training and holdout partitions. The model is trained using the training partition. The holdout partition is completely excluded from the training process and is used for independent assessment of the final model.
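
The split is requested on the PARTITION subcommand, either as relative numbers of cases or through an existing variable. A sketch requesting a 70/30 split (variable names hypothetical):

KNN category (MLEVEL=N) WITH x1 x2
  /PARTITION TRAINING=70 HOLDOUT=30.

PARTITION VARIABLE=varname instead assigns cases to the two samples based on the values of an existing variable, which also makes the split reproducible (see Replicating Results).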

Missing Values. The KNN procedure has an option for treating user-missing values of categorical variables as valid. User-missing values of scale variables are always treated as invalid. The procedure uses listwise deletion; that is, cases with invalid values for any variable are excluded from the model.
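
This treatment is chosen on the MISSINGVALUES subcommand. A sketch that treats user-missing values of categorical variables as valid (variable names hypothetical):

KNN category (MLEVEL=N) BY region WITH x1 x2
  /MISSINGVALUES USERMISSING=INCLUDE.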

Output. KNN displays a case processing summary as pivot table output, and all other output in an interactive model view. Tables in the model view include the k nearest neighbors and distances for focal cases, the classification table for a categorical dependent variable, and an error summary. Graphical output in the model view includes an automatic selection error log, a feature importance chart, a feature space chart, a peers chart, and a quadrant map. The procedure also optionally saves predicted values in the active dataset, the model as PMML to an external file, and distances to focal cases to a new dataset or external file.
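
The optional saves are requested on the SAVE and OUTFILE subcommands. A sketch that saves predicted values and exports the model (file name hypothetical):

KNN category (MLEVEL=N) WITH x1 x2
  /SAVE PREDVAL
  /OUTFILE MODEL='/models/knn_category.xml'.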

Basic Specification

The basic specification is the KNN command followed by zero or one dependent variable, the BY keyword and one or more factors, and the WITH keyword and one or more covariates.

By default, the KNN procedure normalizes covariates and selects a training sample before training the model. The model uses Euclidean distance to select the three nearest neighbors. User-missing values are excluded and default output is displayed.
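
A minimal sketch with hypothetical variable names; the second command spells out several of these defaults explicitly:

KNN category (MLEVEL=N) BY region WITH x1 x2.

* Equivalent, with several defaults written out.
KNN category (MLEVEL=N) BY region WITH x1 x2
  /RESCALE COVARIATE=ADJNORMALIZED
  /MODEL METRIC=EUCLID NEIGHBORS=FIXED(3)
  /MISSINGVALUES USERMISSING=EXCLUDE.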

If no dependent variable is specified, then the procedure finds the k nearest neighbors only; no classification or prediction is done.
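
A sketch of such a neighbors-only run, assuming a hypothetical flag variable focal that marks the cases of interest:

KNN WITH x1 x2
  /FOCALCASES VARIABLE=focal.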

Note: Measurement level can affect the results. If any variables (fields) have an unknown measurement level, a data pass is performed to determine the measurement level before the analysis begins. For information on the determination criteria, see SET SCALEMIN.

Syntax Rules

  • All subcommands are optional.
  • Subcommands may be specified in any order.
  • Only a single instance of each subcommand is allowed.
  • An error occurs if a keyword is specified more than once within a subcommand.
  • Parentheses, equals signs, and slashes shown in the syntax chart are required.
  • The command name, subcommand names, and keywords must be spelled in full.
  • Empty subcommands are not allowed.
  • A split variable defined on the SPLIT FILE command cannot be used as a dependent variable, factor, covariate, or partition variable.

Limitations

Frequency weights specified on the WEIGHT command are ignored by the KNN procedure, with a warning.

Categorical Variables

Although the KNN procedure accepts categorical variables as predictors or dependent variables, use caution with a categorical variable that has a very large number of categories.

The KNN procedure temporarily recodes categorical predictors using one-of-c coding for the duration of the procedure. If there are c categories of a variable, then the variable is stored as c vectors, with the first category denoted (1,0,...,0), the next category (0,1,0,...,0), ..., and the final category (0,0,...,0,1).

This coding scheme increases the dimensionality of the feature space. In particular, the total number of dimensions is the number of scale predictors plus the number of categories across all categorical predictors; for example, two scale predictors and one categorical predictor with 100 categories yield a feature space of 102 dimensions. As a result, this coding scheme can lead to slower training. If nearest neighbor training is proceeding very slowly, you might try reducing the number of categories in your categorical predictors by combining similar categories or dropping cases that have extremely rare categories before running the KNN procedure.

All one-of-c coding is based on the training data, even if a holdout sample is defined (see PARTITION Subcommand (KNN command)). Thus, if the holdout sample contains cases with predictor categories that are not present in the training data, then those cases are not scored. If the holdout sample contains cases with dependent variable categories that are not present in the training data, then those cases are scored.

Replicating Results

The KNN procedure uses random number generation during random assignment of partitions and cross-validation folds. To reproduce the same randomized results in the future, use the SET command to set the initialization value for the random number generator before each run of the KNN procedure, or use variables to define partitions and cross-validation folds.
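
Both approaches can be sketched in syntax (seed value and variable names hypothetical):

* Fix the random number generator before each run.
SET SEED=123456789.
KNN category (MLEVEL=N) WITH x1 x2
  /PARTITION TRAINING=70 HOLDOUT=30.

* Or assign partitions and folds with existing variables.
KNN category (MLEVEL=N) WITH x1 x2
  /MODEL NEIGHBORS=AUTO
  /PARTITION VARIABLE=part
  /CROSSVALIDATION VARIABLE=fold.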