Neighbors

The Neighbors panel contains options that control how the number of nearest neighbors is chosen and how distances between cases are computed.

Number of Nearest Neighbors (k). Specify the number of nearest neighbors for a particular case. Note that using a greater number of neighbors will not necessarily result in a more accurate model.

If the objective is to predict a target, you have two choices:

  • Specify fixed k. Use this option if you want to specify a fixed number of nearest neighbors to find.
  • Automatically select k. You can alternatively use the Minimum and Maximum fields to specify a range of values and allow the procedure to choose the "best" number of neighbors within that range. The method for determining the number of nearest neighbors depends upon whether feature selection is requested on the Feature Selection panel:

    If feature selection is in effect, then feature selection is performed for each value of k in the requested range. The value of k, together with its accompanying feature set, that gives the lowest error rate (or the lowest sum-of-squares error if the target is continuous) is selected.

    If feature selection is not in effect, then V-fold cross-validation is used to select the "best" number of neighbors. Use the Cross-validation panel to control how cases are assigned to folds.
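As an illustration of the cross-validation case only, the following Python sketch picks k from a range by V-fold cross-validation for a categorical target. It is a simplified approximation, not the procedure's actual implementation: the function names (knn_predict, cv_error, select_k), the round-robin fold assignment, the majority vote, and the rule of taking the smallest k on ties are all assumptions made for this example.

```python
import math
from collections import Counter

def euclidean(x, y):
    # Square root of the summed squared differences over all dimensions.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train_X, train_y, query, k):
    # Majority vote among the k training cases closest to the query (assumed rule).
    nearest = sorted(range(len(train_X)),
                     key=lambda i: euclidean(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def cv_error(X, y, k, folds):
    # Error rate over all cases, each predicted from the folds it is not in.
    # Fold assignment here is round-robin by case index (an assumption).
    errors = 0
    for f in range(folds):
        train_idx = [i for i in range(len(X)) if i % folds != f]
        test_idx = [i for i in range(len(X)) if i % folds == f]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            if knn_predict(train_X, train_y, X[i], k) != y[i]:
                errors += 1
    return errors / len(X)

def select_k(X, y, k_min, k_max, folds):
    # min() returns the first minimum, so ties go to the smallest k (an assumption).
    return min(range(k_min, k_max + 1), key=lambda k: cv_error(X, y, k, folds))

# Made-up data: two well-separated groups of four cases each.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
best_k = select_k(X, y, k_min=1, k_max=5, folds=4)
print(best_k)
```

For a continuous target, the same loop would compare sum-of-squares error instead of the error rate.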

Distance Computation. Specifies the distance metric used to measure the similarity of cases.

  • Euclidean metric. The distance between two cases, x and y, is the square root of the sum, over all dimensions, of the squared differences between the values for the cases.
  • City Block metric. The distance between two cases is the sum, over all dimensions, of the absolute differences between the values for the cases. Also called Manhattan distance.
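The two metrics can be written out directly from the definitions above. This is a minimal Python sketch; the function names euclidean and city_block are chosen for this example and are not part of the product.

```python
import math

def euclidean(x, y):
    # Square root of the sum, over all dimensions, of squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    # Sum, over all dimensions, of absolute differences (Manhattan distance).
    return sum(abs(a - b) for a, b in zip(x, y))

# Made-up cases with three features each.
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
print(euclidean(x, y))   # 5.0
print(city_block(x, y))  # 7.0
```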

Optionally, if the objective is to predict a target, you can choose to weight features by their normalized importance when computing distances. Feature importance for a predictor is calculated as the ratio of the error rate (or sum-of-squares error) of the model with that predictor removed to the error rate (or sum-of-squares error) of the full model. Normalized importance is calculated by rescaling the feature importance values so that they sum to 1.
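The calculation can be sketched in Python as follows. The error values are made up for illustration, and the function names are hypothetical. How the weights enter the distance is not stated in this topic; the weighted Euclidean form shown here, with each squared difference multiplied by its feature's weight, is an assumption.

```python
import math

def normalized_importance(full_error, errors_without):
    # Importance = error with the predictor removed / error of the full model.
    raw = {name: err / full_error for name, err in errors_without.items()}
    total = sum(raw.values())
    # Rescale so that the importance values sum to 1.
    return {name: value / total for name, value in raw.items()}

def weighted_euclidean(x, y, weights):
    # Assumed form: each squared difference is scaled by its feature weight.
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

# Made-up error rates: full model, and model with each predictor removed.
full_error = 0.10
errors_without = {"age": 0.30, "income": 0.20, "region": 0.10}

weights = normalized_importance(full_error, errors_without)
print(weights)  # roughly age 0.5, income 0.33, region 0.17

# Distance between two made-up cases, features ordered as in `weights`.
d = weighted_euclidean([1.0, 2.0, 3.0], [2.0, 4.0, 3.0], list(weights.values()))
print(d)
```

A predictor whose removal raises the error the most receives the largest weight, so differences on that predictor contribute most to the distance.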

Weight features by importance when computing distances. (Displayed only if the objective is to predict a target.) Select this check box to use predictor importance when calculating the distances between neighbors. Predictor importance is then displayed in the model nugget and used in predictions, so it also affects scoring. See the topic Predictor Importance for more information.

Predictions for Range Target. (Displayed only if the objective is to predict a target.) If a continuous (numeric range) target is specified, this defines whether the predicted value is computed based upon the mean or the median value of the nearest neighbors.
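The difference between the two settings is easiest to see when one neighbor has an extreme target value. The following Python sketch uses made-up neighbor values; predict_continuous is a hypothetical helper written for this example.

```python
import statistics

def predict_continuous(neighbor_targets, method="mean"):
    # Combine the target values of the nearest neighbors into one prediction.
    if method == "mean":
        return statistics.mean(neighbor_targets)
    if method == "median":
        return statistics.median(neighbor_targets)
    raise ValueError("method must be 'mean' or 'median'")

# Made-up target values of the five nearest neighbors; 40.0 is an outlier.
targets = [10.0, 12.0, 11.0, 13.0, 40.0]
print(predict_continuous(targets, "mean"))    # 17.2
print(predict_continuous(targets, "median"))  # 12.0
```

The median is less sensitive to a single outlying neighbor than the mean.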