Neighbors (Nearest Neighbor Analysis)

Number of Nearest Neighbors (k). Specify the number of nearest neighbors. Note that using a greater number of neighbors will not necessarily result in a more accurate model.

If a target is specified on the Variables tab, you can alternatively specify a range of values and allow the procedure to choose the "best" number of neighbors within that range. The method for determining the number of nearest neighbors depends upon whether feature selection is requested on the Features tab.

  • If feature selection is in effect, then feature selection is performed for each value of k in the requested range. The value of k, together with its accompanying feature set, that yields the lowest error rate (or the lowest sum-of-squares error if the target is scale) is selected.
  • If feature selection is not in effect, then V-fold cross-validation is used to select the “best” number of neighbors. See the Partition tab for control over assignment of folds.
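
The cross-validation case can be sketched as follows. This is a minimal illustration for a categorical target, not the procedure's actual implementation. The function names, the precomputed fold assignment list, and the rule of breaking ties in favor of the smaller k are all assumptions made for the example.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Predict a category by majority vote among the k nearest training cases."""
    # Rank training cases by squared Euclidean distance to the query case.
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_error_rate(xs, ys, folds, k):
    """Error rate for one value of k, holding out each fold in turn."""
    errors = 0
    for fold in sorted(set(folds)):
        train_idx = [i for i, f in enumerate(folds) if f != fold]
        test_idx = [i for i, f in enumerate(folds) if f == fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            if knn_predict(train_x, train_y, xs[i], k) != ys[i]:
                errors += 1
    return errors / len(xs)


def select_k(xs, ys, folds, k_min, k_max):
    """Return the k in [k_min, k_max] with the lowest cross-validated error rate."""
    rates = {k: cv_error_rate(xs, ys, folds, k) for k in range(k_min, k_max + 1)}
    # Assumption: ties are broken in favor of the smaller k.
    best_k = min(rates, key=lambda k: (rates[k], k))
    return best_k, rates


# Hypothetical data: two well-separated groups, two folds.
xs = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]]
ys = ["a", "a", "a", "a", "b", "b", "b", "b"]
folds = [0, 1, 0, 1, 0, 1, 0, 1]
best_k, rates = select_k(xs, ys, folds, 1, 3)
```

For a scale target, the same loop would accumulate squared prediction errors instead of counting misclassifications.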

Distance Computation. This specifies the distance metric used to measure the similarity of cases.

  • Euclidean metric. The distance between two cases, x and y, is the square root of the sum, over all dimensions, of the squared differences between the values for the cases.
  • City block metric. The distance between two cases is the sum, over all dimensions, of the absolute differences between the values for the cases. Also called Manhattan distance.
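
The two metrics above translate directly into code. The sketch below assumes each case is a sequence of numeric feature values of equal length. The function names are chosen for the example only.

```python
import math


def euclidean(x, y):
    """Square root of the sum of squared differences over all dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def city_block(x, y):
    """Sum of absolute differences over all dimensions (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Two hypothetical cases measured on two features.
x = [0.0, 0.0]
y = [3.0, 4.0]
d_euclidean = euclidean(x, y)    # 5.0
d_city_block = city_block(x, y)  # 7.0
```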

Optionally, if a target is specified on the Variables tab, you can choose to weight features by their normalized importance when computing distances. Feature importance for a predictor is calculated as the ratio of the error rate (or sum-of-squares error) of the model with the predictor removed to the error rate (or sum-of-squares error) of the full model. Normalized importance is calculated by rescaling the feature importance values so that they sum to 1.
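
The importance calculation can be illustrated with a short sketch. The function names and error values are hypothetical. The text does not state exactly how the weights enter the distance formula, so multiplying each squared difference by its weight is an assumption of this example.

```python
import math


def normalized_importance(full_error, errors_without):
    """Ratio of each reduced-model error to the full-model error, rescaled to sum to 1."""
    raw = {name: err / full_error for name, err in errors_without.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def weighted_euclidean(x, y, weights):
    """Euclidean distance with one weight per feature.

    Assumption: each squared difference is multiplied by its feature weight.
    """
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


# Hypothetical errors: full model 2.0; 4.0 without "age"; 6.0 without "income".
importance = normalized_importance(2.0, {"age": 4.0, "income": 6.0})
# Raw ratios are 2.0 and 3.0, so the normalized values are about 0.4 and 0.6.
weights = [importance["age"], importance["income"]]
distance = weighted_euclidean([0.0, 0.0], [3.0, 4.0], weights)
```

A predictor whose removal raises the error more receives a larger weight, so differences on that feature contribute more to the distance.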

Predictions for Scale Target. If a scale target is specified on the Variables tab, this specifies whether the predicted value is computed based upon the mean or the median value of the nearest neighbors.
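
As a brief illustration of the difference between the two options, the sketch below computes both predictions from the target values of the nearest neighbors. The function name and the sample values are hypothetical.

```python
from statistics import mean, median


def predict_scale(neighbor_targets, method="mean"):
    """Predicted value from the target values of the k nearest neighbors."""
    if method == "mean":
        return mean(neighbor_targets)
    if method == "median":
        return median(neighbor_targets)
    raise ValueError("method must be 'mean' or 'median'")


# Target values of three hypothetical nearest neighbors, one of them extreme.
targets = [1.0, 2.0, 9.0]
by_mean = predict_scale(targets, "mean")      # 4.0
by_median = predict_scale(targets, "median")  # 2.0
```

The median is less affected by a single neighbor with an extreme target value.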