IBM Data WH KNN Model Options - General
On the Model Options - General tab, you can choose whether to specify a name for the model, or generate a name automatically. You can also set options that control how the number of nearest neighbors is calculated, and set options for enhanced performance and accuracy of the model.
Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Neighbors
Distance measure. The method to be used for measuring the distance between data points; greater distances indicate greater dissimilarities. The options are:
- Euclidean. (default) The distance between two points is the length of the straight line that joins them.
- Manhattan. The distance between two points is calculated as the sum of the absolute differences between their coordinates.
- Canberra. Similar to Manhattan distance, but more sensitive to data points closer to the origin.
- Maximum. The distance between two points is calculated as the greatest absolute difference between them along any coordinate dimension.
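As an illustration only, the four distance measures described above can be sketched in plain Python. The function names are hypothetical and are not part of the product; the handling of coordinates that are both zero in the Canberra measure is a convention assumed here, not something this documentation specifies.

```python
import math

def euclidean(a, b):
    # Length of the straight line between a and b.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of the absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    # Each absolute difference is divided by |x| + |y|, so differences
    # near the origin weigh more. Terms where both coordinates are zero
    # are skipped (an assumed convention to avoid dividing by zero).
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:
            total += abs(x - y) / denom
    return total

def maximum(a, b):
    # Greatest absolute difference along any single dimension.
    return max(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q), manhattan(p, q), canberra(p, q), maximum(p, q))
```

The same one-unit gap counts for far more under the Canberra measure near the origin: canberra((1,), (2,)) is 1/3, whereas canberra((101,), (102,)) is 1/203.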
Number of Nearest Neighbors (k). The number of nearest neighbors to consider for a particular case. Note that using a greater number of neighbors will not necessarily result in a more accurate model.
The choice of k controls the balance between the prevention of overfitting (this may be important, particularly for "noisy" data) and resolution (yielding different predictions for similar instances). You will usually have to adjust the value of k for each data set, with typical values ranging from 1 to several dozen.
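The effect of k on noisy data can be illustrated with a small, self-contained sketch. This is a generic majority-vote classifier using Euclidean distance, not the in-database algorithm; the function knn_predict and the sample data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    # train is a list of (point, label) pairs; vote among the k closest points.
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.8, 1.1), "A"), ((1.1, 1.3), "A"),
    ((1.05, 1.05), "B"),  # a mislabeled ("noisy") case inside the A cluster
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
query = (1.04, 1.04)

pred_k1 = knn_predict(train, query, 1)  # follows the single noisy neighbor
pred_k5 = knn_predict(train, query, 5)  # the noisy case is outvoted 4 to 1
print(pred_k1, pred_k5)
```

With k = 1 the prediction copies the noisy case, whereas k = 5 smooths it out. A much larger k would begin to pull in cases from the distant cluster, which is the loss of resolution described above.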
Enhance Performance and Accuracy
Standardize measurements before calculating distance. If selected, this option standardizes the measurements for continuous input fields before calculating the distance values.
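To show why standardization matters for distance calculations, the following sketch rescales each continuous field to zero mean and unit standard deviation (a z-score). That the option uses this particular formula is an assumption; the function standardize and the sample values are hypothetical.

```python
import statistics

def standardize(column):
    # Rescale to zero mean and unit (population) standard deviation.
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    if sd == 0:
        # A constant field carries no distance information.
        return [0.0 for _ in column]
    return [(v - mean) / sd for v in column]

# Two fields on very different scales: income dominates any raw distance.
income = [30000.0, 32000.0, 90000.0]
age = [25.0, 60.0, 27.0]

z_income = standardize(income)
z_age = standardize(age)
print(z_income, z_age)
```

Without standardization, a 35-year age gap is negligible next to a 2,000-unit income gap; after standardization, both fields contribute on a comparable scale.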
Use coresets to increase performance for large datasets. If selected, this option uses coreset sampling to speed up the calculation when large data sets are involved.