HDBSCAN node Build Options

Use the Build Options tab to specify build options for the HDBSCAN node. Basic options cover cluster parameters and cluster labels; advanced options cover additional algorithm parameters and chart output. For additional information about these options, see the hdbscan library's API Reference1 and User Guide / Tutorial.2

Basic

Hyper-Parameter Optimization (Based on Rbfopt). Select this option to enable Hyper-Parameter Optimization based on Rbfopt, which automatically discovers the optimal combination of parameters so that the model will achieve the expected or lower error rate on the samples. For details about Rbfopt, see http://rbfopt.readthedocs.io/en/latest/rbfopt_settings.html.

Min Cluster Size. Specify the minimum size of clusters. Single linkage splits that contain fewer points than the value specified here will be considered points "falling out" of a cluster rather than a cluster splitting into two new clusters.
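
The split rule described for Min Cluster Size can be sketched in a few lines. This is a simplified illustration of the rule as stated above, not the library's implementation; the function name classify_split and its return strings are hypothetical.

```python
def classify_split(left_size, right_size, min_cluster_size):
    """Classify one split of the single linkage tree (illustrative only).

    A split counts as a genuine cluster split only when both sides hold
    at least min_cluster_size points. Otherwise the smaller side is
    treated as points "falling out" of the parent cluster.
    """
    if left_size >= min_cluster_size and right_size >= min_cluster_size:
        return "true split"
    return "points fall out"


# With Min Cluster Size = 5, a 3-point branch does not form a new cluster.
small_branch = classify_split(3, 97, min_cluster_size=5)
balanced = classify_split(40, 60, min_cluster_size=5)
```

Raising Min Cluster Size therefore turns more splits into "falling out" events, which tends to produce fewer, larger clusters.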

Min Samples. Specify the minimum number of samples in a neighborhood for a point to be considered a core point. If set to 0, the Min Cluster Size value is used.
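
The fallback for Min Samples can be written out as a small helper. This is a minimal sketch of the behavior described above; the function name effective_min_samples is hypothetical and is not part of the node or the library.

```python
def effective_min_samples(min_samples, min_cluster_size):
    """Return the Min Samples value that is actually applied.

    A setting of 0 means "not set", in which case the Min Cluster Size
    value is used instead.
    """
    if min_samples == 0:
        return min_cluster_size
    return min_samples


default_case = effective_min_samples(0, min_cluster_size=15)
explicit_case = effective_min_samples(3, min_cluster_size=15)
```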

Algorithm. Select the algorithm to use. HDBSCAN has variants that are specialized for different characteristics of the data. By default, BEST is used, which automatically chooses the best algorithm given the nature of the data. For details about these algorithm types, see the HDBSCAN documentation.1 Note that the algorithm you choose affects performance. For example, for large datasets we recommend trying Boruvka KDTree or Boruvka BallTree.

Metric for Distance. Select the metric to use when calculating distance between instances in a feature array.

Cluster Label. Specify whether the cluster label is a number or a string. If you choose String, specify a prefix for the cluster label (for example, the default prefix is cluster, which results in cluster labels such as cluster-1, cluster-2, etc.).
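
The effect of the Cluster Label setting can be illustrated with a short helper. This is a sketch under assumptions: the function name format_cluster_label is hypothetical, and how the node numbers clusters or labels noise points is not covered here.

```python
def format_cluster_label(cluster_number, use_string_label=False, prefix="cluster"):
    """Return a numeric label, or a prefixed string label such as 'cluster-1'."""
    if use_string_label:
        return f"{prefix}-{cluster_number}"
    return cluster_number


numeric_label = format_cluster_label(2)
default_string = format_cluster_label(1, use_string_label=True)
custom_string = format_cluster_label(2, use_string_label=True, prefix="segment")
```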

Advanced

Approximate Minimum Spanning Tree. Select True if you want to accept an approximate minimum spanning tree. For some algorithms, this can improve performance, but the resulting clustering might be of marginally lower quality. If you are willing to trade speed for an exact result, try the False option. In most cases, True is recommended.

Method to Select Cluster. Select which method to use for selecting clusters from the condensed tree. The standard approach for HDBSCAN is to use an Excess of Mass (EOM) algorithm to find the most persistent clusters. Alternatively, you can select the clusters at the leaves of the tree, which provides the most fine-grained and homogeneous clusters.

Accept Single Cluster. Change this setting to True to allow results that consist of a single cluster. Do this only if a single cluster is a valid result for your dataset.

P Value. If using the Minkowski metric for distance (under Basic build options), you can change this p value if desired.
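
To show what the p value controls, the following sketch computes the standard Minkowski distance in plain Python. The function name minkowski_distance is chosen for this illustration and is not taken from the node or the library.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance: (sum over i of |a_i - b_i| ** p) ** (1 / p)."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


manhattan = minkowski_distance((0, 0), (3, 4), p=1)  # same as Manhattan distance
euclidean = minkowski_distance((0, 0), (3, 4), p=2)  # same as Euclidean distance
```

With p = 1 the metric reduces to Manhattan distance and with p = 2 to Euclidean distance; larger values give more weight to the feature with the largest difference.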

Leaf Size. If using a space tree algorithm (Boruvka KDTree or Boruvka BallTree), this is the number of points in a leaf node of the tree. This setting doesn't alter the resulting clustering, but it may impact the run time of the algorithm.

Validity Index. Select this option to include the Validity Index chart in the model nugget output.

Condensed Tree. Select this option to include the Condensed Tree chart in the model nugget output.

Single Linkage Tree. Select this option to include the Single Linkage Tree chart in the model nugget output.

Min Span Tree. Select this option to include the Min Span Tree chart in the model nugget output.

The following table shows the relationship between the settings in the SPSS® Modeler HDBSCAN node dialog and the Python HDBSCAN library parameters.
Table 1. Node properties mapped to Python library parameters
SPSS Modeler setting                Script name (property name)   HDBSCAN parameter
Inputs                              inputs                        inputs
Hyper-Parameter Optimization        useHPO
Min Cluster Size                    min_cluster_size              min_cluster_size
Min Samples                         min_samples                   min_samples
Algorithm                           algorithm                     algorithm
Metric for Distance                 metric                        metric
Cluster Label                       useStringLabel
Label Prefix                        stringLabelPrefix
Approximate Minimum Spanning Tree   approx_min_span_tree          approx_min_span_tree
Method to Select Cluster            cluster_selection_method      cluster_selection_method
Accept Single Cluster               allow_single_cluster          allow_single_cluster
P Value                             p_value                       p
Leaf Size                           leaf_size                     leaf_size
Validity Index                      outputValidity
Condensed Tree                      outputCondensed
Single Linkage Tree                 outputSingleLinkage
Min Span Tree                       outputMinSpan
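
The mapping in the table can be expressed as a small translation step. The following is a minimal sketch, not part of SPSS Modeler or the hdbscan library: the names PROPERTY_TO_PARAM and to_library_kwargs are hypothetical, and it assumes that the library's constructor takes the Minkowski exponent as p. Node properties with no library counterpart (such as useHPO and the chart output options) are dropped.

```python
# Script property name -> hdbscan keyword argument (assumed names).
PROPERTY_TO_PARAM = {
    "min_cluster_size": "min_cluster_size",
    "min_samples": "min_samples",
    "algorithm": "algorithm",
    "metric": "metric",
    "approx_min_span_tree": "approx_min_span_tree",
    "cluster_selection_method": "cluster_selection_method",
    "allow_single_cluster": "allow_single_cluster",
    "p_value": "p",  # assumption: the library names this parameter p
    "leaf_size": "leaf_size",
}


def to_library_kwargs(node_properties):
    """Keep only the properties that map to a library parameter, renamed."""
    return {
        PROPERTY_TO_PARAM[name]: value
        for name, value in node_properties.items()
        if name in PROPERTY_TO_PARAM
    }


# Example node settings; useHPO and outputCondensed are node-only options.
settings = {
    "useHPO": False,
    "min_cluster_size": 10,
    "metric": "minkowski",
    "p_value": 1.5,
    "outputCondensed": True,
}
kwargs = to_library_kwargs(settings)
```

If the hdbscan package is installed, a dictionary like kwargs could then be passed to the clusterer, for example as hdbscan.HDBSCAN(**kwargs).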

1 "API Reference." The hdbscan Clustering Library. Web. © 2016, Leland McInnes, John Healy, Steve Astels.

2 "User Guide / Tutorial." The hdbscan Clustering Library. Web. © 2016, Leland McInnes, John Healy, Steve Astels.