HDBSCAN node Build Options
Use the Build Options tab to specify build options for the HDBSCAN node. Basic options cover the cluster parameters and cluster labels; advanced options cover additional algorithm parameters and chart output. For additional information about these options, see the hdbscan API Reference¹ and User Guide / Tutorial² listed at the end of this topic.
Basic
Hyper-Parameter Optimization (Based on Rbfopt). Select this option to enable Hyper-Parameter Optimization based on Rbfopt, which automatically searches for the combination of parameters that lets the model reach the expected error rate, or a lower one, on the samples. For details about Rbfopt, see http://rbfopt.readthedocs.io/en/latest/rbfopt_settings.html.
Min Cluster Size. Specify the minimum size of clusters. Single linkage splits that contain fewer points than the value specified here will be considered points "falling out" of a cluster rather than a cluster splitting into two new clusters.
Min Samples. Specify the minimum number of samples in a neighborhood for a point to be considered a core point. If set to 0, the default value is the minimum cluster size value.
Algorithm. Select the algorithm to use. HDBSCAN has variants that are specialized for different characteristics of the data. By default, BEST is used, which automatically chooses the best algorithm given the nature of the data. For details about these algorithm types, see the HDBSCAN documentation.¹ Note that the algorithm you choose affects performance. For example, for large data we recommend trying Boruvka KDTree or Boruvka BallTree.
Metric for Distance. Select the metric to use when calculating distance between instances in a feature array.
Cluster Label. Specify whether the cluster label is a number or a string. If you choose String, specify a prefix for the cluster label (for example, the default prefix is cluster, which results in cluster labels such as cluster-1, cluster-2, etc.).
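The Min Cluster Size and Min Samples rules described above can be illustrated with a minimal Python sketch. The function names `effective_min_samples` and `classify_split` are hypothetical and are not part of SPSS Modeler or the hdbscan library. The "cluster ends" case (both sides of a split smaller than Min Cluster Size) is an assumption about how the condensed tree treats that situation; the text above does not describe it.

```python
def effective_min_samples(min_samples: int, min_cluster_size: int) -> int:
    """Return the Min Samples value that is actually used.

    A value of 0 means "fall back to the Min Cluster Size value".
    """
    if min_samples < 0:
        raise ValueError("min_samples must be non-negative")
    return min_cluster_size if min_samples == 0 else min_samples


def classify_split(left_size: int, right_size: int, min_cluster_size: int) -> str:
    """Classify one single-linkage split by comparing each side with Min Cluster Size."""
    left_ok = left_size >= min_cluster_size
    right_ok = right_size >= min_cluster_size
    if left_ok and right_ok:
        # Both sides are large enough: the cluster splits into two new clusters.
        return "true split"
    if left_ok or right_ok:
        # The small side "falls out"; the large side continues as the same cluster.
        return "points fall out"
    # Assumption: neither side is large enough, so the cluster stops here.
    return "cluster ends"


if __name__ == "__main__":
    print(effective_min_samples(0, 5))
    print(classify_split(10, 2, 5))
```

Raising Min Cluster Size makes more splits count as points falling out, so fewer and larger clusters tend to survive.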
Advanced
Approximate Minimum Spanning Tree. Select True if you want to accept an approximate minimum spanning tree. For some algorithms, this can improve performance, but the resulting clustering might be of marginally lower quality. If you are willing to sacrifice speed for correctness, you may want to try the False option. In most cases, True is recommended.
Method to Select Cluster. Select which method to use for selecting clusters from the condensed tree. The standard approach for HDBSCAN is to use an Excess of Mass (EOM) algorithm to find the most persistent clusters. Alternatively, you can select the clusters at the leaves of the tree, which provides the most fine-grained and homogeneous clusters.
Accept Single Cluster. By default, HDBSCAN does not return a single cluster. Change this setting to True to allow single-cluster results, but only if a single cluster is a valid result for your dataset.
P Value. If you use the Minkowski metric for distance (under Basic build options), you can change this p value if desired.
Leaf Size. If you use a space tree algorithm (Boruvka KDTree or Boruvka BallTree), this is the number of points in a leaf node of the tree. This setting doesn't alter the resulting clustering, but it may affect the run time of the algorithm.
Validity Index. Select this option to include the Validity Index chart in the model nugget output.
Condensed Tree. Select this option to include the Condensed Tree chart in the model nugget output.
Single Linkage Tree. Select this option to include the Single Linkage Tree chart in the model nugget output.
Min Span Tree. Select this option to include the Min Span Tree chart in the model nugget output.
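To show what the P Value option controls when the Minkowski metric is selected, here is a minimal Python sketch of the standard Minkowski distance. The function name `minkowski_distance` is illustrative only; it is not the implementation used by the node.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two equal-length numeric sequences.

    p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if len(a) != len(b):
        raise ValueError("both points must have the same number of features")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


if __name__ == "__main__":
    point_a, point_b = [0.0, 0.0], [3.0, 4.0]
    for p in (1, 2, 3):
        print(p, minkowski_distance(point_a, point_b, p))
```

Larger p values give more weight to the feature with the largest difference between the two instances.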
SPSS Modeler setting | Script name (property name) | HDBSCAN parameter |
---|---|---|
Inputs | inputs | inputs |
Hyper-Parameter Optimization | useHPO | |
Min Cluster Size | min_cluster_size | min_cluster_size |
Min Samples | min_samples | min_samples |
Algorithm | algorithm | algorithm |
Metric for Distance | metric | metric |
Cluster Label | useStringLabel | |
Label Prefix | stringLabelPrefix | |
Approximate Minimum Spanning Tree | approx_min_span_tree | approx_min_span_tree |
Method to Select Cluster | cluster_selection_method | cluster_selection_method |
Accept Single Cluster | allow_single_cluster | allow_single_cluster |
P Value | p_value | p_value |
Leaf Size | leaf_size | leaf_size |
Validity Index | outputValidity | |
Condensed Tree | outputCondensed | |
Single Linkage Tree | outputSingleLinkage | |
Min Span Tree | outputMinSpan | |
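One way to read the table: a property with an entry in the HDBSCAN parameter column is passed to the algorithm, while a property with an empty entry applies only to the node (optimization, labeling, or chart output). The following Python sketch encodes that reading. The names `PROPERTY_TO_PARAMETER` and `split_properties` are hypothetical helpers for illustration, not part of the SPSS Modeler scripting API, and the example values are assumptions.

```python
# Script property names from the table, mapped to the HDBSCAN parameter column.
# None marks a property that the table lists without an HDBSCAN parameter.
PROPERTY_TO_PARAMETER = {
    "inputs": "inputs",
    "useHPO": None,
    "min_cluster_size": "min_cluster_size",
    "min_samples": "min_samples",
    "algorithm": "algorithm",
    "metric": "metric",
    "useStringLabel": None,
    "stringLabelPrefix": None,
    "approx_min_span_tree": "approx_min_span_tree",
    "cluster_selection_method": "cluster_selection_method",
    "allow_single_cluster": "allow_single_cluster",
    "p_value": "p_value",
    "leaf_size": "leaf_size",
    "outputValidity": None,
    "outputCondensed": None,
    "outputSingleLinkage": None,
    "outputMinSpan": None,
}


def split_properties(settings):
    """Split script settings into algorithm parameters and node-only properties."""
    unknown = sorted(set(settings) - set(PROPERTY_TO_PARAMETER))
    if unknown:
        raise KeyError(f"properties not listed in the table: {unknown}")
    algorithm_params, node_only = {}, {}
    for name, value in settings.items():
        parameter = PROPERTY_TO_PARAMETER[name]
        if parameter is None:
            node_only[name] = value
        else:
            algorithm_params[parameter] = value
    return algorithm_params, node_only


if __name__ == "__main__":
    # Example values are illustrative assumptions.
    params, node_only = split_properties(
        {"min_cluster_size": 5, "useStringLabel": True, "outputCondensed": True}
    )
    print(params)
    print(node_only)
```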
1 "API Reference." The hdbscan Clustering Library. Web. © 2016, Leland McInnes, John Healy, Steve Astels.
2 "User Guide / Tutorial." The hdbscan Clustering Library. Web. © 2016, Leland McInnes, John Healy, Steve Astels.