
Hierarchical Density-Based Spatial Clustering (HDBSCAN)© uses unsupervised learning to find clusters, or dense regions, in a dataset.

The HDBSCAN node in Cloud Pak for Data exposes the core features and commonly used parameters of the HDBSCAN library. The node is implemented in Python, and you can use it to cluster your dataset into distinct groups when you don't know in advance what those groups are. Unlike most learning methods in Cloud Pak for Data, HDBSCAN models do not use a target field. This type of learning, with no target field, is called unsupervised learning. Rather than trying to predict an outcome, HDBSCAN tries to uncover patterns in the set of input fields. Records are grouped so that records within a group or cluster tend to be similar to each other, but records in different groups are dissimilar.

The HDBSCAN algorithm views clusters as areas of high density separated by areas of low density. Because of this rather generic view, clusters found by HDBSCAN can be any shape, as opposed to k-means, which assumes that clusters are convex. Outlier points that lie alone in low-density regions are also marked. HDBSCAN also supports scoring of new samples. 1
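To make the density-based view more concrete, the following is a minimal pure-Python sketch of a simplified, fixed-radius density clustering (DBSCAN-style). It is not the HDBSCAN algorithm and not the node's implementation; it only illustrates how dense regions can form clusters of any shape while isolated points are marked as outliers. The function name `density_cluster`, the parameters `eps` and `min_pts`, the `NOISE` label, and the toy data are all assumptions made for this illustration.

```python
import math

NOISE = -1  # illustrative label for points that belong to no cluster


def density_cluster(points, eps, min_pts):
    """Simplified DBSCAN-style clustering (an illustration, not HDBSCAN).

    A point is a "core" point if at least min_pts points (itself included)
    lie within distance eps. Clusters grow outward from core points.
    """
    n = len(points)
    # For every point, list the indices of all points within eps of it.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue  # only an unlabelled core point starts a new cluster
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not extend it
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels


# Toy data: a ring (non-convex) surrounding a small blob, plus one far outlier.
ring = [
    (5 * math.cos(2 * math.pi * k / 24), 5 * math.sin(2 * math.pi * k / 24))
    for k in range(24)
]
blob = [(0.0, 0.0), (0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
outlier = [(20.0, 20.0)]
points = ring + blob + outlier

labels = density_cluster(points, eps=1.5, min_pts=3)
print("ring:", set(labels[:24]))     # all ring points share cluster 0
print("blob:", set(labels[24:29]))   # all blob points share cluster 1
print("outlier:", labels[29])        # the isolated point stays NOISE (-1)
```

In this toy data the ring and the blob have the same centroid, so a centroid-based method such as k-means could not separate them, whereas the density-based sketch keeps them apart and leaves the isolated point unclustered.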

To use the HDBSCAN node, you must set up an upstream Type node. The HDBSCAN node reads input values from the Type node (or from the Types of an upstream import node).

For more information about HDBSCAN clustering algorithms, see the HDBSCAN documentation. 1

1 "User Guide / Tutorial." The hdbscan Clustering Library. Web. © 2016, Leland McInnes, John Healy, Steve Astels.