# An End-to-End Approach for Scaling Up Spectral Clustering

August 24, 2018 | Written by: IBM Research Editorial Staff

Categorized: AI


Clustering is one of the most fundamental problems in machine learning and data mining tasks, such as image segmentation, power load clustering, and community detection for social networks. But well-known clustering techniques like K-Means, which is based on Euclidean proximity, may not be capable of clustering data that lies on a high-dimensional manifold, as illustrated in the following figure.
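This failure mode is easy to reproduce. The following sketch (an illustration added here, using scikit-learn's two-moons toy data rather than the figure from the original post) shows K-Means, which partitions by Euclidean proximity to centroids, failing to recover two interleaving half-moon clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: connected clusters that are not
# compact within convex boundaries.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-Means clusters by Euclidean distance to centroids, so it tends to
# split each moon roughly in half instead of recovering the moons.
y_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Accuracy against the true moon labels (best over label permutation);
# typically well below 1.0 for K-Means on this data.
acc = max(np.mean(y_kmeans == y_true), np.mean(y_kmeans != y_true))
print(f"K-Means accuracy on two moons: {acc:.2f}")
```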

In the past two decades, spectral clustering has shown great promise for learning hidden cluster structures from data that are connected but not necessarily compact within convex boundaries. However, a fundamental challenge for spectral clustering is its computational complexity, which is at least quadratic in the number of data points, making it rarely the first choice for large-scale clustering problems. To circumvent the two computational bottlenecks of spectral clustering on large-scale datasets, we developed a novel end-to-end approach for scaling up spectral clustering. We describe our approach and present the theoretical analysis in our recent paper “Scalable Spectral Clustering Using Random Binning Features” [1] at the 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2018).

# End-to-end pipeline of spectral clustering

Given a set of N data points X = [x_{1}, …, x_{N}] ∈ R^{N×d}, with x_{i} ∈ R^{d}, the spectral clustering algorithm constructs a similarity matrix W ∈ R^{N×N}, representing an affinity graph G = (V, E), where the node v_{i} ∈ V represents the data point x_{i} and the edge E_{ij} carries the similarity W_{ij} between x_{i} and x_{j}. The goal of spectral clustering is to use W to partition x_{1}, …, x_{N} into K clusters. There are many ways to construct the graph, such as K-nearest-neighbor (KNN) graphs or kernel functions. To fully utilize the complete similarity information, we consider a fully connected graph instead of a KNN-based graph for spectral clustering. Without loss of generality, we consider the widely used normalized spectral clustering, where the normalized graph Laplacian L is defined as:

L = I – D^{-½}WD^{-½}

where D ∈ R^{N×N} is the degree matrix with each diagonal element:

D_{ii} = Σ_{j=1,..,N}W_{ij}.

Then we compute the spectral decomposition of L to obtain U ∈ R^{N×K}, whose columns are the K eigenvectors of L with the smallest eigenvalues. The high computational cost of constructing the similarity matrix, O(N^{2}d), the cost of the eigendecomposition, O(KN^{2}), and the O(N^{2}) memory consumption make spectral clustering hard to scale to large problems.
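As a reference point, here is a minimal sketch of this naive pipeline. The Gaussian similarity and the gamma parameter are illustrative choices (the post does not fix a particular kernel); the quadratic costs of graph construction and eigendecomposition are visible directly in the code:

```python
import numpy as np
from sklearn.cluster import KMeans

def naive_spectral_clustering(X, K, gamma=1.0, seed=0):
    """Naive normalized spectral clustering on a fully connected graph.
    The Gaussian similarity and gamma are illustrative choices; the
    O(N^2 d) graph construction and O(N^2) memory are the bottlenecks
    discussed in the post."""
    N = X.shape[0]
    # Pairwise similarity matrix W: O(N^2 d) time, O(N^2) memory
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-gamma * sq_dists)
    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(N) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # U = the K eigenvectors of L with the smallest eigenvalues
    _, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    U = eigvecs[:, :K]
    # Final step: K-Means on the rows of the spectral embedding
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(U)
```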

# Challenges of (naive) spectral clustering

To accelerate spectral clustering, many efforts have been devoted to addressing the two aforementioned computational bottlenecks: 1) constructing the pairwise similarity graph W from the raw data X; and 2) computing the eigendecomposition of the graph Laplacian matrix L. One line of research, including randomized sketching, the power method, sequential reduction algorithms, and graph filtering of random signals, can only partially reduce the cost of the eigendecomposition; it still requires quadratic computation and memory for graph construction. Another family of methods uses landmarks or representative points to jointly improve the efficiency of computing the similarity matrix W and the eigendecomposition of L. Despite promising results in accelerating spectral clustering, these approaches disregard considerable information in the raw data and may degrade clustering performance. More importantly, there is no guarantee that these heuristic methods can approach the results of the original spectral clustering.

# Scaling up spectral clustering

To fill this gap, we developed an end-to-end approach for scaling up spectral clustering (SC) called SC_RB. SC_RB combines the random binning (RB) kernel approximation technique [2] with a state-of-the-art sparse eigensolver, PRIMME [3], to effectively tackle the two computational bottlenecks: 1) pairwise graph construction; and 2) eigendecomposition of the graph Laplacian matrix.

The first step of SC is to build a similarity graph. We propose to use RB features Z to approximate the pairwise kernel matrix W. RB features were first introduced in [4] and revisited in [2], where they were shown to converge faster than other random-feature methods for scaling up large-scale kernel machines. As a result, we can define our approximate graph Laplacian matrix:

L̂ = I – D̂^{-½} Ŵ D̂^{-½} = I – D̂^{-½} Z Z^{T} D̂^{-½},

where Z is a large sparse N × D matrix of RB features and D̂ = diag(Ŵ1). Note that D̂ can be computed with two matrix-vector multiplications, Ŵ1 = Z(Z^{T}1), without ever forming Ŵ explicitly. Therefore, by defining Ẑ = D̂^{-½}Z, we approximate the graph Laplacian matrix:

L̂ = I – Ẑ Ẑ^{T}.

As a result, we reduce computational complexity of graph construction from O(N^{2}d) to O(NRd + NR).
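To make the construction concrete, the following sketch generates simplified RB features for the Laplacian kernel, following the grid-sampling scheme of [4] (the exact feature generation in our paper may differ in its details), and then forms Ẑ using only the two matrix-vector products noted above:

```python
import numpy as np
from scipy import sparse

def random_binning_features(X, R=50, gamma=1.0, seed=0):
    """Simplified random binning (RB) feature map for the Laplacian
    kernel k(x, y) = exp(-gamma * ||x - y||_1), following the grid
    sampling in [4]. Illustrative sketch, not the paper's exact code."""
    rng = np.random.RandomState(seed)
    N, d = X.shape
    blocks = []
    for _ in range(R):
        # Per-dimension bin widths, delta ~ Gamma(2, 1/gamma), as in [4]
        delta = rng.gamma(shape=2.0, scale=1.0 / gamma, size=d)
        u = rng.uniform(0.0, delta)                     # random grid shift
        bins = np.floor((X - u) / delta).astype(np.int64)
        # Map each point's d-dimensional bin index to a column id
        _, col = np.unique(bins, axis=0, return_inverse=True)
        col = col.ravel()
        blocks.append(sparse.csr_matrix(
            (np.ones(N), (np.arange(N), col)), shape=(N, col.max() + 1)))
    # Z is a sparse N x D matrix with exactly R nonzeros per row
    return sparse.hstack(blocks, format="csr") / np.sqrt(R)

def normalized_rb_embedding(Z):
    """Z_hat = D_hat^{-1/2} Z, with degrees obtained from two
    matrix-vector products d_hat = Z (Z^T 1), never forming W_hat."""
    d_hat = np.asarray(Z @ (Z.T @ np.ones(Z.shape[0]))).ravel()
    return sparse.diags(1.0 / np.sqrt(d_hat)) @ Z
```

Because each data point falls into exactly one bin per grid, every row of Z has R nonzeros, which is what makes the N × D matrix cheap to store and multiply.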

The second step of SC is to compute the K smallest eigenvectors of L̂. Since computing the largest singular vectors of Ẑ is equivalent to computing the smallest eigenvectors of L̂, we directly compute the K largest singular vectors of Ẑ. However, the RB feature matrix Ẑ is a very large sparse matrix of size N × D, which challenges standard SVD solvers: they either converge slowly or demand a large memory footprint. To address these challenges while keeping the memory footprint low and the convergence near-optimal, we leverage a current state-of-the-art eigenvalue and SVD solver [3,5], the PReconditioned Iterative MultiMethod Eigensolver (PRIMME). As a result, we reduce the computational complexity of the eigendecomposition of the graph Laplacian from at least O(N^{2}K) to O(NKR).
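A sketch of this second step, with scipy's sparse `svds` standing in for PRIMME (the solver actually used in the paper), looks like:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def cluster_rb_embedding(Z_hat, K, seed=0):
    """Cluster via the top-K left singular vectors of Z_hat.
    Because L_hat = I - Z_hat Z_hat^T, the K largest left singular
    vectors of Z_hat are exactly the K smallest eigenvectors of L_hat,
    so L_hat is never formed. scipy's svds stands in for PRIMME here."""
    U, _, _ = svds(Z_hat, k=K)   # K largest singular triplets of Z_hat
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(U)
```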

Therefore, the total computational complexity and memory consumption are O(NRd + NKR + NK^{2}), including the cost of K-Means. The linear complexity in the number N of data points renders SC_RB scalable to large-scale problems.

We have also extended the analysis in [3] to study the convergence of SC under the RB approximation, and shown that SC via RB converges to the exact SC faster than with the standard random-feature approximation.

In our extensive experiments on eight benchmark datasets evaluated by four different clustering metrics, the SC_RB technique either outperforms or matches the state-of-the-art methods in both accuracy and computational time. More importantly, we corroborate the scalability of our method in two settings: a varied number of data samples N and a varied number of RB features R. In both cases, our method exhibits linear scalability with respect to N or R. Detailed experimental results are included in our paper [1]. An implementation of SC_RB can be found in the open-source IBM repository on GitHub.

# References

[1] Scalable Spectral Clustering Using Random Binning Features. Lingfei Wu, Pin-Yu Chen, Ian En-Hsu Yen, Fangli Xu, Yinglong Xia, and Charu Aggarwal. KDD 2018.

[2] Revisiting Random Binning Feature: Fast Convergence and Strong Parallelizability. Lingfei Wu, Ian E.H. Yen, Jie Chen and Rui Yan. KDD 2016.

[3] PRIMME_SVDS: A High-Performance Preconditioned SVD Solver for Accurate Large-scale Computations. Lingfei Wu, Eloy Romero, and Andreas Stathopoulos. SIAM J. Sci. Comput. (2017), 39(5), pp. S248–S271.

[4] Random Features for Large-Scale Kernel Machines. Ali Rahimi and Benjamin Recht. NIPS 2007.

[5] PRIMME: PReconditioned Iterative MultiMethod Eigensolver: Methods and Software Description. A. Stathopoulos and J. R. McCombs. ACM Transactions on Mathematical Software, Vol. 37, No. 2 (2010), 21:1–21:30.

*We will present our KDD paper on Thursday, August 23, during Research Track Session RT18: Unsupervised Learning II (Chair: Matteo Riondato), 1:30PM ‑ 3:00PM, in the ICC Capital Suite Room 8+11 (Level 3).*
