Unsupervised learning techniques do not require labeled data and can handle more complex data sets. Many unsupervised approaches are powered by deep learning, using neural networks or autoencoders that are loosely modeled on the way biological neurons signal to each other. These tools find patterns in the input data and make assumptions about which data should be treated as normal.
These techniques can go a long way toward discovering unknown anomalies and reducing the work of manually sifting through large data sets. However, data scientists should monitor the results of unsupervised learning. Because these techniques make assumptions about the input data, they can flag normal data as anomalous or miss real anomalies.
Unsupervised machine learning algorithms for anomaly detection include:
K-means: This clustering algorithm groups similar data points by repeatedly assigning each point to its nearest cluster center and then recomputing the centers. “Means,” or average data, refers to those centers: each one is the average of the points assigned to its cluster. Once the clusters are formed, points that lie far from every cluster center can be treated as out of the ordinary.
Isolation forest: This type of anomaly detection algorithm works on unlabeled data. Unlike supervised anomaly detection techniques, which work from labeled normal data points, this technique attempts to isolate anomalies as the first step. Similar to a “random forest,” it builds many “decision trees,” each of which repeatedly picks a random feature and a random split value to partition the data points. Anomalies tend to be separated from the rest after fewer splits than normal points, so each point receives an anomaly score between 0 and 1 based on how quickly it is isolated, averaged across the trees; values below 0.5 are generally considered to be normal, while values close to 1 are more likely to be anomalous. An isolation forest implementation is available in scikit-learn, the free machine learning library for Python.
One-class support vector machine (SVM): This anomaly detection technique learns a boundary around the training data, which is assumed to consist mostly of normal points. Points that fall inside the boundary are considered normal and those outside are labeled as anomalies.
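To make two of the ideas above concrete, here is a minimal sketch in plain Python (standard library only). It is an illustration under stated assumptions, not the scikit-learn implementation: the function names (kmeans_1d, path_length, anomaly_score), the toy data, the starting centers and the number of trees are all hypothetical choices made for this example. The K-means part flags the point farthest from any cluster center. The isolation part estimates how many random splits are needed to isolate a point and converts that into a score between 0 and 1 using the normalization from the original isolation forest paper; for brevity it draws fresh random splits for each point and uses the whole data set instead of building trees on subsamples.

```python
import math
import random


def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def distance_to_nearest_center(value, centers):
    return min(abs(value - c) for c in centers)


def c_factor(n):
    """Average path length used to normalize depths for a sample of size n."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def path_length(point, sample, rng, depth, max_depth):
    """Number of random splits needed to isolate `point` within `sample`."""
    if depth >= max_depth or len(sample) <= 1:
        return depth + c_factor(len(sample))
    # Only features that still vary within this partition can be split.
    axes = [a for a in range(len(point))
            if min(p[a] for p in sample) != max(p[a] for p in sample)]
    if not axes:
        return depth + c_factor(len(sample))
    axis = rng.choice(axes)
    lo = min(p[axis] for p in sample)
    hi = max(p[axis] for p in sample)
    split = rng.uniform(lo, hi)
    # Follow only the side of the split that contains `point`.
    same_side = [p for p in sample if (p[axis] < split) == (point[axis] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def anomaly_score(point, data, n_trees=200, seed=0):
    """Score in (0, 1]; a shorter average path gives a score closer to 1."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(data)))
    total = sum(path_length(point, data, rng, 0, max_depth) for _ in range(n_trees))
    return 2.0 ** (-(total / n_trees) / c_factor(len(data)))


# K-means: two tight groups plus one reading (5.5) that sits between them.
readings = [1.0, 1.2, 0.8, 1.1, 10.0, 10.3, 9.8, 10.1, 5.5]
centers = kmeans_1d(readings, centers=[0.0, 12.0])
farthest = max(readings, key=lambda v: distance_to_nearest_center(v, centers))
print("K-means: farthest from any center:", farthest)

# Isolation: a 4 x 4 grid of normal points plus one distant point at (8, 8).
grid = [(x / 4, y / 4) for x in range(4) for y in range(4)]
points = grid + [(8.0, 8.0)]
scores = {p: anomaly_score(p, points) for p in points}
for p in sorted(scores, key=scores.get, reverse=True)[:3]:
    print("isolation score", p, round(scores[p], 3))
```

The distant point should rank first because a single random split usually separates it from the grid. On a data set this small, normal points may land near 0.5 instead of far below it, so the ranking is more informative than the absolute threshold. Note that scikit-learn's IsolationForest reports scores on a different convention: its score_samples method returns the negated score, so lower values mean more abnormal.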