[box type=”note” align=”” class=”” width=””]The article is an excerpt from our book titled Mastering Java Machine Learning by Dr. Uday Kamath and Krishna Choppella.[/box]
This book introduces you to an array of expert machine learning techniques, including classification, clustering, anomaly detection, stream learning, active learning, semi-supervised learning, probabilistic graph modelling and a lot more. The article given below is extracted from Chapter 5 of the book – Real-time Stream Machine Learning, explaining 4 popular algorithms for Distance-based outlier detection.
Distance-based outlier detection is the most studied, researched, and implemented method in the area of stream learning. There are many variants of the distance-based methods, based on sliding windows, the number of nearest neighbors, radius and thresholds, and other measures for considering outliers in the data. We will try to give a sampling of the most important algorithms in this article.
Inputs and outputs
Most algorithms take the following parameters as inputs:
- Window size w, corresponding to the fixed size on which the algorithm looks
for outlier patterns.
- Sliding size s, corresponds to the number of new instances that will be added
to the window, and old ones removed.
- The count threshold k of instances when using nearest neighbor computation.
- The distance threshold R used to define the outlier threshold in distances.
Outliers as labels or scores (based on neighbors and distance) are outputs.
How does it work?
We present different variants of distance-based stream outlier algorithms, giving insights into what they do differently or uniquely. The unique elements in each algorithm define what happens when the slide expires, how a new slide is processed, and how outliers are reported.
Exact Storm stores the data in the current window w in a well-known index structure, so that the range query search or query to find neighbors within the distance R for a given point is done efficiently. It also stores k preceding and succeeding neighbors of all data points:
- Expired Slide: Instances in expired slides are removed from the index structure that affects range queries but are preserved in the preceding list of neighbors.
- New Slide: For each data point in the new slide, range query R is executed, results are used to update the preceding and succeeding list for the instance, and the instance is stored in the index structure.
- Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, any instance with at least k elements from the succeeding list and non-expired preceding list is reported as an outlier.
Abstract-C keeps the index structure similar to Exact Storm but instead of preceding and succeeding lists for every object it just maintains a list of counts of neighbors for the windows the instance is participating in:
- Expired Slide: Instances in expired slides are removed from the index structure that affects range queries and the first element from the list of counts is removed corresponding to the last window.
- New Slide: For each data point in the new slide, range query R is executed and results are used to update the list count. For existing instances, the count gets updated with new neighbors and instances are added to the index structure.
- Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, all instances with a neighbors count less than k in the current window are considered outliers.
Direct Update of Events (DUE)
DUE keeps the index structure for efficient range queries exactly like the other algorithms but has a different assumption, that when an expired slide occurs, not every instance is affected in the same way. It maintains two priority queues: the unsafe inlier queue and the outlier list. The unsafe inlier queue has sorted instances based on the increasing order of smallest expiration time of their preceding neighbors. The outlier list has all the outliers in the current window:
- Expired Slide: Instances in expired slides are removed from the index structure that affects range queries and the unsafe inlier queue is updated for expired neighbors. Those unsafe inliers which become outliers are removed from the priority queue and moved to the outlier list.
- New Slide: For each data point in the new slide, range query R is executed, results are used to update the succeeding neighbors of the point, and only the most recent preceding points are updated for the instance. Based on the updates, the point is added to the unsafe inlier priority queue or removed from the queue and added to the outlier list.
- Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, all instances in the outlier list are reported as outliers.
Micro Clustering based Algorithm (MCOD)
Micro-clustering based outlier detection overcomes the computational issues of performing range queries for every data point. The micro-cluster data structure is used instead of range queries in these algorithms. A micro-cluster is centered around an instance and has a radius of R. All the points belonging to the micro-clusters become inliers. The points that are outside can be outliers or inliers and stored in a separate list. It also has a data structure similar to DUE to keep a priority queue of unsafe inliers:
- Expired Slide: Instances in expired slides are removed from both microclusters and the data structure with outliers and inliers. The unsafe inlier queue is updated for expired neighbors as in the DUE algorithm. Microclusters are also updated for non-expired data points.
- New Slide: For each data point in the new slide, the instance either becomes a center of a micro-cluster, or part of a micro-cluster or added to the event queue and the data structure of the outliers. If the point is within the distance R, it gets assigned to an existing micro-cluster; otherwise, if there are k points within R, it becomes the center of the new micro cluster; if not, it goes into the two structures of the event queue and possible outliers.
- Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, any instance in the outlier structure with less than k neighboring instances is reported as an outlier.
Advantages and limitations
The advantages and limitations are as follows:
- Exact Storm is demanding in storage and CPU for storing lists and retrieving neighbors. Also, it introduces delays; even though they are implemented in efficient data structures, range queries can be slow.
- Abstract-C has a small advantage over Exact Storm, as no time is spent on finding active neighbors for each instance in the window. The storage and time spent is still very much dependent on the window and slide chosen.
- DUE has some advantage over Exact Storm and Abstract-C as it can efficiently re-evaluate the “inlierness” of points (that is, whether unsafe inliers remain inliers or become outliers) but sorting the structure impacts both CPU and memory.
- MCOD has distinct advantages in memory and CPU owing to the use of the micro-cluster structure and removing the pairwise distance computation. Storing the neighborhood information in micro-clusters helps memory too.
Validation and evaluation of stream-based outliers is still an open research area. By varying parameters such as window-size, neighbors within radius, and so on, we determine the sensitivity to the performance metrics (time to evaluate in terms of CPU times per object, Number of outliers detected in the streams,TP/Precision/Recall/ Area under PRC curve) and determine the robustness.
If you liked the above article, checkout our book Mastering Java Machine Learning to explore more on advanced machine learning techniques using the best Java-based tools available.