Anomaly Detection — Product of Data Refinery

Large-scale applications are ubiquitous in today's world, processing hundreds of billions of events and producing thousands of metrics. Sifting through these metrics to surface actionable insights can be challenging without scientific methods. In this blog post, we offer an introduction to predictive analytics in the context of anomaly detection.

"Data is the new Oil." The quote is credited to Clive Humby, who is believed to have first used it in 2006.

It is widely accepted today that any data "refinery" will emit insights in the form of metrics. At eBay, applications emit thousands of metrics across domains such as Checkouts, Search, Payments, Risk, Trust and Shipping, to name a few. Maintaining a robust marketplace means systems must be monitored in real time. Essentially, this means tracking the behavior of metrics over short and long time horizons to generate quick, actionable feedback to systems.

However, even with domain focus, the volume of metrics is so high that we need to put systems and processes in place that help call our attention to anomalous events in metrics.


In the sections that follow, we give a basic introduction to predictive analytics for metrics in the context of model-based anomaly detection, and we present a software design for detecting anomalies in metrics.

More specifically, we explore time series forecasting as an effective way to achieve next-step prediction for metrics. This article focuses on metrics that can be represented as a sequence of scalar observations over discrete, regular time periods, a.k.a. a univariate time series.

Typically, anomaly detection involves training a model on historical metric data. The pattern is described as a function of historical data points, applied in the form of hyperparameters for the model, and a prediction is made. The prediction usually takes the form of a band with a lower and an upper value. When a new value arrives in the time series, we check whether the actual value falls outside the predicted range; if so, we classify it as anomalous. Below is a pictorial depiction of this process.

[Figure: the anomaly detection prediction-band process]

A time series is composed of three components: trend, seasonality and noise. A time series may be described additively as

y(t) = Trend + Seasonality + Noise

Alternatively, it may be described multiplicatively as

y(t) = Trend * Seasonality * Noise
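To make the additive form concrete, here is a minimal sketch of a classical additive decomposition using a centered moving average. It assumes NumPy is available; the synthetic series, the period and the function name are illustrative, not part of our production code.

```python
import numpy as np

def decompose_additive(y, period):
    """Split y(t) into Trend + Seasonality + Noise (classical additive decomposition)."""
    n = len(y)
    # Trend: centered moving average spanning one full season
    trend = np.convolve(y, np.ones(period) / period, mode="same")
    # Seasonality: average the detrended values at each position in the cycle,
    # then center the cycle so the seasonal component sums to ~zero
    detrended = y - trend
    seasonal_cycle = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal_cycle -= seasonal_cycle.mean()
    seasonality = np.tile(seasonal_cycle, n // period + 1)[:n]
    # Noise: whatever trend and seasonality do not explain
    noise = y - trend - seasonality
    return trend, seasonality, noise

# Synthetic daily counts with a weekly (period = 7) cycle and a linear trend
rng = np.random.default_rng(0)
t = np.arange(140)
y = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, t.size)
trend, seasonality, noise = decompose_additive(y, period=7)
```

By construction, the three components add back up to the original series, which is exactly the additive model above.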

Time series forecasting is a vast subject under continual research, with new models and methods being developed. In our case, we start with a popular model called ARIMA (Auto-Regressive Integrated Moving Average), sometimes also referred to as the Box-Jenkins model. Most of our metrics exhibit seasonal behavior, and in some cases multi-seasonal behavior. As a general rule, we pick the seasonality closest to our prediction frequency, since that is the most influential factor in the seasonal prediction. For example, seasonality may be weekly; in the case of intra-day predictions, we pick a 24-hour seasonality, assuming that data behaves similarly at the same time on any two consecutive days.

Next, we have a choice to introduce an external influence to the data in the form of another time series. We refer to this external influence as an exogenous variable. Put together, the model is known as SARIMAX (Seasonal Auto-Regressive Integrated Moving Average with eXogenous variable support).

Let's have a look at the mathematical representation for ARIMA. 

  • AR is a representation of a data point in terms of time-lagged versions of the point, up to p lags:
    • y(t) = φ1·y(t−1) + φ2·y(t−2) + φ3·y(t−3) + … + φp·y(t−p) + ε(t)
  • I represents the order of differencing (d) applied to achieve stationarity:
    • Δy(t) = y(t) − y(t−1), applied d times
  • MA is a representation of past errors that helps carry forward lessons learnt from the past, up to q points:
    • y(t) = θ1·ε(t−1) + θ2·ε(t−2) + θ3·ε(t−3) + … + θq·ε(t−q) + ε(t)
  • Together:
    • Δᵈy(t) = Σ(i=1..p) φi·Δᵈy(t−i) + Σ(j=1..q) θj·ε(t−j) + ε(t)
  • SARIMAX is simply the product of an ARIMA with non-seasonal hyperparameters and an ARIMA with seasonal hyperparameters. For simplicity, we will only provide an abstract representation of SARIMAX below:
    • ARIMA(p,d,q) (non-seasonal) × (P,D,Q)S (seasonal)
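The AR component can be illustrated directly. Below is a minimal sketch (the function name and data are hypothetical, not from our production system) that estimates the φ coefficients of the AR equation by ordinary least squares, on a series generated from a known AR(2) recurrence:

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares estimate of y(t) = phi1*y(t-1) + ... + phip*y(t-p) + c."""
    # Design matrix: column k holds y(t-k-1) for each target y(t), t >= p
    X = np.column_stack([y[p - k - 1 : len(y) - k - 1] for k in range(p)])
    X = np.column_stack([X, np.ones(len(y) - p)])    # intercept column
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef[:p], coef[p]                         # (phi_1..phi_p, constant)

# Series generated by a known AR(2) process: y(t) = 0.6*y(t-1) + 0.3*y(t-2) + 0.5
y = [1.0, 2.0]
for _ in range(40):
    y.append(0.6 * y[-1] + 0.3 * y[-2] + 0.5)
y = np.array(y)

phi, c = fit_ar(y, p=2)   # recovers phi ~ (0.6, 0.3) and c ~ 0.5
```

Because the generating recurrence is exact, least squares recovers the coefficients almost perfectly; with real, noisy metrics the estimates only approximate the underlying process.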

Hyperparameter selection and tuning

The hyperparameters of ARIMA are p, d and q. The seasonal hyperparameters are denoted by P, D and Q, while S refers to the seasonal period of the time series. In addition to the above, the model needs another parameter that describes the trend of the time series; usually the trend is constant or varies linearly.

Before we can fit the model to the series, we must choose the right hyperparameters. We can select their values in one of three ways:

  1. Plot the ACF (Auto-Correlation Function) and PACF (Partial Auto-Correlation Function) and visually determine the values.
  2. Use Auto-ARIMA, which automatically determines the best hyperparameters based on AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) values.
  3. Grid search the hyperparameters, evaluating model performance by minimizing a predetermined error criterion. The criterion can be one of RMSE (Root Mean Square Error), MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error), MASE (Mean Absolute Scaled Error), ME (Mean Error) or MPE (Mean Percentage Error).

In our implementation, we chose option 3 to avoid erroneous and subjective interpretations of ACF and PACF plots. The statistical measure whose error we aimed to minimize is the mean, which is also the measure we output as part of the predicted values, also referred to as forecasts.
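To show the shape of option 3, here is a simplified, self-contained sketch that grid searches a single hyperparameter (the AR order p, standing in for the full (p,d,q)(P,D,Q)S space) by minimizing walk-forward RMSE on a holdout window. All names and data are illustrative.

```python
import numpy as np

def fit_ar(train, p):
    """Least-squares fit of an AR(p) model with intercept; returns coefficients."""
    X = np.column_stack([train[p - k - 1 : len(train) - k - 1] for k in range(p)])
    X = np.column_stack([X, np.ones(len(train) - p)])
    coef, *_ = np.linalg.lstsq(X, train[p:], rcond=None)
    return coef

def forecast_next(coef, history, p):
    """One-step forecast from the last p observations."""
    lags = history[-1 : -p - 1 : -1]          # y(t-1), y(t-2), ..., y(t-p)
    return float(np.dot(coef[:p], lags) + coef[p])

def walk_forward_rmse(y, p, n_test):
    """Refit on an expanding window and score one-step forecasts on the holdout."""
    errors = []
    for i in range(len(y) - n_test, len(y)):
        coef = fit_ar(y[:i], p)
        errors.append(y[i] - forecast_next(coef, y[:i], p))
    return float(np.sqrt(np.mean(np.square(errors))))

# Synthetic series; grid over candidate AR orders, keep the lowest RMSE
rng = np.random.default_rng(2)
t = np.arange(200)
y = 50 + 5 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, t.size)

scores = {p: walk_forward_rmse(y, p, n_test=30) for p in (1, 2, 4, 8)}
best_p = min(scores, key=scores.get)
```

The production search is the same loop over the full SARIMA parameter space, which is why it evaluates thousands of combinations rather than four.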

In addition, we needed to decide the confidence interval at which the predicted mean is output. In our case, we chose a 95% CI.

What is an anomaly?

Once the model is deployed, it outputs the forecasted values as a band, bounded by the upper and lower CI. The forecasted values are compared with the actual values as they arrive in real time. A data point is said to be an anomaly if the actual value falls outside the CI band.
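The comparison itself is simple. A minimal sketch (function and variable names are illustrative):

```python
def is_anomaly(actual, lower_ci, upper_ci):
    """A point is anomalous if it falls outside the forecasted CI band."""
    return actual < lower_ci or actual > upper_ci

# Example: a forecasted band of [90, 110]
print(is_anomaly(104.0, 90.0, 110.0))  # within band -> False
print(is_anomaly(123.5, 90.0, 110.0))  # above band  -> True
```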


Let's consider a real-world metric and apply what we've learned so far. The example metric represents counts at daily intervals. Below is a snapshot of the last 15 days of data; however, the more historical data we have, the more accurate the prediction.

[Figure: the last 15 days of the daily count metric]

Decomposing the time series into its constituents of Trend, Seasonality and Noise, we can see that the series has weekly seasonality, which is useful when grid searching for hyperparameters.

[Figure: decomposition of the series into trend, seasonality and noise]

Feeding the time series to the grid-searching routine to estimate hyperparameters, we get the following output: the top 3 models, ranked by lowest RMSE of the mean, out of thousands of combinations in the search space. The best-performing model has:

p,d,q = 1,0,1

P,D,Q,S = 2,1,0,7

Trend = n (no explicit trend defined)

("[(1, 0, 1), (2, 1, 0, 7), 'n']", 9221.353200844014)

("[(2, 0, 0), (2, 1, 0, 7), 'n']", 9280.010864427197)

("[(1, 1, 0), (2, 1, 0, 7), 'n']", 9280.970349459463)

Over time, as the series accumulates more historical data, we'll need to re-run the grid search to tune the hyperparameters and accommodate new data behavior.

Anomaly detection process flow

The entire process flow is described as a sequence of steps shown in the following diagram.

[Figure: anomaly detection process flow]

The following line chart compares actual metric values over time against the forecasted range defined by the upper and lower CI. For most of the duration, the actual values fall within the forecasted range, which means our model has been forecasting accurately. Towards the end of the timeline, we see potential anomalies that were investigated and classified as true or false anomalies accordingly.

[Figure: actual values versus the forecasted upper and lower CI band]


Now that all the details of the model are in place, let's explore the engineering design for deploying it in production. We'll also look at how the design and deployment scale to more metrics. Below is a simple design showing the logical components at play.

[Figure: logical component design]

  1. The four key blocks represent the Data Source, Data Processing, Data Sink and Data Visualization.
  2. Data Sources may be comprised of a variety of databases that ideally support fast aggregation queries. For intra-day anomaly detection, we expect to have access to real-time data.
  3. Data Processing involves the following functional blocks
    • The Scheduler issues batch triggers to drive the Query Manager. Batch triggers are configured for the frequencies at which we need to predict metric values, e.g., daily, hourly or three-hourly.
    • The Query Manager manages the selection and execution of predefined aggregated queries specific to metrics.
    • A good software design supports a degree of configurability. The hyperparameters and alerts are externalized into a configuration file.
    • The Model Executor reads the configuration for metrics and generates the model and the next-step forecast for the metrics.
    • Forecast Comparison and Alerts compares the forecasts with the actual values; if the actual value of a metric falls outside the confidence interval band of the predictions, a notification is sent based on the alert configuration.
  4. The predicted and actual values are useful for understanding how the model has been performing; therefore, we store them back into a Data Sink, in this case Elasticsearch.
  5. Data Visualization is a key component, with numerous dashboards comparing raw data against anomaly data. This allows us to understand how actual values are moving in reference to forecasted values.
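As an example of the externalized configuration mentioned above, a per-metric config file might look like the following. This is a hypothetical sketch; the field names and values are illustrative, not the actual eBay schema.

```yaml
metrics:
  - name: checkout_counts_daily
    query: checkout_counts_daily_agg      # predefined aggregation query
    schedule: daily                       # batch trigger frequency
    model:
      order: [1, 0, 1]                    # p, d, q
      seasonal_order: [2, 1, 0, 7]        # P, D, Q, S
      trend: n                            # no explicit trend
      confidence_interval: 0.95
    alerts:
      channel: email
      recipients: [oncall@example.com]
```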

Physically, the components of the Data Processing block reside inside a pod in a Tess cluster. Tess is eBay's unified cloud infrastructure based on Kubernetes. Multiple pods are deployed on the cluster, each containing Docker containers for the various services. We recommend that each pod host metrics pertaining to a specific domain such as Risk or Payments, which contains failures at the pod level. For fault tolerance, we recommend deploying pods into several regions/data centers in an Active-Passive configuration.

Physical architecture

[Figure: physical architecture]

Handling anomalies

A false anomaly is essentially a false positive: even though the actual value fell outside the forecasted CI range, it can be attributed to a known, valid reason. In this case, we let the model learn from the behavior so that the particular false positive is accounted for in further forecasting. In other words, we continue to use the actual value in subsequent forecasting.

A true anomaly is essentially unexpected, and we'll need to investigate further to attribute the right cause to it. In this case, we prevent the model from learning this behavior, since we successfully detected an anomaly. Hence, we replace the actual value with the predicted value, ensuring that a similar deviation will continue to be flagged as anomalous in future predictions.
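This feedback rule can be sketched as a small helper (hypothetical names; a simplified stand-in for the actual pipeline): the value appended to the model's training history depends on how the anomaly was classified.

```python
def value_to_learn(actual, predicted, is_anomaly, is_true_anomaly):
    """Decide which value the model should learn from for future forecasts.

    False anomaly (or no anomaly): keep the actual value so the model adapts.
    True anomaly: substitute the prediction so similar deviations stay flagged.
    """
    if is_anomaly and is_true_anomaly:
        return predicted
    return actual

history = [100.0, 102.0, 98.0]
# A true anomaly: learn the predicted value instead of the spike
history.append(value_to_learn(actual=180.0, predicted=101.0,
                              is_anomaly=True, is_true_anomaly=True))
# A false anomaly (known, valid cause such as a sale event): keep the actual value
history.append(value_to_learn(actual=140.0, predicted=101.0,
                              is_anomaly=True, is_true_anomaly=False))
```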


Time series forecasting is a complex subject, with new research and methodologies appearing regularly. This article presents ARIMA as one of the more basic models in practice, one that allows for quickly generating returns. It is worth noting that ARIMA is technically a statistical model as opposed to a pure ML model. Additionally, metrics with low dimensionality and low cardinality are best suited to an ARIMA model. For metrics with high dimensionality and high cardinality of dimensions, anomaly detection scales better with a model closer to pure machine learning, deployed as a Recurrent Neural Network.