How to Determine API Slow Downs, Part 2
Understanding PELT algorithm and its applications in Change Point Detection
A long time ago I wrote an article on how to determine an API is slowing down using simple statistics known as linear regression.
In the conclusion of that article, it was mentioned some challenges in applying linear regression.
- It is hard to define the reference point.
- Difficulty in defining the angle of the regression lines.
The reference point means we need two regression lines to know whether the current situation is normal or abnormal, and the angle between the regression lines is the basis for our judgment.
For those who are familiar with statistics or math, this approach is a bit naive. In fact, I have not formally studied statistics, so I can only use the most straightforward approach, which has been proven to be feasible in practice, but the process of fine-tuning in the early stages will be tough.
After several years of continuous studies, I found a simpler way to do this, which is Changepoint Detection.
Before going into the details, let’s use a diagram to see what Changepoint Detection can do.
The source code is linked here.
As we can see from the diagram, our API slows down twice, and our tool detects the exact point in time. Although I have done smoothing here, in practice it is fine without it, only the detection thresholds need to be adjusted accordingly.
Solution Details
Originally we needed to solve two problems, but now we no longer need a reference point, so only the threshold needs to be defined. I have to say no matter what the mechanism is, thresholds are inevitable, to have an alarm is to need a threshold.
Well, let’s see how the magic happens.
model = Pelt(model="rbf").fit(np.array(df['smoothed_latency']).reshape(-1, 1))
changepoints = model.predict(pen=5)
The source code is a bit long, but the core of the program is these two lines.
We use a famous Changepoint Detection algorithm: Pruned Exact Linear Time aka PELT. This is the algorithm proposed by Killick in 2012.
The whole process is actually three steps.
- Decide what model to use, in this case we use
rbf
. - Feed data points into the model.
- Predict the changes through the model.
Let’s continue to break down these steps.
How to choose the right model?
In ruptures.Pelt
, there are many models (cost functions) are available, the following three are more commonly used.
l2(least squared deviation)
This model is useful mainly for scenarios where there is a change in the average and if there is a significant change in the average, then this model is good to use.
For example, if there is a significant increase in the average latency after a certain release, e.g. 100ms to 200ms, then l2
can easily capture this version.
rbf(Radial Basis Function)
rbf
is more commonly used in the case of irregular variations, such as where both the average and the trend are changing.
For example, a latency in a complex system can be affected by a number of factors, and rbf
can be used to find the more "subtle" changes.
normal
The term normal
refers to a normal distribution or Gaussian distribution. The key to a normal distribution is the mean and standard deviation, so this is used in the context of distributional change.
For example, network traffic during peak hours not only increases in average but also in volatility, which is a kind of distributional change.
How to set the penalty?
In the second line of the code, there is a pen=5
, where pen
refers to the penalty.
Briefly, the larger the penalty, the tighter the model will be, and the fewer changepoints can be found, and vice versa.
So it is still a matter of experimentation as to what value to set, just as we did with linear regression, where we needed to experiment with how to define the angle of the regression lines. Even with PELT, we still need to consider how to set the penalty.
Conclusion
In fact we want to do exactly the same thing as before, we want to detect when the API is slowing down and it is slowing down because of a defect.
At first, we used the angle between the two regression lines to determine this. But this approach, as we mentioned, requires a lot of experimentation to determine how to define the reference point and the threshold of the angle. Although the concept is simple and not difficult to implement, it is not easy to make it work in practice.
In order to reduce the number of experiments, we change the factors from two to one, and only need to determine the penalty through experiments, which is a necessary process no matter what solution is used. In other words, we have greatly reduced the complexity of the API monitoring system.
You may ask, don’t we need to experiment to choose the model? No, not at all.
Because these models have a clear purpose, as long as you know what patterns you want to check, you will naturally choose the corresponding model. The reason we chose rbf
is because we want to know if the API is abnormal as it slows down, which involves not only the average change but also a lot of additional factors, and only rbf
is more fitting.
The source code is quite simple, the core is the two lines introduced in this article. But I have to say, to be able to write these two lines requires a huge accumulation of knowledge, and I also deeply realize that programmer is not just about writing code ONCE AGAIN.