Accelerating API Responses with Smart Architecture

Experience a boost in API response times with our paradigm-shifting move from Semi-Lambda to TiDB & Kappa architecture

Chunting Wu
Stackademic


We recently launched a big technical revamp, and the results have been impressive. We’ve seen a 9x improvement in response times for specific APIs (from 9 seconds down to 1 second).

You may ask why the API response time was so long in the first place. It’s mainly due to the nature of the product: as a data product, it does a lot of data aggregation and analysis, and each API response is the result of analyzing data over a period of time. That kind of work naturally carries higher latency.

Nevertheless, through this technical revamp, we have significantly reduced the API latency while maintaining the product features.

This is due to two core changes.

  1. Database changes
  2. Kappa architecture

Let’s take a closer look at what we did.

Semi-Lambda Architecture

At first, our architecture was a “semi-Lambda” architecture. To understand what a semi-Lambda architecture is, let’s first look at a Lambda architecture.

A typical Lambda architecture has three components.

  1. Batch layer
  2. Speed layer
  3. Serving layer

Data processing from the data source is divided into two paths. One is periodic batch processing, which has the advantage of processing a large amount of data efficiently.

On the other hand, to keep the analysis fresh, there is a real-time path that processes the data in the current time period. The results from both paths are merged in a serving layer so that end users can get the final results.

For example, if the time is 12:30 and the batch layer runs every full hour, then the data from 12:01 to 12:30 will be processed by the speed layer.

The above is a standard Lambda architecture.
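
To make the merge concrete, here is a minimal sketch of a Lambda serving layer (this example is mine, not from the original design); the two lookup functions and their in-memory views are hypothetical stand-ins for real batch and real-time storage.

```python
# Minimal sketch of a Lambda serving layer merging the two processing paths.
# The in-memory "views" below are hypothetical stand-ins for real storage.

BATCH_VIEW = {"dev-42": 900}  # precomputed totals up to the last full hour
SPEED_VIEW = {"dev-42": 35}   # real-time totals for the in-progress window

def batch_counts(device_id: str) -> int:
    """Batch layer result: complete hours, e.g. everything up to 12:00."""
    return BATCH_VIEW.get(device_id, 0)

def speed_counts(device_id: str) -> int:
    """Speed layer result: the current partial window, e.g. 12:01 to 12:30."""
    return SPEED_VIEW.get(device_id, 0)

def serve_event_count(device_id: str) -> int:
    # Serving layer: merge both paths into a single answer for the client.
    return batch_counts(device_id) + speed_counts(device_id)

print(serve_event_count("dev-42"))  # 935
```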

Then, what is a semi-Lambda architecture?

We also have a batch layer and a serving layer, but the speed layer is missing. Instead, we write the raw data directly into the serving layer, and the client (e.g., the API invoked by the upper layer) runs the logic to process the raw data and combine it with the batch results.

In other words, the API itself acts as an on-the-fly speed layer, as the sketch below shows.
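
Here is the same example reworked for the semi-Lambda pattern (again a hypothetical sketch, with in-memory stand-ins for what we actually stored in Postgres); note that the aggregation now happens inside the API on every request.

```python
# Minimal sketch of the semi-Lambda pattern: no speed layer exists, so the
# API aggregates recent raw rows at request time and merges them with the
# precomputed batch result. The data below is a hypothetical stand-in.

BATCH_RESULT = {"dev-42": 900}  # produced by the batch layer
RAW_ROWS = [                    # raw data written straight into the serving layer
    {"device_id": "dev-42", "events": 20},
    {"device_id": "dev-42", "events": 15},
]

def api_event_count(device_id: str) -> int:
    # The expensive part: every request scans and aggregates raw rows,
    # which is why response times balloon as raw data grows.
    on_the_fly = sum(r["events"] for r in RAW_ROWS if r["device_id"] == device_id)
    return BATCH_RESULT.get(device_id, 0) + on_the_fly

print(api_event_count("dev-42"))  # 935
```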

The drawbacks of such an architecture are obvious.

  1. Postgres is a single-node database, so even with read-write splitting, the replicas still need to synchronize with the primary constantly. This is fatal in scenarios with heavy raw data writes, where replication I/O consumes a lot of system resources.
  2. Postgres is limited in its capability to handle big data. Although partitioning can make data blocks smaller and easier to read, a single database instance is not enough when the workload spans a large number of partitions.

Therefore, we need to propose solutions to these two problems.

Streaming ingest problem

First of all, we need to solve the bottleneck of writing large amounts of data into the database, so choosing a proper database is essential.

Those who have been following me will know that I have been studying RTOLAP databases such as Apache Pinot for some time. But in the end, we didn’t go with RTOLAP.

The main reason is that RTOLAP databases do not perform well in two specific scenarios.

  1. Upsert
  2. Join

Because of the application’s features, we need to perform frequent upserts on big data, and in addition, we need to do a large number of joins, whether cross-table joins or self-joins.

Moreover, it is hard to say that RTOLAP databases have comprehensive SQL support, and we wanted to minimize the trial-and-error work required for migration, so we didn’t choose an RTOLAP database in the end.

Fortunately, around the same time, we got some help from a vendor and learned about a distributed database fully compatible with MySQL: TiDB. I previously wrote an article describing TiDB application scenarios. In fact, TiDB supports both OLTP and OLAP scenarios, which fits our needs.
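
To show what this compatibility buys us, here is a minimal sketch of the two problem scenarios (upsert and join) running against TiDB over the MySQL protocol; the host, credentials, and schema are hypothetical, not our production setup.

```python
# Minimal sketch: frequent upserts and a join against TiDB, which speaks the
# MySQL wire protocol (default port 4000). Requires `pip install pymysql`.
# Host, credentials, and schema here are hypothetical.
import pymysql

conn = pymysql.connect(host="tidb.internal", port=4000, user="app",
                       password="secret", database="analytics")

with conn.cursor() as cur:
    # Upsert: TiDB supports MySQL's INSERT ... ON DUPLICATE KEY UPDATE,
    # the kind of statement RTOLAP engines typically handle poorly.
    cur.execute(
        """
        INSERT INTO device_metrics (device_id, window_start, event_count)
        VALUES (%s, %s, %s)
        ON DUPLICATE KEY UPDATE event_count = event_count + VALUES(event_count)
        """,
        ("dev-42", "2024-01-01 12:00:00", 100),
    )
    conn.commit()

    # Join: ordinary SQL joins work as-is, so Postgres-era queries need
    # only minor syntax adjustments during the migration.
    cur.execute(
        """
        SELECT m.device_id, d.region, SUM(m.event_count)
        FROM device_metrics m
        JOIN devices d ON d.device_id = m.device_id
        GROUP BY m.device_id, d.region
        """
    )
    print(cur.fetchall())

conn.close()
```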

Speed layer problem

Once the database selection is finalized, the next step is to address the lack of a complete speed layer.

There are two possible directions.

  1. Build a complete speed layer for the Lambda architecture.
  2. Move to a Kappa architecture.

In the end, we chose the latter and started to move the architecture to Kappa.

The reason is that both Lambda and Kappa require building streaming processing capabilities, but to build a speed layer for the Lambda architecture we would still have to rewrite the batch logic in a streaming framework. It is better to migrate the batch logic to Kappa once and for all, so that any feature needed in the future only has to be added to the streaming framework.

This is one of the advantages of Kappa over Lambda: there is no need to maintain two pipelines (batch and real-time) with the same logic.

After moving to the Kappa paradigm, the whole architecture becomes much simpler: Flink is responsible for all data processing, and batch processing disappears. In addition, Postgres was replaced by TiDB, which of course required changes to the query syntax, but it was well worth it.
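
As a rough illustration of what such a pipeline can look like, here is a minimal PyFlink Table API sketch. It assumes Kafka as the event source and Flink's JDBC connector as the TiDB sink; the topic, table, and connection details are hypothetical, not our actual setup.

```python
# Minimal Kappa-style pipeline sketch using PyFlink's Table API.
# Assumes a Kafka source and Flink's JDBC connector writing to TiDB;
# requires the Kafka/JDBC connector JARs and a MySQL driver on the classpath.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: raw events streamed in from Kafka (hypothetical topic and schema).
t_env.execute_sql("""
    CREATE TABLE raw_events (
        device_id STRING,
        event_time TIMESTAMP(3),
        payload_size BIGINT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'raw-events',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# Sink: pre-aggregated results land in TiDB through its MySQL-compatible
# JDBC endpoint; the primary key lets the connector perform upserts.
t_env.execute_sql("""
    CREATE TABLE device_metrics (
        device_id STRING,
        event_count BIGINT,
        total_bytes BIGINT,
        PRIMARY KEY (device_id) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:mysql://tidb.internal:4000/analytics',
        'table-name' = 'device_metrics',
        'username' = 'app',
        'password' = 'secret'
    )
""")

# One streaming job replaces both the batch layer and the on-the-fly speed
# layer: it continuously aggregates and upserts, so API reads become cheap lookups.
t_env.execute_sql("""
    INSERT INTO device_metrics
    SELECT device_id, COUNT(*) AS event_count, SUM(payload_size) AS total_bytes
    FROM raw_events
    GROUP BY device_id
""")
```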

Finally, we solved the OLAP query latency problem by combining streaming pre-processing with the power of the TiDB distributed database.

Conclusion

A paradigm shift is a battle against time, and keeping the system optimized while guaranteeing development productivity is a big challenge.

In addition to initiating discussions and a series of technology selections, it is even more important to plan the actual migration and execute it efficiently. This requires architects, PMs, and developers to work together.

One question gets asked frequently: how much benefit will this migration bring? It’s a tough question to answer before getting one’s hands dirty. But if we don’t know the benefits, how can we have the confidence to invest for such a long period of time?

From my point of view, we need to list the problems we want to solve and prioritize them, because that prioritization shapes the final solution. Then, identify one or two showcases to act as pilots, and learn as much as possible from them, rather than launching a large-scale revamp all at once.

When results are achieved in the showcases, we will be able to measure the benefits.

Engineering differs from science in nature: engineering is learning by doing, not by theorizing. Airplanes flew before scientific theories could explain them. That’s my mindset on technical revamps; let me know what you think.
