
Apache Paimon with Flink & Trino: A Streaming Lakehouse Playground

A hands-on guide to integrating Apache Paimon, Flink, and Trino for efficient streaming and querying in data lakehouses.




Apache Paimon is a new data lake format built primarily to solve the challenges of streaming scenarios, while also supporting batch processing. In my view, Paimon has the potential to replace Iceberg as the new standard for the data lakehouse.

Why Iceberg and not the other two (Hudi and Delta Lake)?

Iceberg is the most widely supported format among open-source engines: pure query engines (e.g., Trino), New SQL databases (e.g., StarRocks, Doris), and streaming frameworks (e.g., Flink, Spark) all support it.

However, Iceberg faces several problems in streaming scenarios, the most serious of which is small-file fragmentation. Queries in a data lakehouse rely heavily on file reads, so a query that has to scan many files at once will perform poorly.

With Iceberg, addressing this issue requires an external orchestrator that merges files on a regular schedule. Paimon, by contrast, is designed with a built-in merge mechanism and many other optimizations for high-volume writes, which makes it better suited to streaming scenarios.
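To make the small-file problem concrete, here is a toy Python model. It is only a sketch under loud assumptions: it is not how Iceberg or Paimon actually lay out or merge files, and the one-file-per-checkpoint rule, the `DataFile` class, both function names, and all the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    rows: int  # number of rows stored in this (hypothetical) data file


def streaming_commits(num_checkpoints: int, rows_per_checkpoint: int) -> list[DataFile]:
    """Toy assumption: every checkpoint commits exactly one new small file."""
    return [DataFile(rows_per_checkpoint) for _ in range(num_checkpoints)]


def compact(files: list[DataFile], target_rows: int) -> list[DataFile]:
    """Greedily merge consecutive small files until each output reaches target_rows."""
    merged: list[DataFile] = []
    current = 0
    for f in files:
        current += f.rows
        if current >= target_rows:
            merged.append(DataFile(current))
            current = 0
    if current:
        # Leftover rows that did not fill a whole target-sized file.
        merged.append(DataFile(current))
    return merged


# One commit per minute for a day, 1,000 rows each (illustrative numbers).
files = streaming_commits(num_checkpoints=1440, rows_per_checkpoint=1_000)
compacted = compact(files, target_rows=100_000)

print(len(files), len(compacted))  # 1440 15
```

The row count is the same before and after; only the number of files a query has to open changes. The practical difference between the two formats is who runs the merge step: an external scheduled job, or the table format itself.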

Experiment environment

To learn more about Iceberg, I previously set up two experimental environments.

This time I built a playground for Paimon, which again includes Trino and Flink.

StarRocks is included as well, as a representative of New SQL.


Chunting Wu

Written by Chunting Wu

Architect at SHOPLINE. Experienced in system design, backend development, and data engineering.
