
Benchmarking Lakehouse with TPC-DS

A step-by-step guide to importing TPC-DS data into Apache Iceberg

Chunting Wu
Feb 10, 2025


Apache Iceberg is already a popular lakehouse format, and many query engines support it. So how do we choose among those engines when we need to make a technical selection?

In the data warehouse domain, the most commonly used standard is TPC-DS, which defines several common scenarios and provides a set of standardized queries. It is widely treated as the gold standard for benchmarking data warehouse performance.

TPC-DS is quite popular, and there are many common connectors for dumping its test data into various databases. Even Trino, a pure computing engine, provides a dedicated catalog for TPC-DS. At the moment, however, there is no equivalent TPC-DS connector for the lakehouse.

Because the lakehouse has no good connector for this purpose, this article describes how to dump TPC-DS test data into an Iceberg lakehouse.

Experiment environment setup

Building the TPC-DS tools is not the focus of this article, so I’ll start by assuming that dsdgen is already installed.
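As a quick sanity check on that assumption, here is a minimal sketch of how the dsdgen invocation could be assembled from Python. The helper name build_dsdgen_command, the output directory /tmp/tpcds, and the scale factor of 1 are all hypothetical choices for illustration and are not taken from the rest of this article. It uses the -SCALE, -DIR, -PARALLEL and -CHILD options of dsdgen, and it only builds and prints the command rather than running it.

```python
import shutil


def build_dsdgen_command(scale, out_dir, parallel=1, child=1, binary="dsdgen"):
    """Build the argument list for a dsdgen run (hypothetical helper).

    scale    -- scale factor, roughly the raw data size in GB
    out_dir  -- directory where dsdgen writes its flat files
    parallel -- total number of parallel chunks to split generation into
    child    -- which chunk (1..parallel) this invocation generates
    """
    if scale < 1:
        raise ValueError("scale must be at least 1")
    if parallel < 1 or not 1 <= child <= parallel:
        raise ValueError("child must be between 1 and parallel")

    cmd = [binary, "-SCALE", str(scale), "-DIR", str(out_dir)]
    # Only pass the chunking options when the work is actually split.
    if parallel > 1:
        cmd += ["-PARALLEL", str(parallel), "-CHILD", str(child)]
    return cmd


if __name__ == "__main__":
    # Assumption: dsdgen is on PATH. Adjust `binary` if it lives elsewhere.
    if shutil.which("dsdgen") is None:
        print("dsdgen not found on PATH; build the TPC-DS tools first")
    # Print the command instead of executing it; pass the list to
    # subprocess.run(cmd, check=True) once the paths are right.
    print(" ".join(build_dsdgen_command(1, "/tmp/tpcds")))
```

Depending on how the toolkit was built, dsdgen may need to be started from its own tools directory so that it can find its distribution file, so treat the paths above as placeholders.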


Written by Chunting Wu

Architect at SHOPLINE. Experienced in system design, backend development, and data engineering.
