Benchmarking Lakehouse with TPC-DS

A step-by-step guide to importing TPC-DS data into Apache Iceberg

Chunting Wu
4 min read · Feb 10, 2025

Apache Iceberg is already a popular lakehouse format, and many query engines support it. So how do we choose among those engines when making a technical selection?

In the data warehouse domain, the most commonly used benchmark is TPC-DS, which models common decision-support scenarios and provides a set of standardized queries. It is widely treated as the gold standard for comparing query performance.

TPC-DS is popular enough that connectors exist for dumping its test data into various databases. Even Trino, a pure compute engine, provides a dedicated catalog for TPC-DS. For the lakehouse, however, there is no equivalent at the moment.

Since no good connector exists for this purpose, this article describes how to load TPC-DS test data into an Iceberg lakehouse.

Experiment environment setup

Building the TPC-DS tools is not the focus of this article, so I'll start by assuming that dsdgen is already installed.
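For readers who want to see what generating the raw data might look like once dsdgen is available, here is a minimal sketch in Python. It is an illustration, not the method this article necessarily follows: the helper names `build_dsdgen_command` and `generate`, the output path, and the scale factor are all assumptions, and it assumes the `dsdgen` binary is on `PATH`. The `-SCALE`, `-DIR`, and `-FORCE` flags come from the TPC-DS toolkit, but check them against the version you built.

```python
import shutil
import subprocess
from pathlib import Path


def build_dsdgen_command(scale_gb: int, out_dir: str, binary: str = "dsdgen") -> list[str]:
    """Return the argument list for one dsdgen run (hypothetical helper, no side effects)."""
    if scale_gb < 1:
        raise ValueError("scale_gb must be a positive integer")
    return [
        binary,
        "-SCALE", str(scale_gb),  # approximate raw data size in GB
        "-DIR", out_dir,          # directory that receives the generated .dat files
        "-FORCE", "Y",            # overwrite files left over from an earlier run
    ]


def generate(scale_gb: int, out_dir: str) -> None:
    """Run dsdgen, assuming the binary is on PATH (hypothetical helper)."""
    if shutil.which("dsdgen") is None:
        raise FileNotFoundError("dsdgen not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_dsdgen_command(scale_gb, out_dir), check=True)
```

Calling `generate(1, "/tmp/tpcds")` would produce a small data set suitable for a smoke test. Depending on how the toolkit was built, dsdgen may need to be started from its own tools directory so that it can find its distribution file; if so, pass a `cwd` argument to `subprocess.run`.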

Written by Chunting Wu

Architect at SHOPLINE. Experienced in system design, backend development, and data engineering.