Benchmarking Lakehouse with TPC-DS
A step-by-step guide to importing TPC-DS data into Apache Iceberg
Apache Iceberg is already a popular lakehouse table format, and many query engines support it. So how do we choose among those engines when making a technical selection?
In the data warehouse domain, the most commonly used benchmark is TPC-DS, which defines several common scenarios and provides a set of standardized queries. It is widely treated as the gold standard for benchmarking analytical query performance.
TPC-DS is quite popular, and there are many common connectors for dumping its test data into various databases. Even Trino, a pure computing engine, provides a dedicated catalog for TPC-DS. At the moment, however, there is no equivalent for lakehouse formats.
Because no good connector exists for this purpose, this article describes how to load the TPC-DS test data into Iceberg tables.
Experiment environment setup
Building the TPC-DS tools is not the focus of this article, so I’ll start by assuming that dsdgen is already installed.
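Before going further, it can help to confirm that dsdgen is actually reachable from your shell. The following is a minimal sketch, not part of the TPC-DS toolkit: the helper names find_dsdgen and build_dsdgen_command, the output directory /tmp/tpcds-data, and the scale factor of 1 are all assumptions made for illustration. It only builds the command line and prints it, so nothing is generated until you run the command yourself.

```python
import shutil


def find_dsdgen(binary="dsdgen"):
    """Return the absolute path of the dsdgen binary, or None if it is not on PATH."""
    return shutil.which(binary)


def build_dsdgen_command(scale, out_dir, binary="dsdgen"):
    """Build (but do not run) a dsdgen command line.

    scale is the TPC-DS scale factor in GB; out_dir is a hypothetical
    target directory for the generated flat files.
    """
    return [binary, "-SCALE", str(scale), "-DIR", out_dir, "-FORCE", "Y"]


if __name__ == "__main__":
    path = find_dsdgen()
    if path is None:
        print("dsdgen not found on PATH; build the TPC-DS tools first")
    else:
        print("found dsdgen at", path)
    # Print the command so it can be reviewed before being executed.
    print(" ".join(build_dsdgen_command(1, "/tmp/tpcds-data")))
```

If dsdgen is not on your PATH, pass its full path as the binary argument instead.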