Building High-Performance Data Pipelines with DuckDB: My Hands-On Experience
Hi, this is James with an issue of the talk data to me, lol Newsletter. In every issue, I cover topics related to data and analytics through the lens of a data engineer. If you're into data engineering, architecture, algorithms, infrastructure, and dashboards, then subscribe here. Please connect with me via my LinkedIn here.
As data engineers and professionals, we are constantly on the lookout for tools and technologies that can streamline our data processing workflows, especially when dealing with large volumes of data. In this blog post, I will share my experience of building data pipelines using DuckDB, a high-performance, embeddable analytical database engine that has been making waves in the data engineering community.
Why DuckDB?
DuckDB stands out for its ability to process large datasets on a single machine using standard SQL. It is designed for analytical workloads and can handle data in various formats such as CSV and Parquet. One of its key advantages is the combination of a lightweight setup with powerful capabilities, making it an excellent choice for a wide range of data projects that fit on a single machine.
Data Lake Architecture
For my project, I adopted a multi-tier architecture, often referred to as the Medallion architecture. This design pattern organizes data into three zones:
Bronze Zone: This zone contains raw, unprocessed data ingested from various sources.
Silver Zone: Here, the data is cleaned, conformed, and potentially modeled.
Gold Zone: This zone contains aggregated and curated data ready for data products, reporting, dashboards, and advanced analytics.
Each zone uses an appropriate partitioning scheme to optimize data ingestion, query performance, and overall efficiency. For example, in the Bronze Zone, data is partitioned by both day and hour, while in the Gold Zone, it is partitioned only by day.
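To make the partitioning concrete, here is a minimal sketch using DuckDB's Python API; the table, paths, and column names are purely illustrative, not my actual pipeline code.

import duckdb

con = duckdb.connect()

# Illustrative only: a tiny fake event table so the partitioned write runs end to end.
con.sql("""
    CREATE TABLE raw_events AS
    SELECT TIMESTAMP '2024-01-01 00:00:00' + i * INTERVAL 1 HOUR AS created_at,
           i AS event_id
    FROM range(48) t(i);
""")

# Bronze Zone: Hive-style partitioning by both day and hour, producing a layout like
# datalake/bronze/events/day=2024-01-01/hour=15/data_0.parquet
con.sql("""
    COPY (SELECT *, CAST(created_at AS DATE) AS day, hour(created_at) AS hour
          FROM raw_events)
    TO 'datalake/bronze/events' (FORMAT PARQUET, PARTITION_BY (day, hour));
""")

# Gold Zone tables follow the same idea, but with PARTITION_BY (day) only.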
Data Pipeline Steps
Here’s a high-level overview of the steps involved in my data pipeline:
Step 1: Data Ingestion
I started by ingesting hourly datasets from sources like the GitHub Archive. This involved loading the data into the Bronze Zone of my data lake. DuckDB simplifies this process with a single COPY query that reads the data from the source and writes it to Parquet files, which are more efficient for handling large datasets.
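As an illustration, here is a minimal sketch of what such an ingestion step can look like; the GH Archive URL, the target path, and the use of the httpfs extension are assumptions for the example rather than my exact production query.

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")  # needed to read files over HTTP(S)

# One COPY statement: read a single hourly GH Archive dump (gzipped JSON)
# and land it in the Bronze Zone as Parquet.
con.sql("""
    COPY (
        SELECT *
        FROM read_json_auto('https://data.gharchive.org/2024-01-01-15.json.gz')
    )
    TO 'datalake/bronze/2024-01-01-15.parquet' (FORMAT PARQUET);
""")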
Step 2: Data Transformation
The next step was to transform the raw data into a more usable form. This included filtering, cleaning, and conforming the data so that only the relevant events remained. DuckDB's SQL capabilities made this process straightforward, allowing me to use text and date manipulations to extract the necessary fields.
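Here is a sketch of what this transformation can look like, assuming GH Archive-style columns (type, repo, created_at) and the illustrative bronze path from the previous example; the filter list and helper expressions are assumptions, not my exact cleaning logic.

import duckdb

con = duckdb.connect()

# Keep only the event types of interest, flatten the repo struct,
# and derive a day column for later aggregation.
con.sql("""
    COPY (
        SELECT
            id,
            type,
            repo.name AS repo_name,
            split_part(repo.name, '/', 1) AS org,   -- text manipulation
            created_at,
            CAST(created_at AS DATE) AS day         -- date manipulation
        FROM read_parquet('datalake/bronze/*.parquet')
        WHERE type IN ('PushEvent', 'PullRequestEvent', 'WatchEvent')
    )
    TO 'datalake/silver/events.parquet' (FORMAT PARQUET);
""")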
Step 3: Data Aggregation
Finally, I computed the metrics and built an OLAP cube using a GROUP BY ROLLUP query. DuckDB's performance in this step was impressive: it grouped over 200 million rows at four different levels of granularity in under a minute.
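For reference, here is a minimal sketch of this kind of rollup, reusing the illustrative silver file from above; the grouping columns are assumptions, not my exact metric definitions.

import duckdb

con = duckdb.connect()

# One pass over the silver data produces aggregates at progressively coarser
# granularities, from (day, org, repo_name, type) down to the grand total.
con.sql("""
    COPY (
        SELECT day, org, repo_name, type, count(*) AS event_count
        FROM read_parquet('datalake/silver/events.parquet')
        GROUP BY ROLLUP (day, org, repo_name, type)
    )
    TO 'datalake/gold/event_counts.parquet' (FORMAT PARQUET);
""")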
Handling Data Types and Errors
One of the challenges I faced was dealing with inferred data types in DuckDB. In some cases, the first few records of a batch might have one data type, while the rest could have a different type. To handle this, I opted to provide explicit schemas or to cast everything to VARCHAR initially and set the proper types later in the Silver Zone. This approach ensures that no data is lost due to type incompatibilities.
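Both options are easy to express in DuckDB. Here is a small sketch with a hypothetical events.csv; the column names and types are placeholders.

import duckdb

con = duckdb.connect()

# Option 1: provide an explicit schema so type inference never guesses wrong on a later batch.
con.sql("""
    SELECT *
    FROM read_csv('events.csv',
                  header = true,
                  columns = {'id': 'BIGINT', 'type': 'VARCHAR', 'created_at': 'TIMESTAMP'});
""")

# Option 2: read everything as VARCHAR first, then cast in the Silver Zone.
raw = con.sql("SELECT * FROM read_csv('events.csv', all_varchar = true)")
con.sql("""
    SELECT CAST(id AS BIGINT) AS id,
           CAST(created_at AS TIMESTAMP) AS created_at
    FROM raw;
""")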
Performance and Cost Efficiency
The total execution time for my pipeline was approximately 2.5 hours for half a terabyte of data, which is quite fast considering the volume and complexity of the data. This performance, combined with the fact that DuckDB is free, open-source software that runs on a single machine rather than a paid compute cluster or warehouse, makes it an extremely cost-efficient solution for many companies.
Comparison with Other Tools
Often, the question arises whether to use DuckDB or other tools like Postgres or Spark. Here’s my take:
Postgres: Postgres is a full-fledged client-server database aimed primarily at transactional workloads, while DuckDB is an embedded, columnar engine built for analytics that runs in-process on your machine. For heavy analytical queries, including larger-than-memory datasets, DuckDB is usually the better fit.
Spark: For very large, distributed data volumes, Spark might be necessary, but for datasets that fit on a single machine, DuckDB offers a more streamlined and cost-effective solution.
Additional Features and Tips
DuckDB has several features that make it a powerful tool:
SQL Support: DuckDB offers great SQL support out of the box, including federated queries over external files and databases, backed by its own columnar storage engine.
Custom Functions: You can register custom Python functions and call them in SQL, which is incredibly useful.
Profiling and Optimization: DuckDB allows you to enable profiling and customize the profiling format, which is helpful for optimizing query performance.
Memory Management: You can set memory limits and manage threads for parallel query execution, ensuring efficient resource usage (a short sketch covering these settings, profiling, and a custom Python function follows below).
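To illustrate the last three points, here is a minimal sketch using the Python API; the helper function, the 8GB/4-thread limits, and the profile file name are arbitrary choices for the example.

import duckdb
from duckdb.typing import VARCHAR

con = duckdb.connect()

# Resource management: cap memory usage and the number of threads for parallel execution.
con.sql("SET memory_limit = '8GB';")
con.sql("SET threads = 4;")

# Profiling: write a JSON query profile so slow operators are easy to spot.
con.sql("PRAGMA enable_profiling = 'json';")
con.sql("PRAGMA profiling_output = 'profile.json';")

# Custom function: register a Python function and call it from SQL.
def extract_org(repo_name: str) -> str:
    # Hypothetical helper: take the organization part of an 'org/repo' string.
    return repo_name.split('/')[0] if repo_name else None

con.create_function('extract_org', extract_org, [VARCHAR], VARCHAR)
con.sql("SELECT extract_org('duckdb/duckdb') AS org").show()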
Conclusion
DuckDB has proven to be an essential tool in my data engineering toolkit. Its ease of use, powerful SQL capabilities, and cost efficiency make it an ideal choice for building high-performance data pipelines. Whether you are dealing with smaller datasets or larger-than-memory datasets, DuckDB is definitely worth considering. If you are new to DuckDB, I recommend checking out the official documentation.
Do you have any experience with DuckDB? Please share your comments below.