Data Engineering Project

Contents
  1. 🚩 Goal
  2. 📊 Tools primer
  3. 🖥️ TPC-H Data
  4. 🖥️ Process

Tags: data engineering, dbt, snowflake

🚩 Goal

In this project, I design an ELT pipeline using Snowflake as the cloud data warehouse, define the transformation pipelines in a dbt project, and orchestrate them with Airflow and Astronomer Cosmos.

The decision to use ETL or ELT informs subsequent considerations about data pipeline architecture and implementation.

This project showcases the ELT approach, which has grown in popularity with the advent of massively parallel processing databases and cloud infrastructure.

The crux of the ELT approach is to load data directly into your data warehouse and transform it there, as opposed to maintaining a dedicated server that processes and transforms the data before it is loaded.

Visualisation of the difference between ETL and ELT

This key difference is ELT’s most compelling advantage: because data is loaded into the warehouse with minimal processing, its raw state is retained, so downstream pipelines can be recreated, rewritten, or dropped easily in response to changing business priorities.
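As a minimal sketch of what this looks like in practice (using snowflake-connector-python, with placeholder credentials, stage, and table names rather than the project’s actual configuration), the “E” and “L” steps reduce to copying staged files into a raw table, while the “T” step is just SQL executed inside the warehouse:

```python
import snowflake.connector

# Placeholder credentials and object names, not the project's real configuration.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="transform_wh",
    database="analytics_db",
)
cur = conn.cursor()

# Extract + Load: copy staged files into a raw table with minimal processing,
# preserving the source data as-is.
cur.execute("""
    COPY INTO raw.orders
    FROM @raw_stage/orders/
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""")

# Transform: build a cleaned model directly inside the warehouse from the raw
# table. If business logic changes, this statement can be rewritten and re-run
# without touching the raw data.
cur.execute("""
    CREATE OR REPLACE TABLE staging.stg_orders AS
    SELECT
        o_orderkey   AS order_key,
        o_custkey    AS customer_key,
        o_orderdate  AS order_date,
        o_totalprice AS total_price
    FROM raw.orders
""")

cur.close()
conn.close()
```

The raw table is never modified by the transformation step, which is exactly what makes the downstream models cheap to rebuild or discard.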

When used in conjunction with tools like dbt, ELT shifts the responsibility of developing data pipelines to Analytics Engineers / Data Analysts, individuals who understand the business better than Data Engineers, but might not have the technical chops to write scalable, robust ETL code.

📊 Tools primer

Snowflake is a cloud data platform and a key component of the modern data stack. It supports transformation workloads written in SQL, along with a suite of other nifty features, such as autoscaling, a RESTful API, and integration with a broad ecosystem of third-party tools and technologies.

Airflow is an open-source workflow management tool for orchestrating data engineering pipelines. To provide better visibility into the lineage of tasks, this project uses Astronomer Cosmos to integrate dbt and Airflow.
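A minimal sketch of this wiring is shown below, assuming Cosmos’s DbtDag entry point and a Snowflake connection registered in Airflow as snowflake_conn; the dbt project path, profile details, and schedule are placeholders rather than the project’s exact setup:

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

# Map an Airflow connection ("snowflake_conn") onto a dbt profile so Cosmos
# can run the dbt project against Snowflake.
profile_config = ProfileConfig(
    profile_name="data_pipeline",
    target_name="dev",
    profile_mapping=SnowflakeUserPasswordProfileMapping(
        conn_id="snowflake_conn",
        profile_args={"database": "analytics_db", "schema": "staging"},
    ),
)

# DbtDag parses the dbt project and renders each model as its own Airflow
# task, so model-level lineage shows up directly in the Airflow UI.
dbt_snowflake_dag = DbtDag(
    dag_id="dbt_snowflake_dag",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/data_pipeline"),
    profile_config=profile_config,
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```

Compared with wrapping a single dbt run command in one Airflow task, this keeps retries and lineage at the granularity of individual dbt models.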

🖥️ TPC-H Data

The TPC-H benchmark is a widely recognised tool for evaluating the performance of data processing systems. It consists of a suite of decision support queries based on a multi-table schema.

I chose to build the project on the TPC-H dataset for the following reasons:

Realism: TPC-H simulates real-world business operations, entities, and relationships

Integration with Snowflake: TPC-H data tables are available as sample data within Snowflake (see the sketch after the schema below)

Size: the dataset is large enough that building data pipelines from it demands efficient data processing and transformation jobs

Schema of the TPC-H dataset
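Because the sample database ships with every Snowflake account, the tables can be queried as soon as a virtual warehouse is running. Below is a minimal sketch, again using snowflake-connector-python with placeholder credentials, that runs a TPC-H Q1-style pricing summary against the SF1 copy of LINEITEM:

```python
import snowflake.connector

# Placeholder credentials; SNOWFLAKE_SAMPLE_DATA requires no loading step.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="query_wh",
)
cur = conn.cursor()

# A TPC-H Q1-style pricing summary over LINEITEM at scale factor 1.
cur.execute("""
    SELECT
        l_returnflag,
        l_linestatus,
        SUM(l_quantity)      AS sum_qty,
        SUM(l_extendedprice) AS sum_base_price,
        AVG(l_discount)      AS avg_disc,
        COUNT(*)             AS count_order
    FROM snowflake_sample_data.tpch_sf1.lineitem
    WHERE l_shipdate <= DATE '1998-09-02'
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
""")

for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```

In the dbt project, these tables would typically be declared as sources and staged from there rather than queried ad hoc.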

🖥️ Process

Spinning up a local instance of the project

Explore the project here


Further Reading

[1] A brief introduction to Airflow internals / architecture can be found here

[2] A more detailed comparison of ETL vs ELT can be found here

