Intermediate · 26 topics · ~80 hours

Data Engineering

Build production data pipelines — from SQL fundamentals to streaming at scale.

Master the principles and tools of data engineering — from relational databases and SQL to distributed systems, streaming pipelines, and modern data platforms. You'll learn to design, build, and operate reliable data infrastructure at any scale.


What you'll learn

Section 1 · Data Foundations

  1. Relational Databases & SQL
     Understand how relational databases store and organize data using tables, keys, and constraints. Learn to write SQL queries to retrieve, filter, join, and aggregate data confidently.
  2. Data Modeling & Schema Design
     Learn how to design database schemas that accurately represent real-world relationships. Understand normalization, denormalization, and the trade-offs between them for different workloads.
  3. NoSQL Databases
     Explore document stores, key-value stores, column-family, and graph databases. Understand when a non-relational model is a better fit than a traditional relational one.
  4. Python for Data Engineering
     Learn core Python skills used daily in data engineering — file I/O, working with JSON/CSV, using libraries like pandas, and writing clean, testable scripts for data tasks.
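The join-and-aggregate pattern from the first topic can be tried directly in Python's built-in sqlite3 module. This is a minimal sketch with a hypothetical two-table schema (customers and orders are illustration names, not part of any course material):

```python
import sqlite3

# In-memory database with two related tables (hypothetical schema):
# customers and orders, linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 30.0), (3, 2, 20.0);
""")

# Join the tables and aggregate order totals per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()

for name, n_orders, total in rows:
    print(f"{name}: {n_orders} orders, {total:.2f} total")
```

The same `JOIN` / `GROUP BY` shape carries over unchanged to production databases like PostgreSQL.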

Section 2 · Storage & File Formats

  1. Data File Formats
     Understand the differences between row-oriented formats (CSV, JSON, Avro) and columnar formats (Parquet, ORC). Learn when to use each format and why it matters for performance.
  2. Data Serialization & Schemas
     Learn how data is serialized for storage and transmission using formats like Avro, Protobuf, and JSON Schema. Understand schema evolution and why backwards compatibility matters.
  3. Data Lakes & Object Storage
     Understand how cloud object storage (S3, GCS, ADLS) serves as the foundation for data lakes. Learn about partitioning strategies, lifecycle policies, and organizing raw data at scale.
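The row-versus-columnar distinction above can be made concrete without Parquet itself. This sketch (with made-up weather records) writes the same data row-oriented as CSV, then transposes it into a column-per-field layout to show why an analytical query over one column need not touch the others:

```python
import csv
import io

records = [
    {"id": 1, "city": "Oslo", "temp_c": 3.5},
    {"id": 2, "city": "Lima", "temp_c": 19.0},
    {"id": 3, "city": "Pune", "temp_c": 28.2},
]

# Row-oriented layout: each line holds one complete record (CSV here).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "city", "temp_c"])
writer.writeheader()
writer.writerows(records)
row_oriented = buf.getvalue()

# Columnar layout: all values of one field stored contiguously, so a
# query over temp_c can skip the id and city columns entirely.
columnar = {key: [r[key] for r in records] for key in records[0]}
avg_temp = sum(columnar["temp_c"]) / len(columnar["temp_c"])
```

Formats like Parquet add compression, encoding, and metadata on top of this basic idea, which is why they dominate analytical workloads.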

Section 3 · Batch Processing

  1. ETL vs ELT Patterns
     Understand the two fundamental approaches to moving and transforming data. Learn when to transform before loading (ETL) versus after loading (ELT), and why ELT has become dominant in modern stacks.
  2. Apache Spark
     Learn how Spark distributes computation across a cluster to process massive datasets. Understand RDDs, DataFrames, transformations, actions, and how to write efficient Spark jobs.
  3. dbt (Data Build Tool)
     Learn how dbt brings software engineering practices — version control, testing, documentation — to SQL-based data transformations inside your warehouse.
  4. Hadoop & MapReduce
     Understand the original distributed processing paradigm that started the big data era. Learn about HDFS and MapReduce, and why modern tools have largely replaced them.
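The MapReduce paradigm mentioned above fits in a few lines of plain Python. This sketch runs the classic word-count example through the three phases a framework would normally distribute across machines:

```python
from collections import defaultdict
from itertools import chain

docs = ["big data big ideas", "data pipelines move data"]

# Map phase: emit (word, 1) pairs for every word in every document.
def mapper(doc):
    return [(word, 1) for word in doc.split()]

mapped = list(chain.from_iterable(mapper(d) for d in docs))

# Shuffle phase: group emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
```

Spark's `map` and `reduceByKey` follow the same shape; the framework's real value is partitioning, shuffling, and fault-tolerance across a cluster.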

Section 4 · Data Warehousing

  1. Data Warehouse Architecture
     Understand how data warehouses differ from operational databases. Learn about dimensional modeling, star and snowflake schemas, and why warehouses are optimized for analytical queries.
  2. Modern Cloud Warehouses
     Explore platforms like Snowflake, BigQuery, and Redshift. Understand how they separate storage from compute, handle concurrency, and enable scalable analytics without managing infrastructure.
  3. Open Table Formats
     Learn how Delta Lake, Apache Iceberg, and Hudi bring ACID transactions, time travel, and schema evolution to data lakes — bridging the gap between lakes and warehouses.
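The star schema from the first topic can be sketched with sqlite3: one fact table surrounded by dimension tables, joined on surrogate keys. Table and column names here (fact_sales, dim_product, dim_date) are hypothetical illustrations of the pattern:

```python
import sqlite3

# Tiny star schema: a central fact table referencing two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        revenue REAL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
    INSERT INTO fact_sales VALUES (1, 10, 100.0), (1, 11, 150.0), (2, 11, 80.0);
""")

# The typical analytical query: slice fact measures by dimension attributes.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
```

Cloud warehouses execute exactly this query shape, but over columnar storage and with compute scaled independently of the data.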

Section 5 · Stream Processing

  1. Message Queues & Event Streaming
     Understand how systems like Apache Kafka and Pub/Sub decouple data producers from consumers. Learn about topics, partitions, consumer groups, and delivery guarantees.
  2. Stream Processing Frameworks
     Learn how frameworks like Flink, Spark Streaming, and Kafka Streams process data continuously in real time. Understand windowing, watermarks, and exactly-once semantics.
  3. Change Data Capture (CDC)
     Learn how CDC tools like Debezium capture row-level changes from databases and stream them as events. Understand how this enables real-time data replication without heavy batch loads.
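Windowing, the core concept in the stream-processing topic above, can be illustrated without a framework. This sketch assigns simulated events (made-up timestamps and values) to 10-second tumbling windows and aggregates each window, the way Flink or Kafka Streams would:

```python
from collections import defaultdict

# Simulated event stream: (epoch_seconds, value) pairs.
events = [(0, 1), (3, 2), (9, 4), (10, 8), (14, 1), (25, 5)]

WINDOW = 10  # tumbling window size in seconds

# Assign each event to the window containing its timestamp and
# aggregate values per window.
windows = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW
    windows[window_start] += value

result = dict(sorted(windows.items()))
```

Real frameworks add what this sketch omits: watermarks to decide when a window is complete despite late or out-of-order events, and state checkpointing for exactly-once results.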

Section 6 · Orchestration & Pipeline Management

  1. Workflow Orchestration
     Understand why data pipelines need orchestrators to manage task dependencies, retries, and scheduling. Learn core concepts using tools like Apache Airflow or Dagster.
  2. Apache Airflow
     Learn how to define pipelines as code using Airflow's DAGs. Understand operators, sensors, XComs, and how to monitor and troubleshoot pipeline runs.
  3. Dagster & Prefect
     Explore modern alternatives to Airflow that offer asset-based orchestration, better local development, and native data awareness. Understand when to choose them over Airflow.
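Underneath every orchestrator is the same idea: a pipeline is a DAG of tasks, and tasks run in dependency order. This sketch models that with the standard library's graphlib (task names are hypothetical):

```python
from graphlib import TopologicalSorter

# A pipeline as a DAG: each task maps to the set of tasks it depends on,
# which is how an orchestrator models "transform runs after extract".
dag = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"quality_check"},
    "notify": {"load"},
}

# The orchestrator's core job: produce a valid execution order.
order = list(TopologicalSorter(dag).static_order())
```

Airflow's `task_a >> task_b` syntax and Dagster's asset dependencies both compile down to a graph like this; the tools then add scheduling, retries, and monitoring around it.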

Section 7 · Data Quality & Governance

  1. Data Quality & Testing
     Learn strategies for validating data at every stage of a pipeline — schema checks, freshness monitoring, anomaly detection, and tools like Great Expectations and dbt tests.
  2. Data Catalogs & Lineage
     Understand how metadata catalogs and lineage graphs help teams discover, trust, and debug data. Explore tools like DataHub, Amundsen, and OpenLineage.
  3. Data Governance & Compliance
     Learn the principles of data governance — access control, PII handling, retention policies, and regulations like GDPR. Understand how to build pipelines that respect privacy by design.
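The data-quality checks named in the first topic (not-null, uniqueness, freshness) reduce to simple predicates over rows. This hand-rolled sketch, with hypothetical check and column names, shows the shape that tools like Great Expectations and dbt tests formalize:

```python
from datetime import datetime, timedelta, timezone

rows = [
    {"user_id": 1, "email": "a@example.com", "loaded_at": datetime.now(timezone.utc)},
    {"user_id": 2, "email": "b@example.com", "loaded_at": datetime.now(timezone.utc)},
]

def check_not_null(rows, column):
    """Every row must have a non-null value in the column."""
    return all(r.get(column) is not None for r in rows)

def check_unique(rows, column):
    """No duplicate values in the column."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def check_freshness(rows, column, max_age):
    """The newest row must be no older than max_age."""
    newest = max(r[column] for r in rows)
    return datetime.now(timezone.utc) - newest <= max_age

results = {
    "user_id_not_null": check_not_null(rows, "user_id"),
    "user_id_unique": check_unique(rows, "user_id"),
    "fresh_within_1h": check_freshness(rows, "loaded_at", timedelta(hours=1)),
}
```

The dedicated tools add what matters in production: declarative configuration, result history, and alerting when a check fails.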

Section 8 · Infrastructure & Operations

  1. Containers & Docker
     Learn how containers package code and dependencies for consistent, reproducible deployments. Understand Dockerfiles, images, and how data tools are commonly containerized.
  2. Infrastructure as Code & CI/CD
     Learn how to version-control your infrastructure with tools like Terraform and automate pipeline deployments with CI/CD. Understand why infrastructure reproducibility is critical for data teams.
  3. Pipeline Monitoring & Observability
     Learn how to monitor data pipelines in production — tracking SLAs, setting up alerts for failures and data drift, and building dashboards that give your team confidence in data freshness.
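The SLA tracking and failure alerting described in the last topic boil down to rules evaluated over run metadata. This sketch uses a simulated run log with made-up pipeline names and a hypothetical 60-minute runtime SLA:

```python
# Simulated pipeline run log (hypothetical fields): name, runtime
# in minutes, and whether the run succeeded.
runs = [
    {"pipeline": "daily_sales", "runtime_min": 42, "ok": True},
    {"pipeline": "daily_sales", "runtime_min": 95, "ok": True},
    {"pipeline": "user_events", "runtime_min": 12, "ok": False},
]

SLA_MINUTES = 60

# Flag failed runs and runs that breached the runtime SLA — the kind
# of rule a monitoring system evaluates on every run.
alerts = []
for run in runs:
    if not run["ok"]:
        alerts.append((run["pipeline"], "run failed"))
    elif run["runtime_min"] > SLA_MINUTES:
        alerts.append((run["pipeline"], "SLA breached"))
```

In practice the run log comes from the orchestrator's metadata database, and alerts route to a pager or chat channel rather than a list.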