Data Engineering

26 TOPICS · 80 HOURS · ADVANCED

Build production data pipelines — from SQL fundamentals to streaming at scale.

Master the principles and tools of data engineering — from relational databases and SQL to distributed systems, streaming pipelines, and modern data platforms. You'll learn to design, build, and operate reliable data infrastructure at any scale.


§ SYLLABUS

§ SECTION 01 · DATA FOUNDATIONS
  1. Relational Databases & SQL

    Understand how relational databases store and organize data using tables, keys, and constraints. Learn to write SQL queries to retrieve, filter, join, and aggregate data confidently.

  2. Data Modeling & Schema Design

    Learn how to design database schemas that accurately represent real-world relationships. Understand normalization, denormalization, and the trade-offs between them for different workloads.

  3. NoSQL Databases

    Explore document stores, key-value stores, column-family, and graph databases. Understand when a non-relational model is a better fit than a traditional relational one.

  4. Python for Data Engineering

    Learn core Python skills used daily in data engineering — file I/O, working with JSON/CSV, using libraries like pandas, and writing clean, testable scripts for data tasks.

§ SECTION 02 · STORAGE & FILE FORMATS
  1. Data File Formats

    Understand the differences between row-oriented formats (CSV, JSON, Avro) and columnar formats (Parquet, ORC). Learn when to use each format and why it matters for performance.

  2. Data Serialization & Schemas

    Learn how data is serialized for storage and transmission using formats like Avro and Protobuf, and how schema languages such as JSON Schema describe and validate JSON data. Understand schema evolution and why backward compatibility matters.

  3. Data Lakes & Object Storage

    Understand how cloud object storage (S3, GCS, ADLS) serves as the foundation for data lakes. Learn about partitioning strategies, lifecycle policies, and organizing raw data at scale.

§ SECTION 03 · BATCH PROCESSING
  1. ETL vs ELT Patterns

    Understand the two fundamental approaches to moving and transforming data. Learn when to transform before loading (ETL) versus after loading (ELT), and why ELT has become dominant in modern stacks.

  2. Apache Spark

    Learn how Spark distributes computation across a cluster to process massive datasets. Understand RDDs, DataFrames, transformations, actions, and how to write efficient Spark jobs.

  3. dbt (Data Build Tool)

    Learn how dbt brings software engineering practices — version control, testing, documentation — to SQL-based data transformations inside your warehouse.

  4. Hadoop & MapReduce

    Understand the original distributed processing paradigm that started the big data era. Learn about HDFS and MapReduce, and why modern tools have largely replaced them.

§ SECTION 04 · DATA WAREHOUSING
  1. Data Warehouse Architecture

    Understand how data warehouses differ from operational databases. Learn about dimensional modeling, star and snowflake schemas, and why warehouses are optimized for analytical queries.

  2. Modern Cloud Warehouses

    Explore platforms like Snowflake, BigQuery, and Redshift. Understand how they separate storage from compute, handle concurrency, and enable scalable analytics without managing infrastructure.

  3. Open Table Formats

    Learn how Delta Lake, Apache Iceberg, and Hudi bring ACID transactions, time travel, and schema evolution to data lakes — bridging the gap between lakes and warehouses.

§ SECTION 05 · STREAM PROCESSING
  1. Message Queues & Event Streaming

    Understand how systems like Apache Kafka and Google Cloud Pub/Sub decouple data producers from consumers. Learn about topics, partitions, consumer groups, and delivery guarantees.

  2. Stream Processing Frameworks

    Learn how frameworks like Apache Flink, Spark Structured Streaming, and Kafka Streams process data continuously in real time. Understand windowing, watermarks, and exactly-once semantics.

  3. Change Data Capture (CDC)

    Learn how CDC tools like Debezium capture row-level changes from databases and stream them as events. Understand how this enables real-time data replication without heavy batch loads.

§ SECTION 06 · ORCHESTRATION & PIPELINE MANAGEMENT
  1. Workflow Orchestration

    Understand why data pipelines need orchestrators to manage task dependencies, retries, and scheduling. Learn core concepts using tools like Apache Airflow or Dagster.

  2. Apache Airflow

    Learn how to define pipelines as code using Airflow's DAGs. Understand operators, sensors, XComs, and how to monitor and troubleshoot pipeline runs.

  3. Dagster & Prefect

    Explore modern alternatives to Airflow that offer asset-based orchestration, better local development, and native data awareness. Understand when to choose them over Airflow.

§ SECTION 07 · DATA QUALITY & GOVERNANCE
  1. Data Quality & Testing

    Learn strategies for validating data at every stage of a pipeline — schema checks, freshness monitoring, anomaly detection, and tools like Great Expectations and dbt tests.

  2. Data Catalogs & Lineage

    Understand how metadata catalogs and lineage graphs help teams discover, trust, and debug data. Explore tools like DataHub, Amundsen, and OpenLineage.

  3. Data Governance & Compliance

    Learn the principles of data governance — access control, PII handling, retention policies, and regulations like GDPR. Understand how to build pipelines that respect privacy by design.

§ SECTION 08 · INFRASTRUCTURE & OPERATIONS
  1. Containers & Docker

    Learn how containers package code and dependencies for consistent, reproducible deployments. Understand Dockerfiles, images, and how data tools are commonly containerized.

  2. Infrastructure as Code & CI/CD

    Learn how to version-control your infrastructure with tools like Terraform and automate pipeline deployments with CI/CD. Understand why infrastructure reproducibility is critical for data teams.

  3. Pipeline Monitoring & Observability

    Learn how to monitor data pipelines in production — tracking SLAs, setting up alerts for failures and data drift, and building dashboards that give your team confidence in data freshness.