Intermediate · 26 topics · ~80 hours

Data Engineering

Build production data pipelines — from SQL fundamentals to streaming at scale.

Master the principles and tools of data engineering — from relational databases and SQL to distributed systems, streaming pipelines, and modern data platforms. You'll learn to design, build, and operate reliable data infrastructure at any scale.


What you'll learn

Section 1 · Data Foundations

  1. Relational Databases & SQL
     Understand how relational databases store and organize data using tables, keys, and constraints. Learn to write SQL queries to retrieve, filter, join, and aggregate data confidently.
  2. Data Modeling & Schema Design
     Learn how to design database schemas that accurately represent real-world relationships. Understand normalization, denormalization, and the trade-offs between them for different workloads.
  3. NoSQL Databases
     Explore document stores, key-value stores, column-family, and graph databases. Understand when a non-relational model is a better fit than a traditional relational one.
  4. Python for Data Engineering
     Learn core Python skills used daily in data engineering — file I/O, working with JSON/CSV, using libraries like pandas, and writing clean, testable scripts for data tasks.
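The join-and-aggregate pattern from the first topic can be tried directly in Python's built-in sqlite3 module. This is a minimal sketch with a hypothetical two-table schema (customers and orders are illustration names, not part of any course material):

```python
import sqlite3

# In-memory database with two related tables (hypothetical schema):
# customers and orders, linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 30.0), (3, 2, 20.0);
""")

# Join the tables and aggregate order totals per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()

for name, n_orders, total in rows:
    print(f"{name}: {n_orders} orders, {total:.2f} total")
```

The same `JOIN` / `GROUP BY` shape carries over unchanged to production databases like PostgreSQL.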

Section 2 · Storage & File Formats

  1. Data File Formats
     Understand the differences between row-oriented formats (CSV, JSON, Avro) and columnar formats (Parquet, ORC). Learn when to use each format and why it matters for performance.
  2. Data Serialization & Schemas
     Learn how data is serialized for storage and transmission using formats like Avro, Protobuf, and JSON Schema. Understand schema evolution and why backwards compatibility matters.
  3. Data Lakes & Object Storage
     Understand how cloud object storage (S3, GCS, ADLS) serves as the foundation for data lakes. Learn about partitioning strategies, lifecycle policies, and organizing raw data at scale.
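The row-versus-columnar distinction above can be made concrete without Parquet itself. This sketch (with made-up weather records) writes the same data row-oriented as CSV, then transposes it into a column-per-field layout to show why an analytical query over one column need not touch the others:

```python
import csv
import io

records = [
    {"id": 1, "city": "Oslo", "temp_c": 3.5},
    {"id": 2, "city": "Lima", "temp_c": 19.0},
    {"id": 3, "city": "Pune", "temp_c": 28.2},
]

# Row-oriented layout: each line holds one complete record (CSV here).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "city", "temp_c"])
writer.writeheader()
writer.writerows(records)
row_oriented = buf.getvalue()

# Columnar layout: all values of one field stored contiguously, so a
# query over temp_c can skip the id and city columns entirely.
columnar = {key: [r[key] for r in records] for key in records[0]}
avg_temp = sum(columnar["temp_c"]) / len(columnar["temp_c"])
```

Formats like Parquet add compression, encoding, and metadata on top of this basic idea, which is why they dominate analytical workloads.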

Section 3 · Batch Processing

  1. ETL vs ELT Patterns
     Understand the two fundamental approaches to moving and transforming data. Learn when to transform before loading (ETL) versus after loading (ELT), and why ELT has become dominant in modern stacks.
  2. Apache Spark
     Learn how Spark distributes computation across a cluster to process massive datasets. Understand RDDs, DataFrames, transformations, actions, and how to write efficient Spark jobs.
  3. dbt (Data Build Tool)
     Learn how dbt brings software engineering practices — version control, testing, documentation — to SQL-based data transformations inside your warehouse.
  4. Hadoop & MapReduce
     Understand the original distributed processing paradigm that started the big data era. Learn about HDFS and MapReduce, and why modern tools have largely replaced them.
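The MapReduce paradigm mentioned above fits in a few lines of plain Python. This sketch runs the classic word-count example through the three phases a framework would normally distribute across machines:

```python
from collections import defaultdict
from itertools import chain

docs = ["big data big ideas", "data pipelines move data"]

# Map phase: emit (word, 1) pairs for every word in every document.
def mapper(doc):
    return [(word, 1) for word in doc.split()]

mapped = list(chain.from_iterable(mapper(d) for d in docs))

# Shuffle phase: group emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
```

Spark's `map` and `reduceByKey` follow the same shape; the framework's real value is partitioning, shuffling, and fault-tolerance across a cluster.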

Section 4 · Data Warehousing

  1. Data Warehouse Architecture
     Understand how data warehouses differ from operational databases. Learn about dimensional modeling, star and snowflake schemas, and why warehouses are optimized for analytical queries.
  2. Modern Cloud Warehouses
     Explore platforms like Snowflake, BigQuery, and Redshift. Understand how they separate storage from compute, handle concurrency, and enable scalable analytics without managing infrastructure.
  3. Open Table Formats
     Learn how Delta Lake, Apache Iceberg, and Hudi bring ACID transactions, time travel, and schema evolution to data lakes — bridging the gap between lakes and warehouses.
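The star schema from the first topic can be sketched with sqlite3: one fact table surrounded by dimension tables, joined on surrogate keys. Table and column names here (fact_sales, dim_product, dim_date) are hypothetical illustrations of the pattern:

```python
import sqlite3

# Tiny star schema: a central fact table referencing two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        revenue REAL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
    INSERT INTO fact_sales VALUES (1, 10, 100.0), (1, 11, 150.0), (2, 11, 80.0);
""")

# The typical analytical query: slice fact measures by dimension attributes.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
```

Cloud warehouses execute exactly this query shape, but over columnar storage and with compute scaled independently of the data.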

Section 5 · Stream Processing

  1. Message Queues & Event Streaming
     Understand how systems like Apache Kafka and Pub/Sub decouple data producers from consumers. Learn about topics, partitions, consumer groups, and delivery guarantees.
  2. Stream Processing Frameworks
     Learn how frameworks like Flink, Spark Streaming, and Kafka Streams process data continuously in real time. Understand windowing, watermarks, and exactly-once semantics.
  3. Change Data Capture (CDC)
     Learn how CDC tools like Debezium capture row-level changes from databases and stream them as events. Understand how this enables real-time data replication without heavy batch loads.
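Windowing, the core concept in the stream-processing topic above, can be illustrated without a framework. This sketch assigns simulated events (made-up timestamps and values) to 10-second tumbling windows and aggregates each window, the way Flink or Kafka Streams would:

```python
from collections import defaultdict

# Simulated event stream: (epoch_seconds, value) pairs.
events = [(0, 1), (3, 2), (9, 4), (10, 8), (14, 1), (25, 5)]

WINDOW = 10  # tumbling window size in seconds

# Assign each event to the window containing its timestamp and
# aggregate values per window.
windows = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW
    windows[window_start] += value

result = dict(sorted(windows.items()))
```

Real frameworks add what this sketch omits: watermarks to decide when a window is complete despite late or out-of-order events, and state checkpointing for exactly-once results.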

Section 6 · Orchestration & Pipeline Management

  1. Workflow Orchestration
     Understand why data pipelines need orchestrators to manage task dependencies, retries, and scheduling. Learn core concepts using tools like Apache Airflow or Dagster.
  2. Apache Airflow
     Learn how to define pipelines as code using Airflow's DAGs. Understand operators, sensors, XComs, and how to monitor and troubleshoot pipeline runs.
  3. Dagster & Prefect
     Explore modern alternatives to Airflow that offer asset-based orchestration, better local development, and native data awareness. Understand when to choose them over Airflow.
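Underneath every orchestrator is the same idea: a pipeline is a DAG of tasks, and tasks run in dependency order. This sketch models that with the standard library's graphlib (task names are hypothetical):

```python
from graphlib import TopologicalSorter

# A pipeline as a DAG: each task maps to the set of tasks it depends on,
# which is how an orchestrator models "transform runs after extract".
dag = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"quality_check"},
    "notify": {"load"},
}

# The orchestrator's core job: produce a valid execution order.
order = list(TopologicalSorter(dag).static_order())
```

Airflow's `task_a >> task_b` syntax and Dagster's asset dependencies both compile down to a graph like this; the tools then add scheduling, retries, and monitoring around it.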

Section 7 · Data Quality & Governance

  1. Data Quality & Testing
     Learn strategies for validating data at every stage of a pipeline — schema checks, freshness monitoring, anomaly detection, and tools like Great Expectations and dbt tests.
  2. Data Catalogs & Lineage
     Understand how metadata catalogs and lineage graphs help teams discover, trust, and debug data. Explore tools like DataHub, Amundsen, and OpenLineage.
  3. Data Governance & Compliance
     Learn the principles of data governance — access control, PII handling, retention policies, and regulations like GDPR. Understand how to build pipelines that respect privacy by design.
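The data-quality checks named in the first topic (not-null, uniqueness, freshness) reduce to simple predicates over rows. This hand-rolled sketch, with hypothetical check and column names, shows the shape that tools like Great Expectations and dbt tests formalize:

```python
from datetime import datetime, timedelta, timezone

rows = [
    {"user_id": 1, "email": "a@example.com", "loaded_at": datetime.now(timezone.utc)},
    {"user_id": 2, "email": "b@example.com", "loaded_at": datetime.now(timezone.utc)},
]

def check_not_null(rows, column):
    """Every row must have a non-null value in the column."""
    return all(r.get(column) is not None for r in rows)

def check_unique(rows, column):
    """No duplicate values in the column."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def check_freshness(rows, column, max_age):
    """The newest row must be no older than max_age."""
    newest = max(r[column] for r in rows)
    return datetime.now(timezone.utc) - newest <= max_age

results = {
    "user_id_not_null": check_not_null(rows, "user_id"),
    "user_id_unique": check_unique(rows, "user_id"),
    "fresh_within_1h": check_freshness(rows, "loaded_at", timedelta(hours=1)),
}
```

The dedicated tools add what matters in production: declarative configuration, result history, and alerting when a check fails.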

Section 8 · Infrastructure & Operations

  1. Containers & Docker
     Learn how containers package code and dependencies for consistent, reproducible deployments. Understand Dockerfiles, images, and how data tools are commonly containerized.
  2. Infrastructure as Code & CI/CD
     Learn how to version-control your infrastructure with tools like Terraform and automate pipeline deployments with CI/CD. Understand why infrastructure reproducibility is critical for data teams.
  3. Pipeline Monitoring & Observability
     Learn how to monitor data pipelines in production — tracking SLAs, setting up alerts for failures and data drift, and building dashboards that give your team confidence in data freshness.
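The SLA tracking and failure alerting described in the last topic boil down to rules evaluated over run metadata. This sketch uses a simulated run log with made-up pipeline names and a hypothetical 60-minute runtime SLA:

```python
# Simulated pipeline run log (hypothetical fields): name, runtime
# in minutes, and whether the run succeeded.
runs = [
    {"pipeline": "daily_sales", "runtime_min": 42, "ok": True},
    {"pipeline": "daily_sales", "runtime_min": 95, "ok": True},
    {"pipeline": "user_events", "runtime_min": 12, "ok": False},
]

SLA_MINUTES = 60

# Flag failed runs and runs that breached the runtime SLA — the kind
# of rule a monitoring system evaluates on every run.
alerts = []
for run in runs:
    if not run["ok"]:
        alerts.append((run["pipeline"], "run failed"))
    elif run["runtime_min"] > SLA_MINUTES:
        alerts.append((run["pipeline"], "SLA breached"))
```

In practice the run log comes from the orchestrator's metadata database, and alerts route to a pager or chat channel rather than a list.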