Build production data pipelines — from SQL fundamentals to streaming at scale.
Master the principles and tools of data engineering — from relational databases and SQL to distributed systems, streaming pipelines, and modern data platforms. You'll learn to design, build, and operate reliable data infrastructure at any scale.
§ SYLLABUS
- 01 Relational Databases & SQL
Understand how relational databases store and organize data using tables, keys, and constraints. Learn to write SQL queries to retrieve, filter, join, and aggregate data confidently.
- 02 Data Modeling & Schema Design
Learn how to design database schemas that accurately represent real-world relationships. Understand normalization, denormalization, and the trade-offs between them for different workloads.
- 03 NoSQL Databases
Explore document stores, key-value stores, column-family, and graph databases. Understand when a non-relational model is a better fit than a traditional relational one.
- 04 Python for Data Engineering
Learn core Python skills used daily in data engineering — file I/O, working with JSON/CSV, using libraries like pandas, and writing clean, testable scripts for data tasks.
- 01 Data File Formats
Understand the differences between row-oriented formats (CSV, JSON, Avro) and columnar formats (Parquet, ORC). Learn when to use each format and why it matters for performance.
- 02 Data Serialization & Schemas
Learn how data is serialized for storage and transmission using formats like Avro and Protobuf, and how schema languages such as JSON Schema describe and validate payloads. Understand schema evolution and why backward compatibility matters.
- 03 Data Lakes & Object Storage
Understand how cloud object storage (S3, GCS, ADLS) serves as the foundation for data lakes. Learn about partitioning strategies, lifecycle policies, and organizing raw data at scale.
- 01 ETL vs ELT Patterns
Understand the two fundamental approaches to moving and transforming data. Learn when to transform before loading (ETL) versus after loading (ELT), and why ELT has become dominant in modern stacks.
- 02 Apache Spark
Learn how Spark distributes computation across a cluster to process massive datasets. Understand RDDs, DataFrames, transformations, actions, and how to write efficient Spark jobs.
- 03 dbt (Data Build Tool)
Learn how dbt brings software engineering practices — version control, testing, documentation — to SQL-based data transformations inside your warehouse.
- 04 Hadoop & MapReduce
Understand the original distributed processing paradigm that started the big data era. Learn about HDFS and MapReduce, and why modern tools have largely replaced them.
- 01 Data Warehouse Architecture
Understand how data warehouses differ from operational databases. Learn about dimensional modeling, star and snowflake schemas, and why warehouses are optimized for analytical queries.
- 02 Modern Cloud Warehouses
Explore platforms like Snowflake, BigQuery, and Redshift. Understand how they separate storage from compute, handle concurrency, and enable scalable analytics without managing infrastructure.
- 03 Open Table Formats
Learn how Delta Lake, Apache Iceberg, and Hudi bring ACID transactions, time travel, and schema evolution to data lakes — bridging the gap between lakes and warehouses.
- 01 Message Queues & Event Streaming
Understand how systems like Apache Kafka and Pub/Sub decouple data producers from consumers. Learn about topics, partitions, consumer groups, and delivery guarantees.
- 02 Stream Processing Frameworks
Learn how frameworks like Flink, Spark Streaming, and Kafka Streams process data continuously in real time. Understand windowing, watermarks, and exactly-once semantics.
- 03 Change Data Capture (CDC)
Learn how CDC tools like Debezium capture row-level changes from databases and stream them as events. Understand how this enables real-time data replication without heavy batch loads.
- 01 Workflow Orchestration
Understand why data pipelines need orchestrators to manage task dependencies, retries, and scheduling. Learn core concepts using tools like Apache Airflow or Dagster.
- 02 Apache Airflow
Learn how to define pipelines as code using Airflow's DAGs. Understand operators, sensors, XComs, and how to monitor and troubleshoot pipeline runs.
- 03 Dagster & Prefect
Explore modern alternatives to Airflow that offer asset-based orchestration, better local development, and native data awareness. Understand when to choose them over Airflow.
- 01 Data Quality & Testing
Learn strategies for validating data at every stage of a pipeline — schema checks, freshness monitoring, anomaly detection, and tools like Great Expectations and dbt tests.
- 02 Data Catalogs & Lineage
Understand how metadata catalogs and lineage graphs help teams discover, trust, and debug data. Explore tools like DataHub, Amundsen, and OpenLineage.
- 03 Data Governance & Compliance
Learn the principles of data governance — access control, PII handling, retention policies, and regulations like GDPR. Understand how to build pipelines that respect privacy by design.
- 01 Containers & Docker
Learn how containers package code and dependencies for consistent, reproducible deployments. Understand Dockerfiles, images, and how data tools are commonly containerized.
- 02 Infrastructure as Code & CI/CD
Learn how to version-control your infrastructure with tools like Terraform and automate pipeline deployments with CI/CD. Understand why infrastructure reproducibility is critical for data teams.
- 03 Pipeline Monitoring & Observability
Learn how to monitor data pipelines in production — tracking SLAs, setting up alerts for failures and data drift, and building dashboards that give your team confidence in data freshness.
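
The freshness monitoring described in the last unit can be sketched in a few lines of Python. This is a minimal illustration rather than course material: the function name `is_stale`, the two-hour threshold, and the timestamps are all hypothetical, and a real pipeline would likely read the last load time from its warehouse or orchestrator instead of hard-coding it.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_loaded_at, max_age, now=None):
    """Return True when the most recent load is older than the allowed age."""
    # Allow the caller to inject `now` so the check is easy to test.
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at > max_age


# Hypothetical values for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)

# The table was last loaded 2.5 hours ago; the assumed SLA allows 2 hours.
stale = is_stale(last_load, timedelta(hours=2), now=now)
print("ALERT: data is stale" if stale else "OK: data is fresh")
```

In practice a check like this would run on a schedule and route the alert to whatever paging or chat tool the team already uses.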