Intermediate · 26 topics · ~80 hours
Data Engineering
Build production data pipelines — from SQL fundamentals to streaming at scale.
Master the principles and tools of data engineering — from relational databases and SQL to distributed systems, streaming pipelines, and modern data platforms. You'll learn to design, build, and operate reliable data infrastructure at any scale.
What you'll learn
Section 1 · Data Foundations
- 01 · Relational Databases & SQL: Understand how relational databases store and organize data using tables, keys, and constraints. Learn to write SQL queries to retrieve, filter, join, and aggregate data confidently.
- 02 · Data Modeling & Schema Design: Learn how to design database schemas that accurately represent real-world relationships. Understand normalization, denormalization, and the trade-offs between them for different workloads.
- 03 · NoSQL Databases: Explore document stores, key-value stores, column-family, and graph databases. Understand when a non-relational model is a better fit than a traditional relational one.
- 04 · Python for Data Engineering: Learn core Python skills used daily in data engineering — file I/O, working with JSON/CSV, using libraries like pandas, and writing clean, testable scripts for data tasks.
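To get a taste of the SQL skills this section covers, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables are hypothetical examples, not course material:

```python
import sqlite3

# In-memory database with two small example tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 2, 75.0);
""")

# A join plus an aggregation: total order amount per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()

for name, n_orders, total in rows:
    print(f"{name}: {n_orders} orders, {total:.2f} total")
```

The same join/group-by pattern carries over almost unchanged to warehouse SQL dialects later in the course.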
Section 2 · Storage & File Formats
- 01 · Data File Formats: Understand the differences between row-oriented formats (CSV, JSON, Avro) and columnar formats (Parquet, ORC). Learn when to use each format and why it matters for performance.
- 02 · Data Serialization & Schemas: Learn how data is serialized for storage and transmission using formats like Avro, Protobuf, and JSON Schema. Understand schema evolution and why backwards compatibility matters.
- 03 · Data Lakes & Object Storage: Understand how cloud object storage (S3, GCS, ADLS) serves as the foundation for data lakes. Learn about partitioning strategies, lifecycle policies, and organizing raw data at scale.
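One partitioning strategy covered here, hive-style key=value paths, can be sketched in a few lines. The bucket name and partition keys below are hypothetical:

```python
from datetime import date

def partition_path(prefix: str, event_date: date, region: str) -> str:
    """Build a hive-style partition path of the kind used to lay out
    raw data in object storage. Prefix and keys are made-up examples."""
    return (
        f"{prefix}/year={event_date.year}"
        f"/month={event_date.month:02d}"
        f"/day={event_date.day:02d}"
        f"/region={region}/"
    )

path = partition_path("s3://my-lake/events", date(2024, 3, 7), "eu")
print(path)  # s3://my-lake/events/year=2024/month=03/day=07/region=eu/
```

Laying files out this way lets query engines prune whole partitions by date or region instead of scanning the full lake.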
Section 3 · Batch Processing
- 01 · ETL vs ELT Patterns: Understand the two fundamental approaches to moving and transforming data. Learn when to transform before loading (ETL) versus after loading (ELT), and why ELT has become dominant in modern stacks.
- 02 · Apache Spark: Learn how Spark distributes computation across a cluster to process massive datasets. Understand RDDs, DataFrames, transformations, actions, and how to write efficient Spark jobs.
- 03 · dbt (Data Build Tool): Learn how dbt brings software engineering practices — version control, testing, documentation — to SQL-based data transformations inside your warehouse.
- 04 · Hadoop & MapReduce: Understand the original distributed processing paradigm that started the big data era. Learn about HDFS and MapReduce, and why modern tools have largely replaced them.
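The MapReduce paradigm in this section boils down to three phases, which a single-process word count can sketch (real frameworks run the same phases across many machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big pipelines", "data pipelines at scale"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(l) for l in lines)))
print(counts)  # e.g. {'big': 2, 'data': 2, 'pipelines': 2, 'at': 1, 'scale': 1}
```

Spark's transformations and actions generalize exactly this map/shuffle/reduce structure, which is why the two topics sit in the same section.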
Section 4 · Data Warehousing
- 01 · Data Warehouse Architecture: Understand how data warehouses differ from operational databases. Learn about dimensional modeling, star and snowflake schemas, and why warehouses are optimized for analytical queries.
- 02 · Modern Cloud Warehouses: Explore platforms like Snowflake, BigQuery, and Redshift. Understand how they separate storage from compute, handle concurrency, and enable scalable analytics without managing infrastructure.
- 03 · Open Table Formats: Learn how Delta Lake, Apache Iceberg, and Hudi bring ACID transactions, time travel, and schema evolution to data lakes — bridging the gap between lakes and warehouses.
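The star schema idea from this section fits in a miniature example: one fact table surrounded by dimension tables, queried with joins and aggregation. The table and column names here are hypothetical, and sqlite3 stands in for a real warehouse:

```python
import sqlite3

# A miniature star schema: one fact table, two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        revenue REAL
    );
    INSERT INTO dim_date VALUES (20240101, 'Jan'), (20240201, 'Feb');
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES
        (20240101, 1, 100.0), (20240101, 2, 40.0), (20240201, 1, 60.0);
""")

# A typical analytical query: revenue by month and category.
rows = conn.execute("""
    SELECT d.month, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.month, p.category
    ORDER BY 3 DESC
""").fetchall()
print(rows)
```

Warehouses are optimized so that exactly this shape of query — wide fact scan, small dimension joins, heavy aggregation — runs fast.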
Section 5 · Stream Processing
- 01 · Message Queues & Event Streaming: Understand how systems like Apache Kafka and Pub/Sub decouple data producers from consumers. Learn about topics, partitions, consumer groups, and delivery guarantees.
- 02 · Stream Processing Frameworks: Learn how frameworks like Flink, Spark Streaming, and Kafka Streams process data continuously in real time. Understand windowing, watermarks, and exactly-once semantics.
- 03 · Change Data Capture (CDC): Learn how CDC tools like Debezium capture row-level changes from databases and stream them as events. Understand how this enables real-time data replication without heavy batch loads.
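Windowing, one of the core ideas in this section, can be sketched without any framework. This toy function assigns events to fixed-size tumbling windows; the event data is hypothetical, and real engines like Flink add distribution, state, and watermarks on top of the same idea:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Assign each (timestamp, key) event to a fixed-size tumbling
    window and count events per (window_start, key) pair."""
    counts = defaultdict(int)
    for ts, key in events:
        # Round the timestamp down to the start of its window.
        window_start = ts - (ts % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(5, "click"), (30, "click"), (65, "click"), (70, "view")]
print(tumbling_window_counts(events))
# {(0, 'click'): 2, (60, 'click'): 1, (60, 'view'): 1}
```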
Section 6 · Orchestration & Pipeline Management
- 01 · Workflow Orchestration: Understand why data pipelines need orchestrators to manage task dependencies, retries, and scheduling. Learn core concepts using tools like Apache Airflow or Dagster.
- 02 · Apache Airflow: Learn how to define pipelines as code using Airflow's DAGs. Understand operators, sensors, XComs, and how to monitor and troubleshoot pipeline runs.
- 03 · Dagster & Prefect: Explore modern alternatives to Airflow that offer asset-based orchestration, better local development, and native data awareness. Understand when to choose them over Airflow.
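Two of an orchestrator's core jobs — ordering tasks by their dependencies and retrying failures — can be sketched with the standard library. The task names and retry count below are hypothetical; Airflow and Dagster layer scheduling, state, and a UI on top of this skeleton:

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

def run_dag(tasks, deps, retries=2):
    """Run callables in dependency order, retrying each on failure.
    `deps` maps a task name to the set of tasks it depends on."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        for attempt in range(retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == retries:
                    raise  # out of retries: surface the failure
    return order

tasks = {"extract": lambda: None, "transform": lambda: None, "load": lambda: None}
deps = {"transform": {"extract"}, "load": {"transform"}}
print(run_dag(tasks, deps))  # ['extract', 'transform', 'load']
```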
Section 7 · Data Quality & Governance
- 01 · Data Quality & Testing: Learn strategies for validating data at every stage of a pipeline — schema checks, freshness monitoring, anomaly detection, and tools like Great Expectations and dbt tests.
- 02 · Data Catalogs & Lineage: Understand how metadata catalogs and lineage graphs help teams discover, trust, and debug data. Explore tools like DataHub, Amundsen, and OpenLineage.
- 03 · Data Governance & Compliance: Learn the principles of data governance — access control, PII handling, retention policies, and regulations like GDPR. Understand how to build pipelines that respect privacy by design.
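The schema and null-rate checks mentioned above can be hand-rolled in a few lines, which is a useful way to understand what tools like Great Expectations formalize. The column names and threshold here are hypothetical:

```python
def check_batch(rows, required_columns, max_null_fraction=0.1):
    """Run minimal schema and null-rate checks over a batch of dicts.
    Returns a list of human-readable failure messages (empty = pass)."""
    failures = []
    for col in required_columns:
        missing = [r for r in rows if col not in r]
        if missing:
            failures.append(f"{col}: column absent in {len(missing)} rows")
            continue
        nulls = sum(1 for r in rows if r[col] is None)
        if rows and nulls / len(rows) > max_null_fraction:
            failures.append(f"{col}: {nulls}/{len(rows)} nulls")
    return failures

rows = [{"id": 1, "email": None}, {"id": 2, "email": None}, {"id": 3, "email": "a@b.c"}]
print(check_batch(rows, ["id", "email"]))  # ['email: 2/3 nulls']
```

Running a check like this before loading keeps bad batches out of the warehouse instead of debugging them downstream.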
Section 8 · Infrastructure & Operations
- 01 · Containers & Docker: Learn how containers package code and dependencies for consistent, reproducible deployments. Understand Dockerfiles, images, and how data tools are commonly containerized.
- 02 · Infrastructure as Code & CI/CD: Learn how to version-control your infrastructure with tools like Terraform and automate pipeline deployments with CI/CD. Understand why infrastructure reproducibility is critical for data teams.
- 03 · Pipeline Monitoring & Observability: Learn how to monitor data pipelines in production — tracking SLAs, setting up alerts for failures and data drift, and building dashboards that give your team confidence in data freshness.
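The freshness-SLA tracking described above reduces to a simple comparison that a monitor can run on a schedule. The table names and SLA values in this sketch are hypothetical:

```python
from datetime import datetime, timedelta, timezone

def freshness_alerts(last_loaded, sla, now=None):
    """Return the tables whose latest load is older than their SLA.
    `last_loaded` maps table -> last load time; `sla` maps table -> max age."""
    now = now or datetime.now(timezone.utc)
    return [table for table, ts in last_loaded.items() if now - ts > sla[table]]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_loaded = {
    "orders": now - timedelta(hours=2),
    "customers": now - timedelta(minutes=30),
}
sla = {"orders": timedelta(hours=1), "customers": timedelta(hours=1)}
print(freshness_alerts(last_loaded, sla, now=now))  # ['orders']
```

In production the same comparison would feed an alerting channel or dashboard rather than a print statement.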