Data Lakes & Lakehouses

27 TOPICS · 55 HOURS · INTERMEDIATE · SCALE 1:4

A comprehensive learning path covering data lake architectures, lakehouse patterns, and the modern open table format ecosystem. You'll progress from foundational storage concepts through Delta Lake, Apache Iceberg, and Hudi, to production-grade lakehouse design with governance, performance tuning, and real-time ingestion.


§ SYLLABUS

§ SECTION 01 · STORAGE & DATA FOUNDATIONS
  1. 01
    Distributed Storage Fundamentals

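    Understand how distributed file systems like HDFS and cloud object stores (S3, ADLS, GCS) work: block vs. object storage, consistency guarantees (eventual consistency in older object stores versus the strong read-after-write consistency S3 now provides), and how data is physically organized.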

  2. 02
    Columnar File Formats

    Learn why columnar formats like Parquet and ORC dominate analytical workloads — row groups, column chunks, encoding schemes, and how predicate pushdown works at the file level.

  3. 03
    Data Partitioning Strategies

    Master how partitioning by date, region, or other keys reduces scan volume, and understand the trade-offs between too many small files and too few large ones.

  4. 04
    Data Warehouse vs. Data Lake

    Compare the traditional data warehouse model (schema-on-write, structured) with the data lake approach (schema-on-read, multi-format) and understand where each excels.
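
As a small illustration of the partitioning topic in this section, the sketch below shows how Hive-style `key=value` directory names let a reader discard files without opening them. The file listing, the `partition_values` helper, and the `prune` helper are hypothetical names invented for this example; real engines do the same pruning through the catalog or the table format's metadata.

```python
# Hypothetical file listing using Hive-style partition paths (key=value directories).
FILES = [
    "sales/dt=2024-01-01/region=eu/part-0.parquet",
    "sales/dt=2024-01-01/region=us/part-0.parquet",
    "sales/dt=2024-01-02/region=eu/part-0.parquet",
    "sales/dt=2024-01-03/region=us/part-0.parquet",
]


def partition_values(path):
    """Extract key=value pairs from the directory components of a path."""
    directories = path.split("/")[:-1]
    return dict(part.split("=", 1) for part in directories if "=" in part)


def prune(files, **wanted):
    """Keep only files whose partition values match every requested key."""
    return [
        f for f in files
        if all(partition_values(f).get(key) == value for key, value in wanted.items())
    ]


# Only one of the four files has to be scanned for this filter.
selected = prune(FILES, dt="2024-01-01", region="eu")
print(selected)
```

The same idea explains the small-file trade-off: every extra partition key multiplies the number of directories, so very fine-grained keys prune well but spread the data over many tiny files.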

§ SECTION 02 · DATA LAKE ARCHITECTURE
  1. 01
    Lake Zones: Raw, Curated & Consumption

    Learn the multi-zone pattern (landing raw data, cleaning it into curated layers, and serving aggregated datasets for consumption) and why this layering keeps untrusted raw data separate from the datasets consumers rely on.

  2. 02
    Batch & Streaming Ingestion

    Understand the key ingestion modes (scheduled batch loads, CDC streams, and micro-batches) and when to use each for landing data into a lake.

  3. 03
    Schema Evolution & Enforcement

    Learn how schemas drift over time in a lake, the problems this creates, and the mechanisms (schema registries, merge rules) that keep data compatible.

  4. 04
    Metadata & Data Catalogs

    Understand how tools like AWS Glue Catalog, Hive Metastore, and Unity Catalog track table locations, schemas, and statistics so engines can discover and query lake data.

  5. 05
    Data Swamp Anti-Patterns

    Identify the common failures that turn a data lake into an unusable swamp — missing metadata, no ownership, unbounded schema drift, and lack of quality checks.
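
To make the schema evolution topic in this section more concrete, here is a minimal sketch of a merge rule: new columns are accepted, a small set of type widenings is accepted, and anything else is rejected. The `WIDENING` table and the `merge_schema` function are assumptions made for this example; each table format and schema registry defines its own compatibility rules.

```python
# Type widenings allowed in this sketch (an assumption, not any format's official list).
WIDENING = {("int", "long"), ("float", "double")}


def merge_schema(current, incoming):
    """Merge two {column: type} schemas, rejecting incompatible type changes."""
    merged = dict(current)
    for name, new_type in incoming.items():
        old_type = merged.get(name)
        if old_type is None or old_type == new_type:
            merged[name] = new_type          # new column, or unchanged type
        elif (old_type, new_type) in WIDENING:
            merged[name] = new_type          # widen the table's column type
        elif (new_type, old_type) in WIDENING:
            pass                             # incoming is narrower; keep the wider type
        else:
            raise ValueError(
                f"incompatible change for {name!r}: {old_type} -> {new_type}"
            )
    return merged


table = {"id": "int", "amount": "double"}
batch = {"id": "long", "email": "string"}
evolved = merge_schema(table, batch)
print(evolved)
```

Rejecting the incompatible case at write time, instead of discovering it at read time, is the kind of enforcement that guards against the unbounded schema drift listed among the swamp anti-patterns.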

§ SECTION 03 · OPEN TABLE FORMATS
  1. 01
    What Open Table Formats Solve

    Understand the core problem: providing ACID transactions, time travel, and efficient upserts on top of immutable object storage, and why plain Parquet directories aren't enough.

  2. 02
    Delta Lake Deep Dive

    Learn Delta Lake's transaction log, optimistic concurrency, Z-ordering, VACUUM, and MERGE semantics. Understand its tight Spark integration and its place in the Databricks ecosystem.

  3. 03
    Apache Iceberg Deep Dive

    Explore Iceberg's snapshot-based metadata tree, hidden partitioning, partition evolution, and multi-engine compatibility across Spark, Trino, Flink, and more.

  4. 04
    Apache Hudi Overview

    Understand Hudi's copy-on-write vs. merge-on-read table types, its record-level indexing, incremental queries, and strengths for CDC-heavy workloads.

  5. 05
    Comparing Delta, Iceberg & Hudi

    Evaluate the three major table formats side by side on concurrency, ecosystem support, partition evolution, and community momentum to make an informed choice.
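
The formats in this section all layer a commit log or metadata tree over immutable data files. The toy model below sketches that idea under strong simplifications: an in-memory dictionary stands in for the log directory, and a writer may only commit the version directly after the one it read (optimistic concurrency). `ToyLog` and its methods are hypothetical names for this example and do not correspond to any format's actual API.

```python
import json


class ToyLog:
    """In-memory stand-in for a table's ordered commit log."""

    def __init__(self):
        self._commits = {}  # version number -> JSON-encoded list of actions

    def latest_version(self):
        return max(self._commits, default=-1)

    def try_commit(self, read_version, actions):
        """Commit as read_version + 1, but only if that is the next free version."""
        target = read_version + 1
        if target != self.latest_version() + 1:
            return False  # another writer committed first; caller must re-read and retry
        self._commits[target] = json.dumps(actions)
        return True

    def snapshot(self, version=None):
        """Replay add/remove actions up to `version` to list the live data files."""
        if version is None:
            version = self.latest_version()
        live = set()
        for v in range(version + 1):
            for action in json.loads(self._commits[v]):
                if action["op"] == "add":
                    live.add(action["path"])
                else:
                    live.discard(action["path"])
        return sorted(live)


log = ToyLog()
log.try_commit(-1, [{"op": "add", "path": "a.parquet"}])
seen = log.latest_version()                       # both writers read version 0
first = log.try_commit(seen, [{"op": "add", "path": "b.parquet"}])
second = log.try_commit(seen, [{"op": "add", "path": "c.parquet"}])
print(first, second, log.snapshot(), log.snapshot(0))
```

Reading `snapshot(0)` after later commits is a miniature form of time travel: old versions stay readable because data files are never modified in place, only added to or removed from the log.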

§ SECTION 04 · LAKEHOUSE ARCHITECTURE
  1. 01
    The Lakehouse Paradigm

    Understand how the lakehouse combines the low-cost storage of a data lake with the reliability, performance, and governance features of a data warehouse.

  2. 02
    Medallion Architecture (Bronze/Silver/Gold)

    Learn the medallion pattern for incrementally refining data quality — bronze for raw ingestion, silver for validated and conformed data, gold for business-level aggregates.

  3. 03
    Query Engines for Lakehouses

    Survey the engines that query lakehouse tables — Spark, Trino/Presto, Dremio, StarRocks — and understand the trade-offs between throughput, latency, and cost.

  4. 04
    SQL Analytics on the Lakehouse

    Learn how to run interactive SQL over lakehouse tables, including BI tool integration, caching layers, and materialized views for sub-second dashboards.

§ SECTION 05 · GOVERNANCE, SECURITY & DATA QUALITY
  1. 01
    Fine-Grained Access Control

    Understand column-level and row-level security, role-based policies, and how catalogs like Unity Catalog or Apache Ranger enforce permissions across multiple engines.

  2. 02
    Data Quality & Expectations

    Learn to define and enforce data quality rules — null checks, range validations, referential integrity — using tools like Great Expectations or Delta Live Tables expectations.

  3. 03
    Data Lineage & Observability

    Trace how data flows from source to consumption, detect stale or broken pipelines, and integrate lineage tools like OpenLineage and DataHub into your lakehouse.

§ SECTION 06 · PERFORMANCE & OPTIMIZATION
  1. 01
    Compaction & Small File Problem

    Understand why streaming ingestion creates many tiny files, how compaction (bin-packing) merges them, and how to schedule maintenance without disrupting readers.

  2. 02
    Data Skipping & Clustering

    Learn how min/max statistics, bloom filters, Z-ordering, and liquid clustering let engines skip irrelevant files and dramatically reduce I/O.

  3. 03
    Caching & Query Acceleration

    Explore local SSD caching, result caching, and acceleration layers (Alluxio, the Databricks disk cache, formerly called Delta Cache) that bridge the gap between object store latency and interactive query expectations.
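
The bin-packing compaction mentioned in this section can be sketched as a planning step that groups small files into rewrite jobs close to a target size. The function below uses a first-fit-decreasing heuristic; `plan_compaction`, the 128 MB target, and the sample sizes are assumptions for illustration, and real table formats apply their own planners and thresholds.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group small files into bins of at most target_mb (first-fit decreasing)."""
    bins = []  # each bin lists the file sizes to rewrite as one output file
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already large enough; leave the file untouched
        for current in bins:
            if sum(current) + size <= target_mb:
                current.append(size)
                break
        else:
            bins.append([size])  # no existing bin has room; open a new one
    # A bin holding a single file gains nothing from being rewritten.
    return [b for b in bins if len(b) > 1]


plan = plan_compaction([200, 60, 60, 40, 30, 10])
print(plan)
```

In this run six input files become three after compaction: the 200 MB file is left alone and the five small ones are rewritten into two outputs. Because the rewrite produces new files and swaps them in with a single commit, readers on the previous snapshot are not disrupted.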

§ SECTION 07 · REAL-TIME & ADVANCED PATTERNS
  1. 01
    Streaming into the Lakehouse

    Learn how Spark Structured Streaming, Flink, and Kafka Connect write directly into Delta/Iceberg tables with exactly-once semantics and low-latency availability.

  2. 02
    Change Data Capture Pipelines

    Build CDC pipelines that replicate operational databases into lakehouse tables using Debezium, Airbyte, or native connectors, keeping the lake in near-real-time sync.

  3. 03
    ML & Feature Engineering on the Lakehouse

    Understand how to use lakehouse tables as feature stores, run distributed training on lake data, and version ML datasets alongside model artifacts.