A comprehensive learning path covering data lake architectures, lakehouse patterns, and the modern open table format ecosystem. You'll progress from foundational storage concepts through Delta Lake, Apache Iceberg, and Hudi, to production-grade lakehouse design with governance, performance tuning, and real-time ingestion.

§ SYLLABUS

§ SECTION 01 · STORAGE & DATA FOUNDATIONS

01
Distributed Storage Fundamentals
Understand how distributed file systems like HDFS and cloud object stores (S3, ADLS, GCS) work: block vs. object storage, the consistency guarantees each store provides, and how data is physically organized.
02
Columnar File Formats
Learn why columnar formats like Parquet and ORC dominate analytical workloads — row groups, column chunks, encoding schemes, and how predicate pushdown works at the file level.
03
Data Partitioning Strategies
Master how partitioning by date, region, or other keys reduces scan volume, and understand the trade-offs between too many small files and too few large ones.
04
Data Warehouse vs. Data Lake
Compare the traditional data warehouse model (schema-on-write, structured) with the data lake approach (schema-on-read, multi-format) and understand where each excels.
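
As a preview of lesson 03, here is a minimal sketch of how partitioning reduces scan volume. It assumes a Hive-style `key=value` directory layout; the file paths and the helper names `partition_values` and `prune` are hypothetical and exist only for illustration. Real engines do this pruning from catalog or table metadata rather than by parsing path strings.

```python
# Hypothetical file listing for a table partitioned Hive-style by date and region.
FILES = [
    "events/dt=2024-01-01/region=eu/part-0000.parquet",
    "events/dt=2024-01-01/region=us/part-0000.parquet",
    "events/dt=2024-01-02/region=eu/part-0000.parquet",
    "events/dt=2024-01-02/region=us/part-0000.parquet",
]


def partition_values(path):
    """Extract the key=value directory segments from a Hive-style path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def prune(files, **wanted):
    """Keep only files whose partition values match every requested key."""
    return [
        f for f in files
        if all(partition_values(f).get(k) == v for k, v in wanted.items())
    ]


# A query filtered on dt and region only needs to open one of the four files.
selected = prune(FILES, dt="2024-01-02", region="eu")
print(selected)
```

The same layout also shows the trade-off the lesson mentions: every extra partition key multiplies the number of directories, and with it the risk of many small files.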

§ SECTION 02 · DATA LAKE ARCHITECTURE

01
Lake Zones: Raw, Curated & Consumption
Learn the multi-zone pattern — landing raw data, cleaning it into curated layers, and serving aggregated datasets for consumption — and why this layering prevents chaos.
02
Batch & Streaming Ingestion
Understand the key ingestion modes — scheduled batch loads, CDC streams, and micro-batch — and when to use each for landing data into a lake.
03
Schema Evolution & Enforcement
Learn how schemas drift over time in a lake, the problems this creates, and the mechanisms (schema registries, merge rules) that keep data compatible.
04
Metadata & Data Catalogs
Understand how tools like AWS Glue Catalog, Hive Metastore, and Unity Catalog track table locations, schemas, and statistics so engines can discover and query lake data.
05
Data Swamp Anti-Patterns
Identify the common failures that turn a data lake into an unusable swamp — missing metadata, no ownership, unbounded schema drift, and lack of quality checks.

§ SECTION 03 · OPEN TABLE FORMATS

01
What Open Table Formats Solve
Understand the core problem — ACID transactions, time travel, and efficient upserts on top of immutable object storage — and why plain Parquet directories aren't enough.
02
Delta Lake Deep Dive
Learn Delta Lake's transaction log, optimistic concurrency, Z-ordering, VACUUM, and MERGE semantics. Understand its tight Spark integration and its place in the Databricks ecosystem.
03
Apache Iceberg Deep Dive
Explore Iceberg's snapshot-based metadata tree, hidden partitioning, partition evolution, and multi-engine compatibility across Spark, Trino, Flink, and more.
04
Apache Hudi Overview
Understand Hudi's copy-on-write vs. merge-on-read table types, its record-level indexing, incremental queries, and strengths for CDC-heavy workloads.
05
Comparing Delta, Iceberg & Hudi
Evaluate the three major table formats side by side on concurrency, ecosystem support, partition evolution, and community momentum to make an informed choice.
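
To make the ideas in this section concrete, the following toy sketch shows how an append-only commit log can provide atomic commits, optimistic concurrency, and time travel over immutable files. It is a deliberate simplification, not the on-disk format of Delta Lake, Iceberg, or Hudi: the class `ToyTableLog`, its methods, and the file names are hypothetical, and real formats detect conflicts at a finer grain than "any newer commit wins".

```python
class ConflictError(Exception):
    """Raised when a writer commits against a stale table version."""


class ToyTableLog:
    """Toy commit log: each commit lists data files added and removed."""

    def __init__(self):
        self.commits = []  # the commit at index i defines table version i

    def version(self):
        """Latest committed version, or -1 for an empty table."""
        return len(self.commits) - 1

    def commit(self, read_version, add=(), remove=()):
        """Commit only if no other writer committed since read_version."""
        if read_version != self.version():
            raise ConflictError(
                f"read version {read_version}, table is at {self.version()}"
            )
        self.commits.append({"add": list(add), "remove": list(remove)})
        return self.version()

    def snapshot(self, version=None):
        """Replay the log up to a version to list the live data files."""
        if version is None:
            version = self.version()
        live = set()
        for entry in self.commits[: version + 1]:
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return sorted(live)


log = ToyTableLog()
v0 = log.commit(-1, add=["a.parquet", "b.parquet"])
# An "update" rewrites a.parquet as c.parquet instead of modifying it in place.
v1 = log.commit(v0, add=["c.parquet"], remove=["a.parquet"])

print(log.snapshot())    # current table state
print(log.snapshot(v0))  # time travel to the first version

try:
    log.commit(v0, add=["d.parquet"])  # writer that read a stale version
except ConflictError as err:
    print("conflict:", err)
```

A plain Parquet directory has no equivalent of `commits`, which is why readers can observe half-written updates there.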

§ SECTION 04 · LAKEHOUSE ARCHITECTURE

01
The Lakehouse Paradigm
Understand how the lakehouse combines the low-cost storage of a data lake with the reliability, performance, and governance features of a data warehouse.
02
Medallion Architecture (Bronze/Silver/Gold)
Learn the medallion pattern for incrementally refining data quality — bronze for raw ingestion, silver for validated and conformed data, gold for business-level aggregates.
03
Query Engines for Lakehouses
Survey the engines that query lakehouse tables — Spark, Trino/Presto, Dremio, StarRocks — and understand the trade-offs between throughput, latency, and cost.
04
SQL Analytics on the Lakehouse
Learn how to run interactive SQL over lakehouse tables, including BI tool integration, caching layers, and materialized views for sub-second dashboards.

§ SECTION 05 · GOVERNANCE, SECURITY & DATA QUALITY

01
Fine-Grained Access Control
Understand column-level and row-level security, role-based policies, and how catalogs like Unity Catalog or policy frameworks like Apache Ranger enforce permissions across multiple engines.
02
Data Quality & Expectations
Learn to define and enforce data quality rules — null checks, range validations, referential integrity — using tools like Great Expectations or Delta Live Tables expectations.
03
Data Lineage & Observability
Trace how data flows from source to consumption, detect stale or broken pipelines, and integrate lineage tools like OpenLineage and DataHub into your lakehouse.

§ SECTION 06 · PERFORMANCE & OPTIMIZATION

01
Compaction & Small File Problem
Understand why streaming ingestion creates many tiny files, how compaction (bin-packing) merges them, and how to schedule maintenance without disrupting readers.
02
Data Skipping & Clustering
Learn how min/max statistics, bloom filters, Z-ordering, and liquid clustering let engines skip irrelevant files and dramatically reduce I/O.
03
Caching & Query Acceleration
Explore local SSD caching, result caching, and acceleration layers (Alluxio, the Databricks disk cache, formerly called Delta cache) that bridge the gap between object store latency and interactive query expectations.
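
The data-skipping idea from lesson 02 can be sketched in a few lines. This assumes each data file carries per-column min/max statistics, as Parquet footers and table-format metadata typically do; the `FILE_STATS` values and the function `files_to_scan` are hypothetical and only illustrate the overlap test.

```python
# Hypothetical per-file statistics: column name -> (min, max) value in that file.
FILE_STATS = {
    "part-0.parquet": {"order_id": (1, 1000)},
    "part-1.parquet": {"order_id": (1001, 2000)},
    "part-2.parquet": {"order_id": (2001, 3000)},
}


def files_to_scan(stats, column, lo, hi):
    """Return files whose [min, max] range for column overlaps [lo, hi]."""
    keep = []
    for name, columns in stats.items():
        file_min, file_max = columns[column]
        # A file can be skipped only if its range lies entirely outside the filter.
        if file_max >= lo and file_min <= hi:
            keep.append(name)
    return keep


# WHERE order_id BETWEEN 1500 AND 2100 touches two of the three files.
print(files_to_scan(FILE_STATS, "order_id", 1500, 2100))
```

Skipping works this well only because each file covers a narrow, non-overlapping range of `order_id`. Clustering techniques such as Z-ordering aim to produce that kind of layout for the columns queries filter on.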

§ SECTION 07 · REAL-TIME & ADVANCED PATTERNS

01
Streaming into the Lakehouse
Learn how Spark Structured Streaming, Flink, and Kafka Connect write directly into Delta/Iceberg tables with exactly-once semantics and low-latency availability.
02
Change Data Capture Pipelines
Build CDC pipelines that replicate operational databases into lakehouse tables using Debezium, Airbyte, or native connectors, keeping the lake in near-real-time sync.
03
ML & Feature Engineering on the Lakehouse
Understand how to use lakehouse tables as feature stores, run distributed training on lake data, and version ML datasets alongside model artifacts.
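
For the CDC lesson in this section, here is a minimal sketch of applying a stream of change events to a target table. The event shape (`op`, `key`, `after`) is an assumption loosely modeled on common CDC envelopes, not the exact schema of Debezium or Airbyte, and the in-memory dict stands in for a lakehouse table that would normally be updated with a MERGE statement.

```python
def apply_changes(table, events):
    """Apply ordered change events to a dict keyed by primary key.

    Assumed event shape: {"op": "c" | "u" | "d", "key": ..., "after": row or None}
    """
    for event in events:
        key = event["key"]
        if event["op"] == "d":
            table.pop(key, None)  # delete; ignore keys that are already gone
        else:
            table[key] = event["after"]  # create and update are both upserts
    return table


target = {1: {"name": "alice", "plan": "free"}}
events = [
    {"op": "u", "key": 1, "after": {"name": "alice", "plan": "pro"}},
    {"op": "c", "key": 2, "after": {"name": "bob", "plan": "free"}},
    {"op": "d", "key": 1, "after": None},
]

print(apply_changes(target, events))
```

Because each event is an upsert or a tolerant delete, replaying the same batch twice leaves the table in the same state, which is one way to tolerate at-least-once delivery.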

Data Lakes & Lakehouses