A comprehensive learning path covering data lake architectures, lakehouse patterns, and the modern open table format ecosystem. You'll progress from foundational storage concepts through Delta Lake, Apache Iceberg, and Hudi, to production-grade lakehouse design with governance, performance tuning, and real-time ingestion.
§ SYLLABUS
- 01 Distributed Storage Fundamentals
Understand how distributed file systems like HDFS and cloud object stores (S3, ADLS, GCS) work — block vs. object storage, eventual consistency, and how data is physically organized.
- 02 Columnar File Formats
Learn why columnar formats like Parquet and ORC dominate analytical workloads — row groups, column chunks, encoding schemes, and how predicate pushdown works at the file level.
- 03 Data Partitioning Strategies
Master how partitioning by date, region, or other keys reduces scan volume, and understand the trade-offs between too many small files and too few large ones.
- 04 Data Warehouse vs. Data Lake
Compare the traditional data warehouse model (schema-on-write, structured) with the data lake approach (schema-on-read, multi-format) and understand where each excels.
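As a rough illustration of the partitioning topic above, the sketch below shows how an engine could use Hive-style `key=value` directory names to skip files before opening any of them. The file paths, the `sales` table, and the helper names `parse_partitions` and `prune` are hypothetical, chosen only for this example; real engines do this through a catalog or table metadata rather than by parsing strings.

```python
from typing import Dict, List


def parse_partitions(path: str) -> Dict[str, str]:
    """Extract key=value directory segments from a Hive-style path."""
    parts: Dict[str, str] = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def prune(paths: List[str], filters: Dict[str, str]) -> List[str]:
    """Keep only the files whose partition values match every filter."""
    return [
        p
        for p in paths
        if all(parse_partitions(p).get(k) == v for k, v in filters.items())
    ]


# Hypothetical layout: one file per (date, region) partition.
FILES = [
    "sales/date=2024-01-01/region=eu/part-0.parquet",
    "sales/date=2024-01-01/region=us/part-0.parquet",
    "sales/date=2024-01-02/region=eu/part-0.parquet",
    "sales/date=2024-01-02/region=us/part-0.parquet",
]

# A query filtered on both keys only needs to read one of the four files.
selected = prune(FILES, {"date": "2024-01-02", "region": "eu"})
print(selected)
```

A filter on a column that is not a partition key gets no benefit from this layout, which is one reason the choice of partition keys matters.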
- 01 Lake Zones: Raw, Curated & Consumption
Learn the multi-zone pattern — landing raw data, cleaning it into curated layers, and serving aggregated datasets for consumption — and why this layering prevents chaos.
- 02 Batch & Streaming Ingestion
Understand the key ingestion modes — scheduled batch loads, CDC streams, and micro-batch — and when to use each for landing data into a lake.
- 03 Schema Evolution & Enforcement
Learn how schemas drift over time in a lake, the problems this creates, and the mechanisms (schema registries, merge rules) that keep data compatible.
- 04 Metadata & Data Catalogs
Understand how tools like AWS Glue Catalog, Hive Metastore, and Unity Catalog track table locations, schemas, and statistics so engines can discover and query lake data.
- 05 Data Swamp Anti-Patterns
Identify the common failures that turn a data lake into an unusable swamp — missing metadata, no ownership, unbounded schema drift, and lack of quality checks.
- 01 What Open Table Formats Solve
Understand the core problem — ACID transactions, time travel, and efficient upserts on top of immutable object storage — and why plain Parquet directories aren't enough.
- 02 Delta Lake Deep Dive
Learn Delta Lake's transaction log, optimistic concurrency, Z-ordering, VACUUM, and MERGE semantics. Understand its tight Spark integration and Databricks ecosystem.
- 03 Apache Iceberg Deep Dive
Explore Iceberg's snapshot-based metadata tree, hidden partitioning, partition evolution, and multi-engine compatibility across Spark, Trino, Flink, and more.
- 04 Apache Hudi Overview
Understand Hudi's copy-on-write vs. merge-on-read table types, its record-level indexing, incremental queries, and strengths for CDC-heavy workloads.
- 05 Comparing Delta, Iceberg & Hudi
Evaluate the three major table formats side by side on concurrency, ecosystem support, partition evolution, and community momentum to make an informed choice.
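The table-format topics above share one idea: an ordered log of commits layered over immutable files, with writers that commit optimistically and readers that can replay the log to any earlier version. The toy model below sketches that idea in memory. It is not the on-disk protocol of Delta Lake, Iceberg, or Hudi, and the names `ToyTableLog`, `CommitConflict`, and `files_at` are invented for illustration.

```python
from typing import Dict, Iterable, List, Set


class CommitConflict(Exception):
    """Raised when another writer committed after this writer's read."""


class ToyTableLog:
    """In-memory stand-in for a table's ordered commit log."""

    def __init__(self) -> None:
        self._commits: List[Dict[str, List[str]]] = []

    @property
    def latest_version(self) -> int:
        return len(self._commits) - 1  # -1 means the table has no commits yet

    def commit(
        self,
        read_version: int,
        add: Iterable[str] = (),
        remove: Iterable[str] = (),
    ) -> int:
        """Append a commit only if nobody else committed since read_version."""
        if read_version != self.latest_version:
            raise CommitConflict(
                f"read version {read_version}, log is at {self.latest_version}"
            )
        self._commits.append({"add": list(add), "remove": list(remove)})
        return self.latest_version

    def files_at(self, version: int) -> Set[str]:
        """Replay the log up to `version` to get the live data files."""
        live: Set[str] = set()
        for entry in self._commits[: version + 1]:
            live.update(entry["add"])
            live.difference_update(entry["remove"])
        return live


log = ToyTableLog()
v0 = log.commit(-1, add=["a.parquet", "b.parquet"])
v1 = log.commit(v0, add=["c.parquet"], remove=["a.parquet"])  # rewrite of a

# A second writer that still believes the table is at v0 must retry.
try:
    log.commit(v0, add=["d.parquet"])
    conflicted = False
except CommitConflict:
    conflicted = True

print(sorted(log.files_at(v0)), sorted(log.files_at(v1)), conflicted)
```

This toy rejects every stale writer. Real formats typically inspect what the competing commits changed and may let non-conflicting writes, such as independent appends, succeed.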
- 01 The Lakehouse Paradigm
Understand how the lakehouse combines the low-cost storage of a data lake with the reliability, performance, and governance features of a data warehouse.
- 02 Medallion Architecture (Bronze/Silver/Gold)
Learn the medallion pattern for incrementally refining data quality — bronze for raw ingestion, silver for validated and conformed data, gold for business-level aggregates.
- 03 Query Engines for Lakehouses
Survey the engines that query lakehouse tables — Spark, Trino/Presto, Dremio, StarRocks — and understand the trade-offs between throughput, latency, and cost.
- 04 SQL Analytics on the Lakehouse
Learn how to run interactive SQL over lakehouse tables, including BI tool integration, caching layers, and materialized views for sub-second dashboards.
- 01 Fine-Grained Access Control
Understand column-level and row-level security, role-based policies, and how catalogs like Unity Catalog or Apache Ranger enforce permissions across multiple engines.
- 02 Data Quality & Expectations
Learn to define and enforce data quality rules — null checks, range validations, referential integrity — using tools like Great Expectations or Delta Live Tables expectations.
- 03 Data Lineage & Observability
Trace how data flows from source to consumption, detect stale or broken pipelines, and integrate lineage tools like OpenLineage and DataHub into your lakehouse.
- 01 Compaction & Small File Problem
Understand why streaming ingestion creates many tiny files, how compaction (bin-packing) merges them, and how to schedule maintenance without disrupting readers.
- 02 Data Skipping & Clustering
Learn how min/max statistics, bloom filters, Z-ordering, and liquid clustering let engines skip irrelevant files and dramatically reduce I/O.
- 03 Caching & Query Acceleration
Explore local SSD caching, result caching, and acceleration layers (Alluxio, Delta Cache) that bridge the gap between object store latency and interactive query expectations.
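To make the data-skipping topic above concrete, here is a minimal sketch of min/max pruning. It assumes each file carries per-column minimum and maximum values in its metadata and that the query is a simple range filter on one integer column. `FileStats`, `files_to_scan`, and the file names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: int  # smallest value of the filtered column in this file
    max_value: int  # largest value of the filtered column in this file


def files_to_scan(files: List[FileStats], lo: int, hi: int) -> List[str]:
    """Return files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


# Data sorted or clustered on the column: each file covers a narrow range.
CLUSTERED = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# Same data written in arrival order: every file spans almost the full range.
UNCLUSTERED = [
    FileStats("part-0.parquet", 1, 298),
    FileStats("part-1.parquet", 2, 299),
    FileStats("part-2.parquet", 3, 300),
]

clustered_hits = files_to_scan(CLUSTERED, 150, 180)
unclustered_hits = files_to_scan(UNCLUSTERED, 150, 180)
print(len(clustered_hits), len(unclustered_hits))
```

The statistics only help when values are laid out so that file ranges are narrow, which is the motivation for clustering techniques such as Z-ordering.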
- 01 Streaming into the Lakehouse
Learn how Spark Structured Streaming, Flink, and Kafka Connect write directly into Delta/Iceberg tables with exactly-once semantics and low-latency availability.
- 02 Change Data Capture Pipelines
Build CDC pipelines that replicate operational databases into lakehouse tables using Debezium, Airbyte, or native connectors, keeping the lake in near-real-time sync.
- 03 ML & Feature Engineering on the Lakehouse
Understand how to use lakehouse tables as feature stores, run distributed training on lake data, and version ML datasets alongside model artifacts.