
DENG-255: Building an Open Data Lakehouse using Apache Iceberg
The Open Data Lakehouse is a modern data architecture that enables versatile analytics on streaming and stored data within cloud-native object stores, and it can span hybrid and multi-cloud environments. This course introduces Apache Ozone, a hybrid storage service that addresses the limitations of HDFS. You'll also explore Apache Iceberg, an open table format optimized for petabyte-scale datasets. The course covers Iceberg's benefits, architecture, read/write operations, streaming, and advanced features such as time travel, partition evolution, and Data-as-Code. More than 25 hands-on labs and a capstone project will equip you with the skills to build an efficient, performant Open Data Lakehouse in your own environment.
This course teaches participants the following skills:
- Gain a deep understanding of Iceberg's benefits, snapshots, and their functionalities.
- Confidently build external and managed tables, configuring copy-on-write and merge-on-read for optimized data management.
- Perform rollbacks and time travel, navigate schema and partition evolution, and utilize hidden partitions.
- Create and merge table branches, mastering Iceberg's write-audit-publish procedure.
- Efficiently perform table maintenance tasks and tackle data migration challenges.
Open Data Lakehouse Fundamentals
- Understand core Open Data Lakehouse concepts and benefits.
- Introduction to Apache Ozone and its integration within the CDP ecosystem.

Apache Ozone Mastery
- Configure Ozone, use CLI commands, and transfer data between HDFS and Ozone.
- Integrate Ozone into applications.

Apache Iceberg Expertise
- Explore Iceberg's integration with CDP, architecture, and data lakehouse design principles.
- Master data management, governance, and optimization best practices.
- Understand snapshots and time travel queries.
- Design tables strategically (external/managed, copy-on-write, merge-on-read).
- Employ advanced features: change data capture (CDC), schema/partition evolution, hidden partitions.

Data-as-Code and Compliance
- Implement zero-copy cloning, table branching, and tagging for QA, ML models, and auditing.
- Optimize ETL/ELT data loading and achieve GDPR compliance with Iceberg's write-audit-publish (WAP).

Hive to Iceberg Migration
- Understand catalog differences and migration strategies.
- Manage late-arriving data effectively.

Iceberg Administration
- Perform table maintenance tasks.
- Configure and manage access control settings.

Capstone Project
- Apply all concepts by implementing an Open Data Lakehouse use case in CDP.
- Develop a comprehensive Open Data Lakehouse implementation runbook.
This course is designed for data professionals in organizations using Cloudera Data Warehouse or Cloudera Data Engineering solutions. If you're building an Open Data Lakehouse powered by Apache Iceberg, this course will provide the knowledge and skills you need. Ideal roles include Data Engineers, Hive/Impala SQL Developers, Kafka Streaming Engineers, Data Scientists, and CDP Admins. A basic understanding of HDFS and experience with Hive and Spark are prerequisites.



