
DENG-254: Preparing with Cloudera Data Engineering and Apache Spark
This hands-on training course delivers the key concepts and knowledge developers need to use Apache Spark to develop high-performance, parallel applications on the Cloudera Data Platform (CDP). Hands-on exercises allow students to practice writing Spark applications that integrate with CDP core components. Participants will learn how to use Spark SQL to query structured data, how to use Hive features to ingest and denormalize data, and how to work with "big data" stored in a distributed file system.

After taking this course, participants will be prepared to face real-world challenges and build applications that support faster, better decisions and interactive analysis, applied to a wide variety of use cases, architectures, and industries.
During this course, you will learn how to:
- Distribute, store, and process data in a CDP cluster
- Write, configure, and deploy Apache Spark applications
- Use the Spark interpreters and Spark applications to explore, process, and analyze distributed data
- Query data using Spark SQL, DataFrames, and Hive tables
- Deploy a Spark application on the Data Engineering Service
HDFS Introduction
- HDFS Overview
- HDFS Components and Interactions
- Additional HDFS Interactions
- Ozone Overview
- Exercise: Working with HDFS

YARN Introduction
- YARN Overview
- YARN Components and Interaction
- Working with YARN
- Exercise: Working with YARN

Working with RDDs
- Resilient Distributed Datasets (RDDs)
- Exercise: Working with RDDs

Working with DataFrames
- Introduction to DataFrames
- Exercise: Introducing DataFrames
- Exercise: Reading and Writing DataFrames
- Exercise: Working with Columns
- Exercise: Working with Complex Types
- Exercise: Combining and Splitting DataFrames
- Exercise: Summarizing and Grouping DataFrames
- Exercise: Working with UDFs
- Exercise: Working with Windows

Introduction to Apache Hive
- About Hive
- Transforming Data with HiveQL

Working with Apache Hive
- Exercise: Working with Partitions
- Exercise: Working with Buckets
- Exercise: Working with Skew
- Exercise: Using SerDes to Ingest Text Data
- Exercise: Using Complex Types to Denormalize Data

Hive and Spark Integration
- Hive and Spark Integration
- Exercise: Spark Integration with Hive

Distributed Processing Challenges
- Shuffle
- Skew
- Order

Spark Distributed Processing
- Spark Distributed Processing
- Exercise: Explore Query Execution Order

Spark Distributed Persistence
- DataFrame and Dataset Persistence
- Persistence Storage Levels
- Viewing Persisted RDDs
- Exercise: Persisting DataFrames

Data Engineering Service
- Create and Trigger Ad-Hoc Spark Jobs
- Orchestrate a Set of Jobs Using Airflow
- Data Lineage Using Atlas
- Auto-scaling in Data Engineering Service

Workload XM
- Optimize Workloads, Performance, Capacity
- Identify Suboptimal Spark Jobs

Appendix: Working with Datasets in Scala
- Working with Datasets in Scala
- Exercise: Using Datasets in Scala
This course is designed for developers and data engineers. All students are expected to have basic Linux experience and basic proficiency in either the Python or Scala programming language. Basic knowledge of SQL is helpful. Prior knowledge of Spark and Hadoop is not required.



