
Developer Training for Apache Spark and Hadoop

Course Outcomes:
• Distribute, store, and process data in a Hadoop cluster
• Write, configure, and deploy Spark applications on a cluster
• Use the Spark shell for interactive data analysis
• Process and query structured data using Spark SQL and Hive Query Language
• Understand a wide variety of learning algorithms and build an end-to-end machine
learning model with MLlib in PySpark
• Use Spark Streaming to process a live data stream

What to Expect
This course is designed for developers and engineers who have programming experience, but
prior knowledge of Hadoop and/or Spark is not required.
• Apache Spark examples and hands-on exercises are presented in Scala and Python. The
ability to program in one of those languages is required.
• Basic familiarity with the Linux command line is assumed.
• Basic knowledge of SQL is helpful.

Course Duration: 64 Hours

Module 1
Introduction to Apache Hadoop and the Hadoop Ecosystem
• Apache Hadoop Overview
• Data Ingestion and Storage
• Data Processing
• Data Analysis and Exploration
• Other Ecosystem Tools
• Introduction to the Hands-On Exercises

Module 2
Apache Hadoop File Storage
• Apache Hadoop Cluster Components
• HDFS Architecture
• Using HDFS

Module 3
Distributed Processing on an Apache Hadoop Cluster
• YARN Architecture
• Working With YARN

Module 4
Apache Spark Basics
• What is Apache Spark?
• Starting the Spark Shell
• Using the Spark Shell
• Getting Started with Datasets and DataFrames
• DataFrame Operations
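
As a quick illustration of where Module 4 ends up, the sketch below starts a Spark session and runs a few basic DataFrame operations in PySpark. It is a minimal sketch, not course material: the input file people.json and its name and age columns are hypothetical placeholders.

# Minimal PySpark sketch: create a session and run basic DataFrame operations.
# The input file and its columns are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-basics").getOrCreate()

# Read a JSON file into a DataFrame; Spark infers the schema.
people = spark.read.json("people.json")

people.printSchema()  # inspect the inferred schema
people.select("name", "age").where(people.age > 21).show()

spark.stop()

In the interactive shell launched with the pyspark command, the spark session object is already created for you, so the builder line is unnecessary there.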

Module 5
Working with DataFrames and Schemas
• Introduction to DataFrames
• Exercise: Introducing DataFrames
• Exercise: Reading and Writing DataFrames
• Exercise: Working with Columns
• Exercise: Working with Complex Types
• Exercise: Combining and Splitting DataFrames
• Exercise: Summarizing and Grouping DataFrames
• Exercise: Working with UDFs
• Exercise: Working with Windows
• Eager and Lazy Execution

Module 6
Analyzing Data with DataFrame Queries
• Querying DataFrames Using Column Expressions
• Grouping and Aggregation Queries
• Joining DataFrames

Module 7
Introduction to Apache Hive
• About Hive
• Transforming Data with HiveQL

Module 8
Working with Apache Hive
• Exercise: Working with Partitions
• Exercise: Working with Buckets
• Exercise: Working with Skew
• Exercise: Using SerDes to Ingest Text Data
• Exercise: Using Complex Types to Denormalize Data

Module 9
Hive and Spark Integration
• Hive and Spark Integration
• Exercise: Spark Integration with Hive

Module 10
RDD Overview
• RDD Overview
• RDD Data Sources
• Creating and Saving RDDs
• RDD Operations

Module 11
Transforming Data with RDDs
• Writing and Passing Transformation Functions
• Transformation Execution
• Converting Between RDDs and DataFrames

Module 12
Aggregating Data with Pair RDDs
• Key-Value Pair RDDs
• Map-Reduce
• Other Pair RDD Operations

Module 13
Querying Tables and Views with Apache Spark SQL
• Querying Tables in Spark Using SQL
• Querying Files and Views
• The Catalog API
• Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark

Module 14
Working with Datasets in Scala
• Datasets and DataFrames
• Creating Datasets
• Loading and Saving Datasets
• Dataset Operations

Module 15
Writing, Configuring, and Running Apache Spark Applications
• Writing a Spark Application
• Building and Running an Application
• Application Deployment Mode
• The Spark Application Web UI
• Configuring Application Properties
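
To make the DataFrame-query and Spark SQL material of Modules 6 and 13 concrete, here is a minimal sketch that expresses the same aggregation first through the DataFrame API and then through SQL on a temporary view. The file sales.parquet and its region and amount columns are hypothetical placeholders.

# Minimal sketch: one aggregation, written two ways (Modules 6 and 13).
# The input file and its columns are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-queries").getOrCreate()

sales = spark.read.parquet("sales.parquet")

# DataFrame API: column expressions, grouping, and aggregation (Module 6).
(sales.groupBy("region")
      .agg(F.sum("amount").alias("total"), F.count("*").alias("orders"))
      .orderBy(F.desc("total"))
      .show())

# The same query in SQL against a temporary view (Module 13).
sales.createOrReplaceTempView("sales")
spark.sql("""
    SELECT region, SUM(amount) AS total, COUNT(*) AS orders
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").show()

spark.stop()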

Module 16
Distributed Processing
• Review: Apache Spark on a Cluster
• RDD Partitions
• Example: Partitioning in Queries
• Stages and Tasks
• Job Execution Planning
• Example: Catalyst Execution Plan
• Example: RDD Execution Plan

Module 17
Distributed Processing Challenges
• Shuffle
• Skew
• Order

Module 18
Distributed Data Persistence
• DataFrame and Dataset Persistence
• Persistence Storage Levels
• Viewing Persisted RDDs

Module 19
Machine Learning with Spark ML
• Common Apache Spark Use Cases
• Iterative Algorithms in Apache Spark: Machine Learning, Graph Processing
• Introduction to MLlib: Various ML Algorithms Supported by MLlib
• ML Model with Spark ML
• Exercise: Implement Linear Regression
• Exercise: Implement Logistic Regression
• Exercise: Implement Random Forest
• Exercise: Implement k-means

Module 20
Apache Spark Streaming: Introduction to DStreams
• Apache Spark Streaming Overview
• Example: Streaming Request Count
• DStreams
• Developing Streaming Applications

Module 21
Apache Spark Streaming: Processing Multiple Batches
• Multi-Batch Operations
• Time Slicing
• State Operations
• Sliding Window Operations
• Preview: Structured Streaming

Module 22
Apache Spark Streaming: Data Sources
• Streaming Data Source Overview
• Apache Flume and Apache Kafka Data Sources
• Example: Using a Kafka Direct Data Source
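
As a taste of the Module 19 exercises, the sketch below fits a logistic regression model with Spark ML's Pipeline API in PySpark. It is a minimal sketch under stated assumptions: the toy rows and the feature1, feature2, and label column names are hypothetical placeholders, not course data.

# Minimal Spark ML sketch: assemble features and fit logistic regression.
# The training rows and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml").getOrCreate()

# Toy training data: (feature1, feature2, label).
train = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (1.5, 0.5, 1.0), (0.1, 1.3, 0.0)],
    ["feature1", "feature2", "label"])

# Spark ML expects a single vector column of features, so assemble the raw
# columns first, then fit the classifier as one pipeline.
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("features", "label", "prediction").show()

spark.stop()

The same assemble-fit-transform pattern carries over to the linear regression, random forest, and k-means exercises by swapping in the corresponding estimator (k-means uses only the features column, with no label).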
