Apache Spark and Scala
Module 5: Spark and Big Data
© 2015 BlueCamphor Technologies (P) Ltd.
Course Topics
Module 1 Module 2 Module 3 Module 4
Getting Started / Scala – Essentials and Introducing Traits and Functional Programming
Introduction to Scala Deep Dive OOPS in Scala in Scala
Module 5 Module 6 Module 7 Module 8
Spark and Big Data Advanced Spark Understanding RDDs Shark, SparkSQL and
Concepts Project Discussion
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 2
Session Objectives
In this session, you will understand:
ᗍ Analyze Batch Processing and Real-time Processing
ᗍ Understand Spark Ecosystem
ᗍ Analyze MapReduce Limitations
ᗍ Go through Spark History
ᗍ Analyze Spark Architecture
ᗍ Understand Spark and Hadoop Advantages
ᗍ Analyze benefits of Spark and Hadoop combined
ᗍ Install Spark
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 3
Bombay Stock Exchange – Big Data Case Study
ᗍ When Bombay Stock Exchange (the seventh largest stock exchange in the world, in terms of market capitalization)
wanted to ramp up / scale up its operations, the company faced major challenges
ᗍ These challenges were in terms of exponential growth of data (read big data), need for complex analytics and
managing information that was scattered across multiple and monolithic system
ᗍ DataMetica (a Mumbai / Pune based big data organization) suggested a 3 phased solution to BSE:
• In the first phase, they created a POC which demonstrated how a Hadoop based Big data implementation can
work for BSE
• In the second phase, they worked with BSE to pick up the most critical business use cases (which had the
maximum ROI for BSE) and implemented them
• Finally in the third phase they delivered the complete solution in a multi-faced manner for a full fledged
implementation
ᗍ That’s how Hadoop got implemented at BSE in a cost effective and scalable fashion
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 4
Batch Processing Phase / Life Cycle
ᗍ Processing transactions in a group or batch
ᗍ Following three phases are common to batch processing or business analytics project, irrespective of the type of
data (structured or unstructured)
Data Collection Data Preparation Data Presentation
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 5
Data Collection
Real Time System Business
Analytics / Batch
Flume Processing
System
Unstructured
Data
Sqoop
Structured
Data
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 6
Data Preparation
Business
Analytics / Batch
Processing Pig
System
Data Processing
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 7
Data Presentation
Business
Analytics / Batch Pig
Processing
System
Data Processing
Output
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 8
Real Time Analytics Examples – WindyGrid
ᗍ City of Chicago uses a MongoDB based Real Time Analytics Platform called WindyGrid
ᗍ This platform integrates unstructured data from various city departments to predict co-relations and outcomes in a
proactive manner. E.g. How a rodent complaint will follow well within 7 days of a garbage complaint
WindyGrid in Practice
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 9
Real Time Analytics Examples – WindyGrid (Cont’d)
ᗍ With MongoDB based system, WindyGrid created a central nervous system for Chicago, helping improve services,
cut costs, and create a more livable city
ᗍ By pulling together 311 and 911 calls, tweets, and bus locations, the city can better manage traffic and incidents
and get streets cleaned and opened up more quickly
ᗍ The city of Chicago collects more than seven million rows of data every day. With MongoDB’s flexible data schema,
This system doesn’t need to worry about unwieldy and constantly changing schema requirements
The next step for WindyGrid is building an open-source, predictive analytics system called the SmartData Platform to
anticipate problems before they occur, and propose solutions in an even faster manner
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 10
What is Hadoop?
Apache Hadoop is a framework that allows the distributed processing of large data sets across
clusters of commodity computers using a simple programming mode
It is an Open-source Data Management with scale-out storage and distributed processing
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 11
Hadoop Key Characteristics
Reliable
Scalable Characteristics Economical
Flexible
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 12
What is Spark?
ᗍ Apache Spark is a fast and general engine for large-scale data processing
ᗍ Apache Spark is a general-purpose cluster in-memory computing system
ᗍ It is used for fast data analytics
ᗍ It abstracts APIs in Java, Scala and Python, and provides an optimized engine that supports general execution
graphs
ᗍ Provides various high level tools like Spark SQL for structured data processing, Mlib for Machine Learning and
more
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 13
Spark Ecosystem
Aplha/Pre-alpha
BlindDB
(Approximate
SQL)
Spark MLLib GraphX SparkR
SQL Streaming (Machine (Graph (R on
(Streaming) learning) Computation) Spark)
Spark Core Engine
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 14
Spark Ecosystem (Cont’d)
An approximate Aplha/Pre-alpha
query engine. To
run over Core
Spark Engine Enables analytical
and interactive apps Package for R language
for live streaming Graph Computation to enable R-users to
BlindDB data engine leverage Spark power
(Similar to Graph) from R shell
(Approximate
SQL)
Used for structured Spark MLLib GraphX SparkR
data. Can run SQL Streaming (Machine (Graph (R on
unmodified hive (Streaming) learning) Computation) Spark)
queries on existing
Hadoop deployment
Spark Core Engine
Machine learning library being built on top of Spark. Provision for support to many machine
learning algorithms with speeds upto 100 times faster than Map-Reduce
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 15
Spark Ecosystem (Cont’d)
ᗍ Spark Core Engine
• The core engine for entire Spark framework
• Provides utilities and architecture for other components
ᗍ Spark SQL
• Spark SQL is the newest component of Spark and provides a SQL like interface
• Used for Structured data
• Can expose many datasets as tables
• Spark SQL is tightly integrated with the various spark programming languages like hive
ᗍ Spark Streaming
• Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant
stream processing of live data streams
• A good alternative of Storm
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 16
Spark Ecosystem (Cont’d)
ᗍ BlinkDB
• An approximate query engine. To run over Core Spark Engine
• Accuracy trade-off for response time
ᗍ MLLib
• Machine learning library being built on top of Spark
• Provision for support to many machine learning algorithms with speeds upto 100 times faster than Map-
Reduce
• Mahout is also being migrated to MLLib
ᗍ GraphX
• Graph Computation engine (Similar to Giraph)
• Combines data-parallel and graph-parallel concepts
ᗍ SparkR
• Package for R language to enable R-users to leverage Spark power from R shell
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 17
Why Spark?
ᗍ Spark exposes a simple programming layer which provides powerful caching and disk persistence capabilities
ᗍ The Spark framework can be deployed through Apache Mesos, Apache Hadoop via Yarn, or Spark’s own cluster
manager
ᗍ Spark framework is polyglot – Can be programmed in several programming languages (Currently Scala, Java and
Python supported)
ᗍ Has super active community
ᗍ Spark fits well with existing Hadoop ecosystem
• Can be launched in existing YARN Cluster
• Can fetch the data from Hadoop 1.0
• Can be integrated with Hive
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 18
Brief History: M/R Limitations
ᗍ Map Reduce is a very powerful programming paradigm, but it has some limitations:
• Difficult to Program an algorithm directly in Native Map Reduce
• Performance bottlenecks, specifically for small batch not fitting the use cases
• Many categories of algorithms not supported (e.g. iterative algorithms, asynchronous algorithms etc.)
ᗍ In short, MR doesn’t compose well for large applications
ᗍ We are forced to take “hybrid” approaches many times
ᗍ Therefore, many specialized systems evolved over a period of time as workarounds
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 19
Brief History: Evolution of Specialized Systems
Pregel GIraph
Dremel Drill Tez
MapReduce
Impala GraphLab
Storm S4
General Batch Processing Specialized Systems
Iterative, interactive, streaming, graph, etc.
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 20
Brief History: Spark
ᗍ Unlike other evolved specialized systems, Spark’s design goal is to generalize Map Reduce concept to support new
apps within same engine
ᗍ Two reasonably small additions are enough to express the previous models:
• Fast data sharing (For Faster Processing)
• General DAGs (For Lazy Processing)
ᗍ This allows for an approach which is more efficient for the engine, and much simpler for the end users
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 21
Brief History: Spark Key Points
Code Size
140000
120000
100000 GraphX
80000 Shark
60000 Streaming
40000
20000
0
Used as libs, instead of
Non-test, non example source lines *also calls into Hive specialized systems
The State of Spark, and where we’re going next
Matei Zaharia
Spark Summit(2013)
you.be/nU6v02EJAb4
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 22
Brief History: Spark Key Points (Cont’d)
RDD Fault Tolerance
RDDs track the series of transformation used to build them (their lineage) to recomputed lost data
Example:
messages=textFile(…).filter(_.contains(“error”))
.map(_.split(‘\t’)(2))
HadoopRDD FilteredRDD MapperRDD
Path=hdfs:// Func=_.contains(…) Func=_.split(…)
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 23
Spark in Industry
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 24
Spark Advantages
ᗍ Easier APIs EASE OF DEVELOPMENT
ᗍ Python, Scala, Java
ᗍ RDDs
IN-MEMORY PERFORMANCE
ᗍ DAGs Unify Processing
ᗍ SQL, ML, Streaming, COMBINE WORKFLOWS
GraphX
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 25
Spark + Hadoop
Operational Applications Augmented by In-Memory Performance
Spark Hadoop
UNLIMITED EASE OF
SCALE DEVELOPMENT
IN-MEMORY The Combination of Spark on ENTERPRISE
PERFORMANCE PLATFORM
Hadoop
WIDE RANGE OF COMBINE
APPLICATIONS WORKFLOWS
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 26
Spark Cluster Manager
Worker Node
Executor Cache
Task Task
Driver Program
SparkContext Cluster Manager
Worker Node
Executor Cache
Task Task
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 27
Spark Architecture – SparkContext
ᗍ Spark apps run as separate set of process on a cluster
ᗍ All of the distributed process is coordinated by SparkContext object in the driver program
ᗍ SparkContext object then connects to one type of cluster Manager (Standalone/Yarn/Mesos) for resource allocation
across cluster
ᗍ Cluster Managers provide Executors, which are essentially JVM process to run the logic and store app data
ᗍ Then, the SparkContext object sends the application code ( jar files/python scripts) to executors
ᗍ Finally, the SparkContext executes tasks in each executor
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 28
SBT Demo
ᗍ Sbt stands for Scala Build Tool
ᗍ A build tool provides facility to compile, run, test, package your projects
ᗍ SBT is a modern build tool. While it is written in Scala and provides many Scala conveniences, it is a general
purpose build tool
Why SBT?
ᗍ Full Scala language support for creating tasks
ᗍ Continuous command execution
ᗍ Launch REPL in project context
Note: Given SBT installation guide separately
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 29
SBT Demo (Cont’d)
SBT console:
Testing in the Console:
ᗍ SBT can be used both as a command line script and as a build console
ᗍ We’ll be primarily using it as a build console, but most commands can be run standalone by passing the command
as an argument to SBT, e.g.
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 30
SBT Demo (Cont’d)
SBT allows you to start a Scala REPL with all your project dependencies loaded. It compiles your project source before
launching the console, providing us a quick way to bench test our parser
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 31
Simple Spark Apps: Word Count
This simple program provides a good test case for parallel processing, since it:
ᗍ Requires a minimal amount of code
ᗍ Demonstrates use of both symbolic and numeric values
ᗍ Isn’t many steps away from search indexing
val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out")
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 32
Using Hadoop as Storage
ᗍ Spark can use Hadoop as Storage
• Spark is NOT limited to HDFS only for it’s storage needs
• HDFS provides distributed storage of large datasets
• High Availability is assured natively through HDFS
• No extra software installation is required
• Compatible with Hadoop 1.x also. Using HDFS as storage doesn’t require Hadoop 2.x
• Data Loss during computation is handled by HDFS itself
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 33
Using Hadoop as Execution Engine
ᗍ Spark can use Hadoop as execution engine
• Spark can be integrated with Yarn for it’s execution
• Spark can be used with other engines (like Mesos, Spark Clsuter manager) also
• Yarn integration automatically provides processing scalability to Spark
• Spark needs Hadoop 2.0+ versions in order to use it for execution
• Every node in Hadoop cluster need Spark also to be installed
• Using Hadoop cluster for Spark processes, requires RAM upgrading of data nodes
• The integration distribution of Spark is quite new and still in the process of stabilization
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 34
Questions
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 35