The document outlines the fundamentals of programming Apache Spark with PySpark, covering session objectives, RDDs, transformations and actions, and visualizing big data. It details the structure of a Spark program (a driver program plus worker programs), the creation and manipulation of RDDs, and the use of shared variables (broadcast variables and accumulators). It also discusses deploying Spark in cloud environments such as Azure and provides resources for further learning.
Programming in Spark using PySpark
Mostafa Elzoghbi
Sr. Technical Evangelist - Microsoft
@MostafaElzoghbi
http://mostafa.rocks
2.
Session Objectives & Takeaways
• Programming Spark
• Spark program structure
• Working with RDDs
• Transformations versus actions
• Lambda functions, shared variables (broadcast vs. accumulators)
• Visualizing big data in Spark
• Spark in the cloud (Azure)
• Working with cluster types, notebooks, scaling
3.
Python Spark (pySpark)
• We are using the Python programming interface to Spark (pySpark)
• pySpark provides an easy-to-use programming abstraction and parallel runtime:
  "Here's an operation, run it on all of the data"
• RDDs are the key concept
4.
Apache Spark Driver and Workers
⢠A Spark program is two programs:
⢠A driver program and a workers program
⢠Worker programs run on cluster nodes or in
local threads
⢠RDDs (Resilient Distributed Datasets) are
distributed
5.
Spark Essentials: Master
• The master parameter for a SparkContext determines which type and size of cluster to use
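As an illustrative sketch (not from the original slides), here is one way the master parameter is typically set; the app name and thread counts are placeholders:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("demo").setMaster("local[*]")   # use all local cores
# Other common master values (illustrative):
#   "local"              - one local thread
#   "local[4]"           - four local threads
#   "spark://host:7077"  - a standalone Spark cluster
#   "yarn"               - a YARN-managed cluster (e.g., HDInsight)
sc = SparkContext(conf=conf)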
6.
Spark Context
⢠ASpark program first creates a SparkContext object
Âť Tells Spark how and where to access a cluster
Âť pySpark shell and Databricks cloud automatically create the sc variable
Âť iPython and programs must use a constructor to create a new
SparkContext
⢠Use SparkContext to create RDDs
7.
Resilient Distributed Datasets
• The primary abstraction in Spark
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collections of elements in parallel
• You construct RDDs
» by parallelizing existing Python collections (lists)
» by transforming an existing RDD
» from files in HDFS or any other storage system
8.
RDDs
⢠Spark revolvesaround the concept of a resilient distributed dataset (RDD),
which is a fault-tolerant collection of elements that can be operated on in
parallel.
⢠Two types of operations: transformations and actions
⢠Transformations are lazy (not computed immediately)
⢠Transformed RDD is executed when action runs on it
⢠Persist (cache) RDDs in memory or disk
10.
Creating an RDD
• Create RDDs from Python collections (lists)
• From HDFS, text files, Hypertable, Amazon S3, Apache HBase, SequenceFiles, any other Hadoop InputFormat, and directory or glob wildcards: /data/201404*
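For example (paths other than the slide's /data/201404* glob are illustrative placeholders):

nums = sc.parallelize([1, 2, 3, 4, 5])                 # from a Python list
logs = sc.textFile("/data/201404*")                    # directory / glob wildcard
hdfs_rdd = sc.textFile("hdfs:///user/me/input.txt")    # HDFS path
s3_rdd = sc.textFile("s3a://my-bucket/data/part-*")    # Amazon S3 path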
11.
Working with RDDs
• Create an RDD from a data source: <list>
• Apply transformations to an RDD: map, filter
• Apply actions to an RDD: collect, count
12.
Spark Transformations
⢠Createnew datasets from an existing one
⢠Use lazy evaluation: results not computed right away â
⢠instead Spark remembers set of transformations applied to base dataset
Âť Spark optimizes the required calculations
Âť Spark recovers from failures and slow workers
⢠Think of this as a recipe for creating result
13.
Python lambda Functions
• Small anonymous functions (not bound to a name)
  lambda a, b: a + b
» Returns the sum of its two arguments
• Can use lambda functions wherever function objects are required
• Restricted to a single expression
14.
Spark Actions
⢠CauseSpark to execute recipe to transform source
⢠Mechanism for getting results out of Spark
15.
Spark Program Lifecycle
1. Create RDDs from external data or parallelize a collection in your driver program
2. Lazily transform them into new RDDs
3. cache() some RDDs for reuse -- IMPORTANT
4. Perform actions to execute parallel computation and produce results
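Putting the lifecycle together in one sketch (the log path and field layout are assumptions):

lines = sc.textFile("/data/logs/*.log")             # 1. create from external data
errors = lines.filter(lambda l: "ERROR" in l)       # 2. lazy transformations
hosts = errors.map(lambda l: l.split("\t")[0])
hosts.cache()                                       # 3. cache for reuse
total = hosts.count()                               # 4. actions run the parallel computation
top10 = hosts.take(10)                              #    ...and produce results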
16.
pySpark Shared Variables
• Broadcast Variables
» Efficiently send a large, read-only value to all workers
» Saved at workers for use in one or more Spark operations
» Like sending a large, read-only lookup table to all the nodes
At the driver: broadcastVar = sc.broadcast([1, 2, 3])
At a worker: broadcastVar.value
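Extending the snippet above into a small lookup-table sketch (the table contents are made up):

country_names = sc.broadcast({"US": "United States", "DE": "Germany"})   # shipped once per worker
codes = sc.parallelize(["US", "DE", "US"])
resolved = codes.map(lambda c: country_names.value.get(c, "Unknown"))
print(resolved.collect())   # ['United States', 'Germany', 'United States']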
17.
⢠Accumulators
Âť Aggregatevalues from workers back to driver
Âť Only driver can access value of accumulator
Âť For tasks, accumulators are write-only
Âť Use to count errors seen in RDD across workers
>>> accum = sc.accumulator(0)
>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> def f(x):
>>> global accum
>>> accum += x
>>> rdd.foreach(f)
>>> accum.value
Value: 10
18.
Visualizing Big Data in the browser
⢠Challenges:
⢠Manipulating large data can take long time
ďMemory: caching -> Scale clusters
ďCPU: Parallelism -> Scale clusters
⢠We have more data points than possible pixels
> Summarize: Aggregation, Pivoting (more data than pixels)
> Model (Clustering, Classification, D. Reduction, âŚetc)
> Sample: approximate (faster) and exact sampling
⢠Internal Tools: Matplotlib, GGPlot, D3, SVC, and more.
19.
Spark Kernels and MAGIC keywords
⢠PySpark kernel supports set of %%MAGIC keywords
⢠It supports built-in IPython built-in magics, including %%sh.
⢠Auto visualization
⢠Magic keywords:
⢠%%SQL % Spark SQL
⢠%%lsmagic % List all supported magic keywords (Important)
⢠%env % Set environment variable
⢠%run % Execute python code
⢠%who % List all variables of global scope
⢠Run code from a different kernel in a notebook.
Spark in Azure
• Create clusters in a few clicks
• Apache Spark comes only on Linux OS
• Multiple HDP versions
• Comes preloaded with: SSH, Hive, Oozie, DLS, VNets
• Multiple storage options:
• Azure Storage
• ADL Store
• External metadata store in a SQL Server database for Hive and Oozie
• All notebooks are stored in the storage account associated with the Spark cluster
• Zeppelin notebook is available on certain Spark versions, but not all
22.
Programming Spark Apps in HDInsight
⢠Supports four kernels in Jupyter in HDInsight Spark clusters in Azure
Thank you
⢠Checkout my blog big data articles: http://mostafa.rocks
⢠Follow me on Twitter: @MostafaElzoghbi
⢠Want some help in building cloud solutions? Contact me to know more.
Editor's Notes
#2
Ref.: https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/
Apache Spark leverages a common execution model for doing multiple tasks like ETL, batch queries, interactive queries, real-time streaming, machine learning, and graph processing on data stored in Azure Data Lake Store. This allows you to use Spark for Azure HDInsight to solve big data challenges in near real-time like fraud detection, click stream analysis, financial alerts, telemetry from connected sensors and devices (Internet of Things, IoT), social analytics, always-on ETL pipelines, and network monitoring.
#3 A) Main concepts to cover for Data Science:
Regression
Classification -- FOCUS
Clustering
Recommendation
B) Building programmable components in Azure ML experiments
C) Working with Azure ML studio
#5 Source:
https://courses.edx.org/c4x/BerkeleyX/CS100.1x/asset/Week2Lec4.pdf
Spark standalone running on two nodes with two workers:
A client process submits an app to the master.
The master instructs one of its workers to launch a driver.
The worker spawns a driver JVM.
The master instructs both workers to launch executors for the app.
The workers spawn executor JVMs.
The driver and executors communicate independently of the cluster's processes.
#6 Running Spark:
Standalone cluster: Spark standalone comes out of the box and ships with its own web UI (to monitor and run apps/jobs).
It consists of a master and workers (also called slaves).
Mesos and YARN are also supported by Spark.
YARN is the only cluster manager on which Spark can access HDFS secured with Kerberos.
YARN is the new generation of Hadoop's MapReduce execution engine and can run MapReduce, Spark, and other types of programs.
#16
For that reason, cache is said to 'break the lineage' as it creates a checkpoint that can be reused for further processing.
Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times like in a loop.
#17 Keep read-only variables cached on workers
» Ship to each worker only once instead of with each task
• Example: efficiently give every worker a large dataset
• Usually distributed using efficient broadcast algorithms
#19 Extensively used in statistics
Spark offers native support for:
• Approximate and exact sampling
• Approximate and exact stratified sampling
Approximate sampling is faster
and is good enough in most cases
#20
1) Jupyter notebooks kernels with Apache Spark clusters in HDInsight
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-notebook-kernels
2) IPython built-in magics
https://ipython.org/ipython-doc/3/interactive/magics.html#cell-magics
Source for tips and magic keywords:
https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/