Introduction to Spark with Python
Gökhan Atıl

GÖKHAN ATIL
➤ Database Administrator
➤ Oracle ACE Director (2016), Oracle ACE (2011)
➤ 10g/11g and R12 Oracle Certified Professional (OCP)
➤ Co-author of Expert Oracle Enterprise Manager 12c
➤ Founding Member and Vice President of TROUG
➤ Blogger (since 2008) gokhanatil.com
➤ Twitter: @gokhanatil
APACHE SPARK WITH PYTHON
➤ Introduction to Apache Spark
➤ Why Python (PySpark) instead of Scala?
➤ Spark RDD
➤ SQL and DataFrames
➤ Spark Streaming
➤ Spark GraphX
➤ Spark MLlib (Machine Learning)
INTRODUCTION TO APACHE SPARK
➤ A fast and general engine for large-scale data processing
➤ Top-Level Apache Project since 2014.
➤ A response to the limitations of the MapReduce model
➤ Runs programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk
➤ Implemented in the Scala programming language; supports
Java, Scala, Python and R
➤ Runs on Hadoop, Mesos, Kubernetes, standalone, cloud
DOWNLOAD AND RUN ON YOUR PC
➤ https://spark.apache.org/downloads.html
➤ Extract and Spark is ready:
tar -xzf spark-2.3.0-bin-hadoop2.7.tgz
spark-2.3.0-bin-hadoop2.7/bin/pyspark
➤ You can also use PIP:
pip install pyspark
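➤ A quick way to check the installation (a minimal sketch; the local master and app name are arbitrary):
from pyspark import SparkContext
sc = SparkContext( "local[*]", "verify_install" )   # start a local Spark context
print( sc.version )                                 # prints the Spark version, e.g. 2.3.0
sc.stop()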
PYSPARK AND SPARK-SUBMIT
➤ PySpark is the interface that gives access to Spark using the
Python programming language
➤ The spark-submit script in Spark’s bin directory is used to
launch applications on a cluster
spark-submit example1.py
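➤ A minimal example1.py that could be submitted this way (the script body is an illustrative assumption, not taken from the slides):
from pyspark import SparkContext

sc = SparkContext( appName="example1" )     # the master is supplied by spark-submit
rdd = sc.parallelize( range(1, 101) )       # distribute the numbers 1..100
print( rdd.sum() )                          # 5050, computed on the executors
sc.stop()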
WHY PYTHON INSTEAD OF SCALA?
➤ If you know Scala, then use Scala!
➤ Learning curve: Python is comparatively easier to learn
➤ Easy to use: Code readability, maintainability and familiarity are
far better with Python
➤ Libraries: Python comes with great libraries for data analysis,
statistics and visualization (NumPy, pandas, matplotlib etc.)
➤ Performance: Scala is faster than Python, but if your Python
code just calls Spark libraries, the difference in performance is
minimal (*)
(*) Reminder: Any new feature added to the Spark API will be
available in Scala first
RESILIENT DISTRIBUTED DATASET (RDD)
➤ RDDs are the core data structure in Spark
➤ Distributed, resilient, immutable; can store both unstructured and
structured data; lazily evaluated (see the sketch below)
(Diagram: one RDD split into partitions, with partition 1 on node 1, partition 2 on node 2 and partition 3 on node 3.)
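➤ A minimal sketch that makes the partitioning visible (the partition count of 3 is arbitrary):
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
rdd = sc.parallelize( range(12), 3 )        # spread the data over 3 partitions
print( rdd.getNumPartitions() )             # 3
print( rdd.glom().collect() )               # the elements held by each partition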
RDD TRANSFORMATIONS AND ACTIONS
(Diagram: the SparkContext creates an RDD with sc.textFile(*); transformations such as map(*), filter(*) and reduceByKey(*) are evaluated lazily, and nothing runs until an action such as collect() is called.)
TRANSFORMATIONS AND ACTIONS
➤ Transformations: map, filter, flatMap, mapPartitions,
reduceByKey, union, intersection, join
➤ Actions: collect, count, first, take, takeSample, takeOrdered,
saveAsTextFile, foreach
(A short sketch combining both follows this list.)
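➤ A minimal sketch that chains a few of these transformations and finishes with an action (the sample words are made up):
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
words = sc.parallelize( ["spark", "python", "spark", "scala"] )
pairs = words.map( lambda w: (w, 1) )              # transformation: build (word, 1) pairs
counts = pairs.reduceByKey( lambda x, y: x + y )   # transformation: sum the counts per word
print( counts.collect() )                          # action: triggers the evaluation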
HOW TO CREATE RDD IN PYSPARK
➤ Referencing a dataset in an external storage system:
rdd = sc.textFile( ... )
➤ Parallelizing an already existing collection:
rdd = sc.parallelize( ... )
➤ Creating a new RDD from an already existing RDD (all three approaches are combined in the sketch below):
rdd2 = rdd1.map( ... )
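➤ The three approaches side by side (a minimal sketch; users.csv is the file used in the next examples):
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
rdd_file = sc.textFile( "users.csv" )        # from an external storage system
rdd_coll = sc.parallelize( [1, 2, 3, 4] )    # from an existing Python collection
rdd_new = rdd_coll.map( lambda x: x * 2 )    # from an already existing RDD
print( rdd_new.collect() )                   # [2, 4, 6, 8]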
USERS.CSV (MOVIELENS DATASET)
id | age | gender | occupation | zip
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
(In the full users.csv: M = 670 users, F = 273 users.)
EXAMPLE #1: USE RDD TO GROUP DATA FROM CSV
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
print( sc.textFile( "users.csv" )
    .map( lambda x: ( x.split("|")[2], 1 ) )     # emit pairs such as (M, 1), (F, 1)
    .reduceByKey( lambda x, y: x + y )           # sum the 1s for each gender
    .collect() )
sc.stop()

Output: [(u'M', 670), (u'F', 273)]
SPARK SQL AND DATAFRAMES
➤ Spark SQL is Apache Spark's module for working with
structured data
(Diagram: SQL and DataFrames/Datasets feed the Catalyst optimizer, which produces RDD operations; MLlib, GraphFrames and Structured Streaming are built on top of Spark SQL.)
DATAFRAMES AND DATASETS
➤ DataFrame is a distributed collection of "structured" data, organized
into named columns.
➤ Spark Datasets are statically typed, while Python is a dynamically
typed language, so PySpark supports only DataFrames.
EXAMPLE #2: USE DATAFRAME TO GROUP DATA FROM CSV
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
spark.read.load( "users.csv", format="csv", sep="|" ) \
    .toDF( "id","age","gender","occupation","zip" ) \
    .groupBy( "gender" ).count().show()
sc.stop()
DATAFRAME VERSUS RDD
CATALYST OPTIMIZER
➤ Spark SQL uses the Catalyst optimizer to optimize query plans.
➤ Supports cost-based optimization since Spark 2.2
(Diagram: SQL, DataFrame and Dataset queries become a query plan, Catalyst produces an optimized query plan, and code generation turns it into RDD operations.)
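➤ You can see what Catalyst does to a query with explain() (a minimal sketch; the filter is arbitrary):
df = spark.read.load( "users.csv", format="csv", sep="|" ) \
    .toDF( "id","age","gender","occupation","zip" )
df.filter( df.age.cast("int") > 30 ).groupBy( "gender" ).count() \
    .explain( True )    # prints the parsed, analyzed, optimized and physical plans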
CONVERSION BETWEEN RDD AND DATAFRAME
➤ An RDD can be converted to a DataFrame using the
createDataFrame or toDF method (a createDataFrame sketch follows below):
rdd = sc.parallelize([("osman",21),("ahmet",25)])
df = rdd.toDF( "name STRING, age INT" )
df.show()
➤ You can access the underlying RDD of a DataFrame using the rdd
property:
df.rdd.collect()
[Row(name=u'osman',age=21),Row(name=u'ahmet',age=25)]
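➤ The same conversion using createDataFrame (a minimal sketch; the DDL-string schema requires Spark 2.3+):
rdd = sc.parallelize( [("osman", 21), ("ahmet", 25)] )
df = spark.createDataFrame( rdd, "name STRING, age INT" )   # schema given as a DDL string
df.show()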
EXAMPLE #3: CREATE TEMPORARY VIEWS FROM DATAFRAMES
spark.read.load( "users.csv", format="csv", sep="|" ) \
    .toDF( "id","age","gender","occupation","zip" ) \
    .createOrReplaceTempView( "users" )

spark.sql( "select count(*) from users" ).show()

spark.sql( "select case when age < 25 then '-25' "
    "when age between 25 and 39 then '25-40' "
    "when age >= 40 then '40+' end age_group, "
    "count(*) from users group by age_group order by 1" ).show()
EXAMPLE #4: READ AND WRITE DATA
df = spark.read.load( "users.csv", format="csv", sep="|" ) \
    .toDF( "id","age","gender","occupation","zip" )

df.write.saveAsTable( "users" )        # stores the table in the Hive metastore / warehouse

df.write.save( "users.json", format="json", mode="overwrite" )

spark.sql( "SELECT gender, count(*) FROM "
    "json.`users.json` GROUP BY gender" ).show()
SPARK STREAMING (DSTREAMS)
➤ Scalable, high-throughput, fault-tolerant stream processing of
live data streams
➤ Supports: File, Socket, Kafka, Flume, Kinesis
➤ Spark Streaming receives live input data streams and divides
the data into batches
EXAMPLE #5: DISCRETIZED STREAMS (DSTREAMS)
from pyspark.streaming import StreamingContext

ssc = StreamingContext( sc, 1 )                            # micro-batches of 1 second
stream_data = ssc.textFileStream( "file:///tmp/stream" ) \
    .map( lambda x: x.split(",") )
stream_data.pprint()
ssc.start()
ssc.awaitTermination()
EXAMPLE #5: OUTPUT
(screenshot of the pprint() console output)
STRUCTURED STREAMING
➤ Stream processing engine built on the Spark SQL engine
➤ Supports File and Kafka sources for production; Socket and
Rate sources for testing
EXAMPLE #6: STRUCTURED STREAMING
stream_data = spark.readStream \
    .load( format="csv", path="/tmp/stream/*.csv",
           schema="name string, points int" ) \
    .groupBy( "name" ).sum( "points" ) \
    .orderBy( "sum(points)", ascending=0 )

stream_data.writeStream.start( format="console",
    outputMode="complete" ).awaitTermination()
EXAMPLE #6: OUTPUT
(screenshot of the streaming console output)
GRAPHX (GRAPHFRAMES)
➤ GraphX is Spark's component for graphs and graph-parallel
computation; from Python, it is used through the GraphFrames package
EXAMPLE #7: GRAPHFRAMES
vertex = spark.createDataFrame([
    (1, "Ahmet"),
    (2, "Mehmet"),
    (3, "Cengiz"),
    (4, "Osman")],
    ["id", "name"])

edges = spark.createDataFrame([
    (1, 2, "friend"),
    (2, 1, "friend"),
    (2, 3, "friend"),
    (3, 2, "friend"),
    (2, 4, "friend"),
    (4, 2, "friend"),
    (3, 4, "friend"),
    (4, 3, "friend")],
    ["src", "dst", "relation"])
EXAMPLE #7: GRAPHFRAMES
pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11

import graphframes as gf

g = gf.GraphFrame(vertex, edges)
g.shortestPaths([4]).show()      # shortest path from every vertex to vertex 4 (Osman)
MLLIB (MACHINE LEARNING)
➤ Supports common ML algorithms such as classification,
regression, clustering, and collaborative filtering
➤ Featurization:
➤ Feature extraction (TF-IDF, Word2Vec, CountVectorizer ...)
➤ Transformation (Tokenizer, StopWordsRemover ...)
➤ Selection (VectorSlicer, RFormula ... )
➤ Pipelines: combine multiple algorithms into a single pipeline, or
workflow (see the sketch below)
➤ The DataFrame-based API is the primary API
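➤ A minimal Pipeline sketch that chains a feature transformer and a classifier (the sample data, column names and chosen stages are illustrative):
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

train = spark.createDataFrame(
    [("spark is fast", 1.0), ("i like pandas", 0.0)], ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")    # split the text into words
tf = HashingTF(inputCol="words", outputCol="features")       # turn words into feature vectors
lr = LogisticRegression(maxIter=10)                          # uses "features" and "label" by default

model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)      # run the whole workflow as one unit
model.transform(train).select("text", "prediction").show()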
EXAMPLE #8: ALTERNATING LEAST SQUARES (ALS)
from pyspark.ml.recommendation import ALS

def parseratings( x ):
    v = x.split("::")
    return ( int(v[0]), int(v[1]), float(v[2]) )

ratings = sc.textFile( "ratings.dat" ).map( parseratings ) \
    .toDF( ["user", "id", "rating"] )

als = ALS(userCol="user", itemCol="id", ratingCol="rating")
model = als.fit(ratings)
model.recommendForAllUsers(10).show()
EXAMPLE #8: OUTPUT
(screenshot of the recommendForAllUsers() result)
Blog: www.gokhanatil.com Twitter: @gokhanatil
