Using Docker and PySpark
Bryant Crocker
Jan 9 · 7 min read
Recently, I have been playing with PySpark a bit and decided I would write a blog post about using PySpark and Spark SQL. Spark is a great open source tool for munging data and doing machine learning across distributed computing clusters. PySpark is the Python API for Spark. PySpark can be a bit difficult to get up and running on your machine, and Docker is a quick and easy way to get a Spark environment working locally; it is how I run PySpark on my own machine.
What is Docker?
I’ll start by giving an introduction to Docker. According to Wikipedia, “Docker is a computer program that performs operating-system-level virtualization, also known as ‘containerization’ ”. To greatly simplify, Docker creates a walled-off Linux operating system, called a container, that runs software on top of your machine’s OS. For those familiar with virtual machines, a container is roughly a lightweight VM that shares the host’s kernel instead of running on a hypervisor. Containers can be preconfigured with scripts to install specific software and provide customized functionality. Docker Hub is a website that hosts various preconfigured Docker images that can be quickly run on your computer. One of these is jupyter/pyspark-notebook. This is the Docker image we will be using today.
Starting up the Docker container:
Setting up a Docker container on your local machine is pretty simple.
Simply download Docker from the Docker website and run the following command in the terminal:
docker run -it -p 8888:8888 jupyter/pyspark-notebook
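If you want the notebook to be able to read files that live on your machine, such as the CSV used later in this post, you can also mount a local folder into the container when you start it. This is a sketch using Docker’s standard -v flag; the local path is just a placeholder for your own data directory, and /home/jovyan/work is the working directory used by the Jupyter Docker images.

docker run -it -p 8888:8888 -v /path/to/your/data:/home/jovyan/work jupyter/pyspark-notebook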
Navigate to http://localhost:8888 in your browser and you will see the following screen:
In your terminal you should see a token:
Copy and paste this token (the characters following “/?token=”) into the token textbox, and set a password for the Jupyter notebook server in the New Password box.
With that done, you are all set to go! Spark is already installed in the
container. You are all ready to open up a notebook and start writing
some Spark code. I will include a copy of the notebook but I would
recommend entering the code from this article into a new Jupyter
notebook on your local computer. This helps you to learn.
To stop the Docker container and Jupyter notebook server, simply press Ctrl+C in the terminal that is running it.
PySpark Basics
Spark is an open source cluster computing framework written mostly in Scala, with APIs in R, Python, Scala and Java. It is designed mainly for large scale data analysis and machine learning on data that cannot fit into local memory. In this brief tutorial I will not use a dataset that is too big to fit into memory. This tutorial borrows from the official getting started guide: https://spark.apache.org/docs/latest/sql-getting-started.html.
Spark Datatypes:
There are two main datatypes in the Spark ecosystem: Resilient Distributed Datasets, or RDDs (which are kind of like a cross between a Python list and a dictionary), and dataframes (much like dataframes in R and pandas). Both datatypes in Spark are partitioned and immutable, which means you cannot change the object in place; a new one is returned instead. In this tutorial I am going to focus on the dataframe datatype.
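To make that concrete, here is a minimal sketch of both types. It assumes the SparkSession named spark that we create a little further down, and the small letter/number columns are made up purely for illustration.

# build a small RDD from a python list of tuples
rdd = spark.sparkContext.parallelize([('A', 1), ('B', 2), ('C', 3)])
print(rdd.take(2))

# the same data as a dataframe with named columns
small_df = spark.createDataFrame(rdd, ['letter', 'number'])

# dataframes are immutable: withColumn returns a new dataframe rather than modifying small_df
doubled = small_df.withColumn('number', small_df['number'] * 2)
doubled.show()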
The Dataset:
The dataset that I will be using is a somewhat large Vermont Vendor Payments dataset from the Vermont Open Data Socrata portal. It can be downloaded easily by following the link.
Setting up a Spark session:
This code snippet starts up the PySpark environment in the Docker container and imports basic libraries for numerical computing.
# import necessary libraries
import pandas as pd
import numpy
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

# create sparksession
spark = SparkSession \
    .builder \
    .appName("Pysparkexample") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
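Note that the spark.some.config.option line is copied straight from the official getting started guide and is only a placeholder; you can drop it, or swap in a real setting such as spark.executor.memory, without changing anything else in this post.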
Reading in a CSV:
I wanted to start by comparing reading in a CSV with pandas versus Spark. Spark ends up reading in the CSV much faster than pandas, which gives a taste of how Spark can outperform pandas on larger data (although Spark’s lazy evaluation and distributed execution make one-to-one comparisons a bit rough).
For this analysis I will read in the data using the inferSchema option and cast the Amount column to a double.
df = spark.read.csv('Vermont_Vendor_Payments (1).csv', header='true', inferSchema=True)
df = df.withColumn("Amount", df["Amount"].cast("double"))
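If you want to see the speed difference yourself, a rough timing comparison is easy to put together. This is just a sketch using Python’s time module; the exact numbers will depend on your machine and on the size of the file.

import time

start = time.time()
pandas_df = pd.read_csv('Vermont_Vendor_Payments (1).csv')
print('pandas read took', round(time.time() - start, 2), 'seconds')

start = time.time()
spark_df = spark.read.csv('Vermont_Vendor_Payments (1).csv', header='true', inferSchema=True)
print('Spark read took', round(time.time() - start, 2), 'seconds')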
Basic Spark Methods:
Like with pandas, we can access the column names through the .columns attribute of the dataframe.
#we can use the columns attribute just like with pandas
columns = df.columns
print('The column names are:')
for i in columns:
    print(i)
We can get the number of rows using the .count() method and we can
get the number of columns by taking the length of the column names.
print('The total number of rows is:', df.count(), '\nThe total number of columns is:', len(df.columns))
The .show() method prints the first 20 rows of the dataframe by
default. I chose to only print 5 in this article.
#show first 5 rows
df.show(5)
The .head() method can also be used to display the first row. This prints
much nicer in the notebook.
#show first row
df.head()
Like in pandas, we can call the describe method to get basic numerical summaries of the data. We need to use the .show() method to print it to the notebook, and the result does not display very nicely.
df.describe().show()
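One workaround, if you prefer the notebook’s usual table rendering, is to convert the small summary dataframe to pandas before displaying it. This is just a convenience on top of the code above, not something the example requires.

# describe() returns a small Spark dataframe, so converting it to pandas is cheap
df.describe().toPandas()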
Querying the data:
One of the strengths of Spark is that it can be queried with each language’s respective Spark library or with Spark SQL. I will demonstrate a few queries using both the pythonic and the SQL options.
The following code registers a temporary view and selects a few columns using SQL syntax:
# register the dataframe as a temporary view so it can be queried with SQL
df.createOrReplaceTempView('VermontVendor')

spark.sql(
'''
SELECT `Quarter Ending`, Department, Amount, State FROM VermontVendor
LIMIT 10
'''
).show()
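One small thing worth pointing out: column names that contain spaces, like Quarter Ending, have to be wrapped in backticks in Spark SQL, which is why that column is quoted in the query above and the others are not.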
This code performs pretty much the same operation using pythonic
syntax:
df.select('Quarter Ending', 'Department', 'Amount', 'State').show(10)
One thing to note is that the pythonic solution is significantly less code. I like SQL and its syntax, so I prefer the SQL interface over the pythonic one.
I can filter the rows returned by my query using the SQL WHERE clause:
spark.sql(
'''
SELECT `Quarter Ending`, Department, Amount, State FROM VermontVendor
WHERE Department = 'Education'
LIMIT 10
'''
).show()
A similar result can be achieved with the .filter() method in the python
API.
df.select('Quarter Ending', 'Department', 'Amount', 'State').filter(df['Department'] == 'Education').show(10)
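If you need more than one condition, the python API lets you combine filters with & and |. This is a small sketch building on the dataframe above; the 1000 dollar threshold is just an arbitrary example value.

# each condition must be wrapped in parentheses when combined with & or |
df.select('Quarter Ending', 'Department', 'Amount', 'State') \
    .filter((df['Department'] == 'Education') & (df['Amount'] > 1000)) \
    .show(10)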
Plotting
Unfortunately, you cannot directly create plots from a Spark dataframe. The simplest solution is to use the .toPandas() method to convert the result of your Spark computations to a pandas dataframe. I give a couple of examples below.
plot_df = spark.sql(
'''
SELECT Department, SUM(Amount) as Total FROM VermontVendor
GROUP BY Department
ORDER BY Total DESC
LIMIT 10
'''
).toPandas()

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
plot_df.plot(x='Department', y='Total', kind='barh', color='C0', ax=ax, legend=False)
# with a horizontal bar plot the departments sit on the y axis and the totals on the x axis
ax.set_xlabel('Total', size=16)
ax.set_ylabel('Department', size=16)
plt.savefig('barplot.png')
plt.show()
import numpy as np
import seaborn as sns

plot_df2 = spark.sql(
'''
SELECT Department, SUM(Amount) as Total FROM VermontVendor
GROUP BY Department
'''
).toPandas()

plt.figure(figsize=(10, 6))
sns.distplot(np.log(plot_df2['Total']))
plt.title('Histogram of Log Totals for all Departments in Dataset', size=16)
plt.ylabel('Density', size=16)
plt.xlabel('Log Total', size=16)
plt.savefig('distplot.png')
plt.show()
Starting up your Docker container again:
Once you have started and exited your Docker container the first time, you will start it differently for future uses, since the container has already been created.
Run the following command to list all of your containers, including stopped ones:
docker ps -a
Get the container id from the terminal:
Then run docker start with the container id to start the container:
docker start 903f152e92c5
Your Jupyter notebook server will then again be running on
http://localhost:8888.
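When you are done working, you can shut the container down by id as well; docker stop is the standard counterpart to docker start.

docker stop 903f152e92c5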
The full code with a few more examples can be found on my github:
https://github.com/crocker456/PlayingWithPyspark
Sources:
PySpark 2.0: The size or shape of a DataFrame (stackoverflow.com)
Getting Started - Spark 2.4.0 Documentation (spark.apache.org)