Using Docker and PySpark
Bryant Crocker
Jan 9 · 7 min read
Recently, I have been playing with PySpark a bit and decided I would write a blog post about using PySpark and Spark SQL. Spark is a great open source tool for munging data and doing machine learning across distributed computing clusters. PySpark is the Python API for Spark. PySpark can be a bit difficult to get up and running on your machine, and Docker is a quick and easy way to get a Spark environment working locally; it is how I run PySpark on my own machine.
What is Docker?
I’ll start by giving an introduction to Docker. According to Wikipedia, “Docker is a computer program that performs operating-system-level virtualization, also known as ‘containerization’ ”. To greatly simplify, Docker creates a walled-off Linux operating system, called a container, that runs software on top of your machine’s OS. For those familiar with virtual machines, a container is roughly a lightweight VM that shares the host’s kernel instead of running on a hypervisor. Containers can be preconfigured with scripts to install specific software and provide customized functionality. Docker Hub is a website that hosts various preconfigured Docker images that can be quickly run on your computer. One of these is jupyter/pyspark-notebook. This is the Docker image we will be using today.
Starting up the Docker container:
Setting up a Docker container on your local machine is pretty simple.
Simply download Docker from the Docker website and run the following command in the terminal:
docker run -it -p 8888:8888 jupyter/pyspark-notebook
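If you want the notebook to be able to read files that live on your machine, such as the CSV used later in this post, you can also mount a local folder into the container when you start it. This is a sketch using Docker’s standard -v flag; the local path is just a placeholder for your own data directory, and /home/jovyan/work is the working directory used by the Jupyter Docker images.

docker run -it -p 8888:8888 -v /path/to/your/data:/home/jovyan/work jupyter/pyspark-notebook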
Navigate to http://localhost:8888 in your browser and you will see the following screen:
In your terminal you should see a token:
Copy and paste this token (the characters following “/?token=”) into the token textbox, and set a password for the Jupyter notebook server in the New Password box.
With that done, you are all set to go! Spark is already installed in the
container. You are all ready to open up a notebook and start writing
some Spark code. I will include a copy of the notebook but I would
recommend entering the code from this article into a new Jupyter
notebook on your local computer. This helps you to learn.
To stop the Docker container and Jupyter notebook server, simply press Ctrl+C in the terminal that is running it.
PySpark Basics
Spark is an open source cluster computing framework written mostly in Scala, with APIs in R, Python, Scala and Java. It is designed mainly for large scale data analysis and machine learning on data that cannot fit into local memory. In this brief tutorial I will not use a dataset that is too big to fit into memory. This tutorial borrows from the official getting started guide: https://spark.apache.org/docs/latest/sql-getting-started.html.
Spark Datatypes:
There are two main datatypes in the Spark ecosystem: Resilient Distributed Datasets, or RDDs (which are kind of like a cross between a Python list and a dictionary), and dataframes (much like dataframes in R and pandas). Both datatypes in Spark are partitioned and immutable, which means you cannot change the object in place; a new one is returned instead. In this tutorial I am going to focus on the dataframe datatype.
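To make that concrete, here is a minimal sketch of both types. It assumes the SparkSession named spark that we create a little further down, and the small letter/number columns are made up purely for illustration.

# build a small RDD from a python list of tuples
rdd = spark.sparkContext.parallelize([('A', 1), ('B', 2), ('C', 3)])
print(rdd.take(2))

# the same data as a dataframe with named columns
small_df = spark.createDataFrame(rdd, ['letter', 'number'])

# dataframes are immutable: withColumn returns a new dataframe rather than modifying small_df
doubled = small_df.withColumn('number', small_df['number'] * 2)
doubled.show()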
The Dataset:
The dataset that I will be using is a somewhat large Vermont Vendor Payments dataset from the Vermont Open Data Socrata portal. It can be downloaded easily by following the link.
Setting up a Spark session:
This code snippet starts up the PySpark environment in the Docker container and imports basic libraries for numerical computing.
# import necessary libraries
import pandas as pd
import numpy
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

# create sparksession
spark = SparkSession \
    .builder \
    .appName("Pysparkexample") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
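Note that the spark.some.config.option line is copied straight from the official getting started guide and is only a placeholder; you can drop it, or swap in a real setting such as spark.executor.memory, without changing anything else in this post.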
Reading in a CSV:
I wanted to start by comparing reading in a CSV with pandas versus Spark. Spark ends up reading in the CSV much faster than pandas, which gives a taste of how Spark can outperform pandas on larger data (although Spark’s lazy evaluation and distributed execution make one-to-one comparisons a bit rough).
For this analysis I will read in the data using the inferSchema option and cast the Amount column to a double.
df = spark.read.csv('Vermont_Vendor_Payments (1).csv', header='true', inferSchema=True)
df = df.withColumn("Amount", df["Amount"].cast("double"))
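If you want to see the speed difference yourself, a rough timing comparison is easy to put together. This is just a sketch using Python’s time module; the exact numbers will depend on your machine and on the size of the file.

import time

start = time.time()
pandas_df = pd.read_csv('Vermont_Vendor_Payments (1).csv')
print('pandas read took', round(time.time() - start, 2), 'seconds')

start = time.time()
spark_df = spark.read.csv('Vermont_Vendor_Payments (1).csv', header='true', inferSchema=True)
print('Spark read took', round(time.time() - start, 2), 'seconds')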
Basic Spark Methods:
Like with pandas, we can access the column names through the .columns attribute of the dataframe.
#we can use the columns attribute just like with pandas
columns = df.columns
print('The column names are:')
for i in columns:
    print(i)
We can get the number of rows using the .count() method and we can
get the number of columns by taking the length of the column names.
print('The total number of rows is:', df.count(), '\nThe total number of columns is:', len(df.columns))
The .show() method prints the first 20 rows of the dataframe by
default. I chose to only print 5 in this article.
#show first 5 rows
df.show(5)
The .head() method can also be used to display the first row. This prints
much nicer in the notebook.
#show first row
df.head()
Like in pandas, we can call the describe method to get basic numerical summaries of the data. We need to use the .show() method to print it to the notebook, and the result does not display very nicely.
df.describe().show()
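One workaround, if you prefer the notebook’s usual table rendering, is to convert the small summary dataframe to pandas before displaying it. This is just a convenience on top of the code above, not something the example requires.

# describe() returns a small Spark dataframe, so converting it to pandas is cheap
df.describe().toPandas()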
Querying the data:
One of the strengths of Spark is that it can be queried with each language’s respective Spark library or with Spark SQL. I will demonstrate a few queries using both the pythonic and the SQL options.
The following code registers a temporary view and selects a few columns using SQL syntax:
# register the dataframe as a temporary view so it can be queried with SQL
df.createOrReplaceTempView('VermontVendor')

spark.sql(
'''
SELECT `Quarter Ending`, Department, Amount, State FROM VermontVendor
LIMIT 10
'''
).show()
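One small thing worth pointing out: column names that contain spaces, like Quarter Ending, have to be wrapped in backticks in Spark SQL, which is why that column is quoted in the query above and the others are not.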
This code performs pretty much the same operation using pythonic
syntax:
df.select('Quarter Ending', 'Department', 'Amount', 'State').show(10)
One thing to note is that the pythonic solution is significantly less code. I like SQL and its syntax, so I prefer the SQL interface over the pythonic one.
I can filter the rows returned by my query using the SQL WHERE clause:
spark.sql(
'''
SELECT `Quarter Ending`, Department, Amount, State FROM VermontVendor
WHERE Department = 'Education'
LIMIT 10
'''
).show()
A similar result can be achieved with the .filter() method in the python
API.
df.select('Quarter Ending', 'Department', 'Amount', 'State').filter(df['Department'] == 'Education').show(10)
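If you need more than one condition, the python API lets you combine filters with & and |. This is a small sketch building on the dataframe above; the 1000 dollar threshold is just an arbitrary example value.

# each condition must be wrapped in parentheses when combined with & or |
df.select('Quarter Ending', 'Department', 'Amount', 'State') \
    .filter((df['Department'] == 'Education') & (df['Amount'] > 1000)) \
    .show(10)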
Plotting
Unfortunately, you cannot directly create plots from a Spark dataframe. The simplest solution is to use the .toPandas() method to convert the result of your Spark computations to a pandas dataframe. I give a couple of examples below.
plot_df = spark.sql(
'''
SELECT Department, SUM(Amount) as Total FROM VermontVendor
GROUP BY Department
ORDER BY Total DESC
LIMIT 10
'''
).toPandas()

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
plot_df.plot(x='Department', y='Total', kind='barh', color='C0', ax=ax, legend=False)
# with a horizontal bar plot the departments sit on the y axis and the totals on the x axis
ax.set_xlabel('Total', size=16)
ax.set_ylabel('Department', size=16)
plt.savefig('barplot.png')
plt.show()
import numpy as np
import seaborn as sns

plot_df2 = spark.sql(
'''
SELECT Department, SUM(Amount) as Total FROM VermontVendor
GROUP BY Department
'''
).toPandas()

plt.figure(figsize=(10, 6))
sns.distplot(np.log(plot_df2['Total']))
plt.title('Histogram of Log Totals for all Departments in Dataset', size=16)
plt.ylabel('Density', size=16)
plt.xlabel('Log Total', size=16)
plt.savefig('distplot.png')
plt.show()
Starting up your Docker container again:
Once you have started and exited your Docker container the first time, you will start it differently for future uses, since the container has already been created.
Run the following command to list all of your containers, including stopped ones:
docker ps -a
Get the container id from the terminal:
Then run docker start with the container id to start the container:
docker start 903f152e92c5
Your Jupyter notebook server will then again be running on
http://localhost:8888.
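When you are done working, you can shut the container down by id as well; docker stop is the standard counterpart to docker start.

docker stop 903f152e92c5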
The full code with a few more examples can be found on my github:
https://github.com/crocker456/PlayingWithPyspark
Sources:
PySpark 2.0: The size or shape of a DataFrame (stackoverflow.com)
Getting Started - Spark 2.4.0 Documentation (spark.apache.org)