Spark DataFrames Project Exercise
Let's get some quick practice with your new Spark DataFrame skills. You will be asked some basic questions
about stock market data, in this case Walmart stock from 2012 through 2016. This exercise just asks
a series of questions, unlike the future machine learning exercises, which will be a little looser and take the
form of "Consulting Projects", but more on that later!
For now, just answer the questions and complete the tasks below.
Use the walmart_stock.csv file to answer and complete the tasks below!
Start a simple Spark Session
In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('walmart').getOrCreate()
Load the Walmart Stock CSV File, have Spark infer the data types.
In [2]:
df = spark.read.csv('walmart_stock.csv', inferSchema=True, header=True)
What are the column names?
In [3]:
df.columns
Out[3]:
['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
What does the Schema look like?
In [5]:
df.printSchema()
root
|-- Date: string (nullable = true)
|-- Open: double (nullable = true)
|-- High: double (nullable = true)
|-- Low: double (nullable = true)
|-- Close: double (nullable = true)
|-- Volume: integer (nullable = true)
|-- Adj Close: double (nullable = true)
Print out the first 5 rows.
In [8]:
for line in df.head(5):
    print(line, '\n')
Row(Date='2012-01-03', Open=59.970001, High=61.060001, Low=59.869999, Close=60.330002, Volume=12668800, Adj Close=52.619234999999996)
Row(Date='2012-01-04', Open=60.209998999999996, High=60.349998, Low=59.470001, Close=59.709998999999996, Volume=9593300, Adj Close=52.078475)
Row(Date='2012-01-05', Open=59.349998, High=59.619999, Low=58.369999, Close=59.419998, Volume=12768200, Adj Close=51.825539)
Row(Date='2012-01-06', Open=59.419998, High=59.450001, Low=58.869999, Close=59.0, Volume=8069400, Adj Close=51.45922)
Row(Date='2012-01-09', Open=59.029999, High=59.549999, Low=58.919998, Close=59.18, Volume=6679300, Adj Close=51.616215000000004)
Use describe() to learn about the DataFrame.
In [10]:
df.describe().show()
+-------+----------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|summary|      Date|              Open|             High|              Low|            Close|           Volume|        Adj Close|
+-------+----------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|  count|      1258|              1258|             1258|             1258|             1258|             1258|             1258|
|   mean|      null| 72.35785375357709|72.83938807631165| 71.9186009594594|72.38844998012726|8222093.481717011|67.23883848728146|
| stddev|      null|  6.76809024470826|6.768186808159218|6.744075756255496|6.756859163732991|  4519780.8431556|6.722609449996857|
|    min|2012-01-03|56.389998999999996|        57.060001|        56.299999|        56.419998|          2094900|        50.363689|
|    max|2016-12-30|         90.800003|        90.970001|            89.25|        90.470001|         80898100|84.91421600000001|
+-------+----------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Bonus Question!
There are too many decimal places for mean and stddev in the describe() DataFrame. Format the
numbers to show only up to two decimal places. Pay careful attention to the datatypes that .describe()
returns; we didn't cover how to do this exact formatting, but we covered something very similar. Check
this link for a hint
(http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast)
If you get stuck on this, don't worry; just view the solutions.
In [18]:
from pyspark.sql.types import (StructField, StringType,
                               IntegerType, StructType)

# A manual schema like this could be passed to spark.read.csv via schema=final_struc,
# but describe() returns every column as a string regardless, so the values are simply
# cast and formatted in the cells below; the CSV is re-read with inferSchema instead.
data_schema = [StructField('summary', StringType(), True),
               StructField('Open', StringType(), True),
               StructField('High', StringType(), True),
               StructField('Low', StringType(), True),
               StructField('Close', StringType(), True),
               StructField('Volume', StringType(), True),
               StructField('Adj Close', StringType(), True)
              ]

final_struc = StructType(fields=data_schema)

df = spark.read.csv('walmart_stock.csv', inferSchema=True, header=True)
In [19]:
df.printSchema()
root
|-- Date: string (nullable = true)
|-- Open: double (nullable = true)
|-- High: double (nullable = true)
|-- Low: double (nullable = true)
|-- Close: double (nullable = true)
|-- Volume: integer (nullable = true)
|-- Adj Close: double (nullable = true)
In [22]:
from pyspark.sql.functions import format_number
summary = df.describe()
summary.select(summary['summary'],
               format_number(summary['Open'].cast('float'), 2).alias('Open'),
               format_number(summary['High'].cast('float'), 2).alias('High'),
               format_number(summary['Low'].cast('float'), 2).alias('Low'),
               format_number(summary['Close'].cast('float'), 2).alias('Close'),
               format_number(summary['Volume'].cast('int'), 0).alias('Volume')
               ).show()
+-------+--------+--------+--------+--------+----------+
|summary| Open| High| Low| Close| Volume|
+-------+--------+--------+--------+--------+----------+
| count|1,258.00|1,258.00|1,258.00|1,258.00| 1,258|
| mean| 72.36| 72.84| 71.92| 72.39| 8,222,093|
| stddev| 6.77| 6.77| 6.74| 6.76| 4,519,780|
| min| 56.39| 57.06| 56.30| 56.42| 2,094,900|
| max| 90.80| 90.97| 89.25| 90.47|80,898,100|
+-------+--------+--------+--------+--------+----------+
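If the thousands separators produced by format_number() aren't needed, a similar two-decimal view can be sketched with round() instead (this is just an alternative sketch using the same summary DataFrame as above; the column choice is illustrative):
from pyspark.sql.functions import round as spark_round

# round() keeps the values numeric instead of returning formatted strings
summary.select('summary',
               spark_round(summary['Open'].cast('double'), 2).alias('Open'),
               spark_round(summary['Close'].cast('double'), 2).alias('Close'),
               spark_round(summary['Volume'].cast('double'), 0).alias('Volume')).show()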
Create a new DataFrame with a column called HV Ratio, which is the ratio of the High price to the Volume
of stock traded for a day.
In [23]:
df_hv = df.withColumn('HV Ratio', df['High']/df['Volume']).select(['HV Ratio'])
df_hv.show()
+--------------------+
| HV Ratio|
+--------------------+
|4.819714653321546E-6|
|6.290848613094555E-6|
|4.669412994783916E-6|
|7.367338463826307E-6|
|8.915604778943901E-6|
|8.644477436914568E-6|
|9.351828421515645E-6|
| 8.29141562102703E-6|
|7.712212102001476E-6|
|7.071764823529412E-6|
|1.015495466386981E-5|
|6.576354146362592...|
| 5.90145296180676E-6|
|8.547679455011844E-6|
|8.420709512685392E-6|
|1.041448341728929...|
|8.316075414862431E-6|
|9.721183814992126E-6|
|8.029436027707578E-6|
|6.307432259386365E-6|
+--------------------+
only showing top 20 rows
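The same column can also be built in one step with selectExpr, a minimal sketch (backticks quote the alias because it contains a space):
df.selectExpr('High / Volume as `HV Ratio`').show(5)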
What day had the Peak High in Price?
In [25]:
df.orderBy(df['High'].desc()).select(['Date']).head(1)[0]['Date']
Out[25]:
'2015-01-13'
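As a sanity check, an alternative sketch computes the peak High first and then filters for the matching date (this assumes the peak value occurs on a single day in this dataset):
from pyspark.sql.functions import max as spark_max

peak_high = df.agg(spark_max('High')).collect()[0][0]     # overall peak High price
df.filter(df['High'] == peak_high).select('Date').show()  # the date(s) it occurred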
What is the mean of the Close column?
In [26]:
from pyspark.sql.functions import mean
df.select(mean('Close')).show()
+-----------------+
| avg(Close)|
+-----------------+
|72.38844998012726|
+-----------------+
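The same average can also be obtained through agg() with a column-to-function mapping, a minimal sketch:
df.agg({'Close': 'mean'}).show()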
What is the max and min of the Volume column?
In [27]:
from pyspark.sql.functions import min, max
In [28]:
df.select(max('Volume'),min('Volume')).show()
+-----------+-----------+
|max(Volume)|min(Volume)|
+-----------+-----------+
| 80898100| 2094900|
+-----------+-----------+
How many days was the Close lower than 60 dollars?
In [29]:
df.filter(df['Close'] < 60).count()
Out[29]:
81
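filter() also accepts a SQL expression string, so an equivalent sketch is:
df.filter("Close < 60").count()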
What percentage of the time was the High greater than 80 dollars?
In other words, (Number of Days High>80)/(Total Days in the dataset)
In [107]:
df.filter('High > 80').count() * 100/df.count()
Out[107]:
9.141494435612083
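Written with the column API instead of a SQL string, the same calculation would look roughly like this:
(df.filter(df['High'] > 80).count() / df.count()) * 100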
What is the Pearson correlation between High and Volume?
Hint
(http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameStatFunctions.corr)
In [31]:
from pyspark.sql.functions import corr
df.select(corr('High', 'Volume')).show()
+-------------------+
| corr(High, Volume)|
+-------------------+
|-0.3384326061737161|
+-------------------+
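If a plain Python float is preferred over a one-row DataFrame, the DataFrame method computes the same Pearson correlation, a sketch:
df.corr('High', 'Volume')   # returns approximately -0.3384 as a float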
What is the max High per year?
In [32]:
from pyspark.sql.functions import (dayofmonth, hour,
dayofyear, month,
year, weekofyear,
format_number, date_format)
year_df = df.withColumn('Year', year(df['Date']))
year_df.groupBy('Year').max()['Year', 'max(High)'].show()
+----+---------+
|Year|max(High)|
+----+---------+
|2015|90.970001|
|2013|81.370003|
|2014|88.089996|
|2012|77.599998|
|2016|75.190002|
+----+---------+
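An equivalent sketch that aggregates only the High column and sorts the years:
year_df.groupBy('Year').agg({'High': 'max'}).orderBy('Year').show()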
What is the average Close for each Calendar Month?
In other words, across all the years, what is the average Close price for Jan, Feb, Mar, etc.? Your result
will have a value for each of these months.
In [33]:
month_df = df.withColumn('Month', month(df['Date']))
month_df = month_df.groupBy('Month').mean()
month_df = month_df.orderBy('Month')
month_df['Month', 'avg(Close)'].show()
+-----+-----------------+
|Month| avg(Close)|
+-----+-----------------+
| 1|71.44801958415842|
| 2| 71.306804443299|
| 3|71.77794377570092|
| 4|72.97361900952382|
| 5|72.30971688679247|
| 6| 72.4953774245283|
| 7|74.43971943925233|
| 8|73.02981855454546|
| 9|72.18411785294116|
| 10|71.57854545454543|
| 11| 72.1110893069307|
| 12|72.84792478301885|
+-----+-----------------+
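A similar sketch that aggregates only the Close column instead of every numeric column:
df.withColumn('Month', month(df['Date'])) \
  .groupBy('Month').agg({'Close': 'mean'}) \
  .orderBy('Month').show()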
Great Job!