Code Logic - Retail Data Analysis
This document describes the code and the overall steps taken to solve the project.
Commands used in the project:
1. Command to create a directory in HDFS to store the time-based KPIs
hadoop fs -mkdir time_kpi
2. Command to create a directory in HDFS to be used as a checkpoint while calculating time-based KPIs
hadoop fs -mkdir time_kpi/checkpoint
3. Command to create a directory in HDFS to store the country-based KPIs
hadoop fs -mkdir country_kpi
4. Command to create a directory in HDFS to be used as a checkpoint while calculating country-based KPIs
hadoop fs -mkdir country_kpi/checkpoint
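Note: since these paths are relative, the directories are created under the user's HDFS home directory (typically /user/<username>).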
5. Spark-submit command used
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 spark-streaming.py > console_output
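The --packages coordinate pulls in the Kafka source for Structured Streaming; spark-sql-kafka-0-10_2.11:2.4.5 targets Spark 2.4.5 built against Scala 2.11 and must match the cluster's Spark and Scala versions. Redirecting stdout to console_output captures what the console sink prints for each micro-batch.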
Overall steps taken to solve the project:
The Python code which processes the streaming data from the Kafka producer has the following logic:
- Step 1: Import Spark libraries
- Step 2: Create Spark session
- Step 3: Declare and implement UDFs (helper functions) to calculate additional columns.
The following UDFs were created:
o get_total_cost: Calculates the total cost of the items in an order
o get_total_items: Calculates the total items in an order
o is_order: Determines if the order is an actual order
o is_return: Determines if the order is a return
- Step 4: Declare schema to read the data from Kafka Producer
- Step 5: Read orders data from Kafka Producer using the schema
- Step 6: Call UDF functions to add the following columns to the Spark dataframe:
o total_cost: Total cost of an order, arrived at by summing up the cost of all products in that invoice
o total_items: Total number of items present in an order
o is_order: This flag denotes whether an order is a new order or not. If this invoice is for a return order, the value should be 0.
o is_return: This flag denotes whether an order is a return order or not. If this
invoice is for a new sales order, the value should be 0.
- Step 7: Write the input to the console_output file generated for each one-minute window.
- Step 8: Calculate the following KPIs for each 1-minute window with a 10-minute watermark, using aggregation functions available in Spark SQL functions:
o Total volume of sales – Total sales made in a 1-minute window
o OPM (orders per minute) – Total orders made in a 1-minute window
o Rate of return – The rate of returns in a 1-minute window
o Average transaction size – Average transaction size in terms of sales volume, total orders and total returns in a 1-minute window
- Step 9: Write KPIs calculated based on time window to the time_kpi directory on HDFS
- Step 10: Calculate the following KPIs for each 1-minute window with a 10-minute watermark on a per-country basis, using aggregation functions available in Spark SQL functions:
o Total volume of sales – Total sales made in a 1-minute window, grouped by country
o OPM (orders per minute) – Total orders made in a 1-minute window, grouped by country
o Rate of return – The rate of returns in a 1-minute window, grouped by country
- Step 11: Write KPIs calculated on a per-country basis to the country_kpi directory on HDFS
Python Code:
Step 1: Import Spark libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
Step 2: Create Spark session
# Create a Spark Session to process Streaming data
spark = SparkSession \
.builder \
.appName("StructuredSocketRead") \
.getOrCreate()
spark.sparkContext.setLogLevel('ERROR')
Step 3: Define UDFs to calculate additional columns
# 1. UDF to get total cost of an order
def get_total_cost(order_type, items):
total_cost = 0
for item in items:
total_cost = total_cost + (item['unit_price'] * item['quantity'])
if order_type == "ORDER":
return total_cost
else:
return total_cost * (-1)
total_cost_udf = udf(get_total_cost, FloatType())
# 2. UDF to get total items in an order
def get_total_items(items):
total_items = 0
for item in items:
total_items = total_items + item['quantity']
return total_items
total_items_udf = udf(get_total_items, IntegerType())
# 3. UDF to determine if the order is an actual order
def is_order(order_type):
if order_type == "ORDER":
return 1
else:
return 0
is_order_udf = udf(is_order, IntegerType())
# 4. UDF to determine if the order is a return
def is_return(order_type):
if order_type == "RETURN":
return 1
else:
return 0
is_return_udf = udf(is_return, IntegerType())
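Because each UDF wraps a plain Python function, the helpers can be sanity-checked outside Spark. A minimal check on hypothetical sample items (the field names are chosen to match the schema declared in Step 4):
# Quick sanity check of the helper functions on plain Python data
sample_items = [{'unit_price': 2.5, 'quantity': 4}, {'unit_price': 1.0, 'quantity': 2}]
print(get_total_cost("ORDER", sample_items))    # 12.0
print(get_total_cost("RETURN", sample_items))   # -12.0
print(get_total_items(sample_items))            # 6
print(is_order("RETURN"), is_return("RETURN"))  # 0 1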
Step 4: Declare schema to read the data from Kafka Producer
# Schema to read the data from the Kafka Producer
schema = StructType() \
.add("type", StringType()) \
.add("country", StringType()) \
.add("invoice_no",
oice_no", LongType()) \
.add("timestamp", TimestampType()) \
.add("items", ArrayType(StructType() \
.add("SKU", StringType()) \
.add("title", StringType()) \
.add("unit_price", FloatType()) \
.add("quantity", IntegerType())
))
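The schema can be verified independently of Kafka by parsing a single hand-written record as a batch dataframe. A minimal sketch; the sample values below are purely hypothetical:
# Parse one hypothetical record with the schema (offline check, not part of the job)
sample = ['{"type": "ORDER", "country": "United Kingdom", "invoice_no": 154132541653705, '
          '"timestamp": "2020-09-18 10:55:23", "items": [{"SKU": "21485", '
          '"title": "HAND WARMER RED POLKA DOT", "unit_price": 4.95, "quantity": 6}]}']
spark.createDataFrame(sample, StringType()) \
    .select(from_json('value', schema).alias("value")) \
    .select("value.*") \
    .show(truncate=False)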
Step 5: Read orders data from Kafka Producer using the schema
# Read data from Kafka Producer
orders = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers","18.211.252.152:9092") \
.option("subscribe","real-time
time-project") \
.load()
# Wrangle the input data into the right columns
orders = orders.selectExpr("cast(value as string)") \
.select(from_json('value', schema).alias("value")) \
.select("value.*")
Step 6: Call UDFs to add additional columns to the Spark dataframe
# Calculate the following columns to help with the data analysis
# 1. total_cost: Total cost of an order, arrived at by summing up the cost of all products in that invoice
# 2. total_items: Total number of items present in an order
# 3. is_order: This flag denotes whether an order is a new order or not. If this invoice is for a return order, the value should be 0.
# 4. is_return: This flag denotes whether an order is a return order or not. If this invoice is for a new sales order, the value should be 0.
orders = orders.withColumn("total_cost", total_cost_udf(orders.type, orders.items))
orders = orders.withColumn("total_items", total_items_udf(orders.items))
orders = orders.withColumn("is_order", is_order_udf(orders.type))
orders = orders.withColumn("is_retu
orders.withColumn("is_return", is_return_udf(orders.type))
Step 7: Write the input to the console_output file generated for each one-minute window
# Calculating additional columns and writing the summarised input table to the console
orders_console = orders \
.select("invoice_no", "country", "timestamp", "total_cost", "total_items", "is_order",
"is_return") \
.writeStream \
.outputMode("append") \
.format("console") \
.option("truncate", "False") \
.trigger(processingTime="1 minute") \
.start()
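With outputMode("append") and a one-minute processing-time trigger, the console sink prints only the newly arrived rows once per minute; truncation is disabled so wide values are not cut off.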
Step 8: Calculate the time-based KPIs for each 1-minute window
# Calculating the following time-based KPIs with a watermark of 10 minutes and a tumbling window of 1 minute
# 1. Total volume of sales
# 2. OPM (orders per minute)
# 3. Rate of return
# 4. Average transaction size
orders_time_based_kpi = orders \
.withWatermark("timestamp","10 minutes") \
.groupby(window("timestamp", "1 minute")) \
.agg(count("invoice_no").alias("OPM"),
sum("total_cost").alias("total_sale_volume"),
sum("is_order").alias("total_o
sum("is_order").alias("total_orders"),
sum("is_return").alias("total_returns")) \
.select("window","OPM","total_sale_volume","total_orders","total_returns")
orders_time_based_kpi = orders_time_based_kpi.withColumn("rate_of_return",
(orders_time_based_kpi.total_returns /(o
/(orders_time_based_kpi.total_orders +
orders_time_based_kpi.total_returns)))
orders_time_based_kpi = orders_time_based_kpi.withColumn("average_transaction_size",
(orders_time_based_kpi.total_sale_volume /(orders_time_based_kpi.total_orders +
orders_time_based_kpi.total_returns)))
d_kpi.total_returns)))
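To make the two derived columns concrete, here is a worked example on hypothetical window totals (the numbers are illustrative only):
# Hypothetical window totals, purely for illustration
total_orders, total_returns, total_sale_volume = 90, 10, 2000.0
rate_of_return = total_returns / (total_orders + total_returns)                 # 0.1
average_transaction_size = total_sale_volume / (total_orders + total_returns)   # 20.0
Note that OPM, computed as count("invoice_no"), counts every invoice in the window, returns included; total_orders and total_returns carry the split between the two.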
Step 9: Write KPIs calculated based on time window to time_kpi directory on HDFS
# Write time based KPI values to JSON files
time_based_kpi = orders_time_based_kpi \
.select("window", "OPM", "total_sale_volume", "rate_of_return",
"average_transaction_size") \
.writeStream \
.format("json") \
.outputMode("append") \
.option("truncate", "false") \
.option("path", "time_kpi/") \
.option("checkpointLocation", "time_kpi/checkpoint/") \
.trigger(processingTime="1 minute") \
.start()
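Once a few micro-batches have landed, the sink output can be inspected from a separate Spark session. A minimal check, assuming the same HDFS home directory:
# Read the time-based KPI files back from HDFS (run after files have been written)
spark.read.json("time_kpi/").show(truncate=False)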
Step 10: Calculate the country-based KPIs for each 1-minute window with a 10-minute watermark
# Calculating the following country-based KPIs with a watermark of 10 minutes and a tumbling window of 1 minute
# 1. Total volume of sales
# 2. OPM (orders per minute)
# 3. Rate of return
orders_country_based_kpi = orders \
.withWatermark("timestamp","10 minutes") \
.groupby(window("timestamp", "1 minute"), "country") \
.agg(count("invoice_no").alias("OPM"),
sum("total_cost").alias("tota
sum("total_cost").alias("total_sale_volume"),
sum("is_order").alias("total_orders"),
sum("is_return").alias("total_returns")) \
.select("window", "country", "OPM","total_sale_volume","total_orders","total_returns")
orders_country_based_kpi = orders_country_based_kpi.withColumn("rate_of_return",
(orders_country_based_kpi.total_returns / (orders_country_based_kpi.total_orders +
orders_country_based_kpi.total_returns)))
Step 11: Write KPIs calculated on a per-country basis to the country_kpi directory on HDFS
# Write country-based KPI values to JSON files
country_based_kpi = orders_country_based_kpi \
.select("window", "country", "OPM", "total_sale_volume", "rate_of_return") \
.writeStream \
.format("json") \
.outputMode("append") \
.option("truncate", "false") \
.option("path",
"path", "country_kpi/") \
.option("checkpointLocation", "country_kpi/checkpoint/") \
.trigger(processingTime="1 minute") \
.start()
country_based_kpi.awaitTermination()
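awaitTermination() blocks the driver on the last query, while the two earlier queries keep running in the background. If the intent is instead to block until any of the three active queries stops, Spark offers an alternative:
# Alternative: block until any active streaming query terminates
spark.streams.awaitAnyTermination()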