PART II
First of all, we are going to import all of the required libraries, then load the data, and then perform visualizations and analysis on it.
The Python code through which we produced all of these visualizations and analyses is shown below. The notebook file is also attached; double-click it to open or save it (the file extension is .ipynb).
anjaliassignmnet.ipynb
IMPORTING LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
READING DATA FROM FILES
df=pd.read_excel('Sample - Superstore.xls')
df.head()
CHECKING ROWS AND COLUMNS OF DATA
df.shape
COLUMN NAMES OF TABLE
df.columns
CHECKING DATA TYPES OF COLUMNS
df.dtypes
CHECKING NULL VALUES
df.isnull().sum()
Dropping the Row ID column and assigning the result back to df
df=df.drop('Row ID',axis=1)
df.head()
NOTE : Clearly the data is for the US only, so we can drop the
'Country' column as we don't need any analysis to be done based on it.
df['Country'].value_counts()
#dropping Country column
df=df.drop('Country',axis=1)
df.head()
WE CAN FURTHER ANALYZE THE DATA IN 3 DIFFERENT WAYS
1. PRODUCT LEVEL ANALYSIS
2. CUSTOMER LEVEL ANALYSIS
3. ORDER LEVEL ANALYSIS
df['Category'].unique()
#number of products in each category
df['Category'].value_counts()
#number of sub-categories the products are divided into
df['Sub-Category'].nunique()
#number of products in each sub-category
df['Sub-Category'].value_counts()
FIRST VISUALIZATION
plt.figure(figsize=(12,10))
df['Sub-Category'].value_counts().plot.pie(autopct="%1.1f%%")
plt.show()
NOTE: The highest profit is earned in Copiers, while the selling prices for
Chairs and Phones are extremely high compared to other products.
Another interesting fact: people don't prefer to buy Tables and
Bookcases from the Superstore, hence these departments run at a loss.
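The pie chart above only shows order counts per sub-category, so as a quick sanity check of the profit and sales claims, a minimal sketch (assuming the same df as above; subcat_summary is an illustrative name) could aggregate both measures per sub-category:
#sketch: total sales and profit per sub-category, to back up the note above
subcat_summary = df.groupby('Sub-Category')[['Sales','Profit']].sum()
subcat_summary.sort_values('Profit', ascending=False).head()
subcat_summary.sort_values('Sales', ascending=False).head()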
SECOND
df.groupby('Sub-Category')[['Profit','Sales']].agg(['sum']).plot.bar()
plt.title('Total Profit and Sales per Sub-Category')
# plt.legend('Profit')
# plt.legend('Sales')
plt.show()
THIRD
DISTRIBUTION OF TOP 10 PRODUCTS.
plt.figure(figsize=(12,10))
df['Product Name'].value_counts().head(10).plot.pie(autopct="%1.1f%%")
NOTE : People residing in the Western part of the US tend to order more
from the Superstore.
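This observation can be verified directly with a region-wise order count; a minimal sketch, assuming the same df as above:
#sketch: number of orders placed per region
df['Region'].value_counts()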
FOURTH :
Count of Sub-Category region wise
plt.figure(figsize=(15,8))
sns.countplot(x="Sub-Category", hue="Region", data=df)
plt.show()
To understand the data better, let's create some new
columns such as Cost and Profit %.
df['Cost']=df['Sales']-df['Profit']
df['Cost'].head()
df['Profit %']=(df['Profit']/df['Cost'])*100
#Profit Percentage of first 5 product names
df.iloc[[0,1,2,3,4],[14,20]]
#Products with high Profit Percentage
df.sort_values(['Profit %','Product Name'],ascending=False).groupby('Profit %').head(5)
LET'S LOOK AT THE DATA AT THE CUSTOMER LEVEL
df['Customer ID'].nunique()
#Top 10 customers who order frequently
df_top10=df['Customer Name'].value_counts().head(10)
df_top10
fig=plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
sns.countplot(x='Segment', data=df, ax=ax)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.15, p.get_height()+1))
plt.show()
#Top 20 Customers who benefitted the store
sortedTop20 = df.sort_values(['Profit'], ascending=False).head(20)
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
p = sns.barplot(x='Customer Name', y='Profit', hue='State', palette='Set1', data=sortedTop20, ax=ax)
ax.set_title("Top 20 profitable Customers")
ax.set_xticklabels(p.get_xticklabels(), rotation=75)
plt.tight_layout()
plt.show()
Let's do some analysis with the order details of the data.
#number of unique orders
df['Order ID'].nunique()
#Calculating the time taken for an order to ship and converting the no. of days to int format
df['Shipment Duration']=(pd.to_datetime(df['Ship Date'])-pd.to_datetime(df['Order Date'])).dt.days
df['Shipment Duration']
df.iloc[:,[0,3,21]]
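As an optional follow-up (not part of the original notebook), the new column can also be summarized per ship mode to see how shipment speed differs:
#sketch: average shipment duration (in days) per ship mode
df.groupby('Ship Mode')['Shipment Duration'].mean()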
Let's find out some more details about each customer, such as the total
number of products purchased, the products they purchase, the first purchase date,
the last purchase date, and the locations from which the customer placed orders.
#creating function and appending customer and order info to it
def agg_customer(x):
    d = []
    d.append(x['Order ID'].count())
    d.append(x['Sales'].sum())
    d.append(x['Profit %'].mean())
    d.append(pd.to_datetime(x['Order Date']).min())
    d.append(pd.to_datetime(x['Order Date']).max())
    d.append(x['Product Name'].unique())
    d.append(x['City'].unique())
    return pd.Series(d, index=['#Purchases','Total_Sales','Average Profit % gained','First_Purchase_Date','Latest_Purchase_Date','Products Purchased','Location_Count'])
#grouping based on Customer ID and applying the function we created above
df_agg = df.groupby('Customer ID').apply(agg_customer)
df_agg
#extracting the year of order
df['order year']=df['Order Date'].dt.year
df['order year'].head()
#Calculating the Profit % gained in each Sub-Category per year
fig=plt.figure(figsize=(16,8))
ax = fig.add_subplot(111)
sns.barplot(x='order year', y='Profit %', hue='Sub-Category', palette='Paired', data=df, ax=ax)
for o in ax.patches:
    ax.annotate('{:.0f}'.format(o.get_height()), (o.get_x()+0.15, o.get_height()+1))
plt.show()
NOTE : The sales of the store have increased every year, resulting
in a high profit margin by the end of 2017.
#Sales per year
df.groupby('order year')[['Sales','Profit %']].agg(['sum']).plot.bar()
plt.title('Year wise Total Sales & % of profit gained')
Predictive forecasting model
The complete code is available in the file below; double-click the icon to open it (the file extension is .ipynb).
MODELING.ipynb
IMPORTING LIBRARIES
import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import pandas as pd
import statsmodels.api as sm
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'
There are several categories in the Superstore sales data; we start with
time series analysis and forecasting for furniture sales.
df = pd.read_excel('Sample - Superstore.xls')
furniture = df.loc[df['Category'] == 'Furniture']
We have a good four years of furniture sales data.
furniture['Order Date'].min(), furniture['Order Date'].max()
Data Preprocessing
This step includes removing columns we do not need, checking for missing
values, aggregating sales by date, and so on.
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
'Quantity', 'Discount', 'Profit']
furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')
furniture.isnull().sum()
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()
Indexing with Time Series Data
furniture = furniture.set_index('Order Date')
furniture.index
Our current datetime data can be tricky to work with; therefore, we will
use the average daily sales value for each month instead, using the start
of each month as the timestamp.
y = furniture['Sales'].resample('MS').mean()
y['2016':]
Visualizing Furniture Sales Time Series Data
y.plot(figsize=(15, 6))
plt.show()
Some distinguishable patterns appear when we plot the data. The time
series has a seasonal pattern: sales are always low at the
beginning of the year and high at the end of the year. There is also an
upward trend within any single year, with a couple of low months in the
middle of the year. We can additionally visualize our data using a method called
time-series decomposition, which allows us to decompose the time series
into three distinct components: trend, seasonality, and noise.
from pylab import rcParams
rcParams['figure.figsize'] = 18, 8
decomposition = sm.tsa.seasonal_decompose(y, model='additive')
fig = decomposition.plot()
plt.show()
The plot above clearly shows that furniture sales are unstable,
along with their obvious seasonality.
Time series forecasting with ARIMA
We are going to apply one of the most commonly used methods for
time-series forecasting, known as ARIMA, which stands for
Autoregressive Integrated Moving Average. ARIMA models are
denoted with the notation ARIMA(p, d, q), where p is the autoregressive
order, d is the degree of differencing (which handles the trend), and q is the
moving-average order; a seasonal term (P, D, Q, s) is added on top of this
to capture the yearly seasonality:
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
This step is parameter selection for our furniture sales ARIMA time
series model. Our goal here is to use a "grid search" to find the optimal
set of parameters that yields the best performance for our model.
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue
The above output suggests that SARIMAX(1, 1, 1)x(1, 1, 0, 12) yields the
lowest AIC value of 297.78. Therefore we should consider this to be the
optimal option.
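Since the loop above only prints every AIC value, a small variant (a sketch; best_aic and best_params are illustrative names) can record the lowest AIC explicitly instead of reading it off the printed output:
#sketch: track the parameter combination with the lowest AIC
best_aic, best_params = np.inf, None
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y, order=param, seasonal_order=param_seasonal,
                                            enforce_stationarity=False, enforce_invertibility=False)
            res = mod.fit()
            if res.aic < best_aic:
                best_aic, best_params = res.aic, (param, param_seasonal)
        except:
            continue
print('Best SARIMAX{}x{} - AIC:{}'.format(best_params[0], best_params[1], best_aic))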
Fitting the ARIMA model
mod = sm.tsa.statespace.SARIMAX(y,
order=(1, 1, 1),
seasonal_order=(1, 1, 0, 12),
enforce_stationarity=False,
enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])
results.plot_diagnostics(figsize=(16, 8))
plt.show()
NOTE : Our model diagnostics suggest that the model residuals
are nearly normally distributed.
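As a numeric companion to the diagnostic plots (a sketch, not in the original notebook), the residuals can also be checked with a normality test from scipy:
#sketch: D'Agostino-Pearson normality test on the model residuals
from scipy import stats
resid = results.resid.dropna()
print(stats.normaltest(resid))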
Validating forecasts
To help us understand the accuracy of our forecasts, we
compare predicted sales to the real sales of the time series, and we
set the forecasts to run from 2017-01-01 to the end of the data.
pred = results.get_prediction(start=pd.to_datetime('2017-01-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = y['2014':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Furniture Sales')
plt.legend()
plt.show()
NOTE : The line plot shows the observed values compared
to the one-step-ahead forecast predictions. Overall, our forecasts align
with the true values very well, showing an upward trend starting
at the beginning of the year and capturing the seasonality
toward the end of the year.
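A simple numeric check of this (a sketch, not in the original write-up) is the mean squared error between the predicted and observed values over the validation window:
#sketch: quantify forecast accuracy on the 2017 validation window
y_forecasted = pred.predicted_mean
y_truth = y['2017-01-01':]
mse = ((y_forecasted - y_truth) ** 2).mean()
print('Mean Squared Error: {}'.format(round(mse, 2)))
print('Root Mean Squared Error: {}'.format(round(np.sqrt(mse), 2)))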
Producing and visualizing forecasts
pred_uc = results.get_forecast(steps=100)
pred_ci = pred_uc.conf_int()
ax = y.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Furniture Sales')
plt.legend()
plt.show()
NOTE : Our model clearly captured furniture sales seasonality.
As we forecast further out into the future, it is natural for us to
become less confident in our values. This is reflected by the
confidence intervals generated by our model, which grow
larger as we move further out into the future.
The above time series analysis for furniture makes us curious
about other categories and how they compare with each
other over time. Therefore, we are going to compare the time
series of furniture and office supplies.
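As a starting point for that comparison, the office supplies series can be prepared the same way as the furniture series; a minimal sketch following the same preprocessing steps (variable names are illustrative):
#sketch: build the Office Supplies monthly series and plot it against furniture
office = df.loc[df['Category'] == 'Office Supplies']
office = office.groupby('Order Date')['Sales'].sum().reset_index().set_index('Order Date')
y_office = office['Sales'].resample('MS').mean()
y.plot(label='Furniture', figsize=(14, 7))
y_office.plot(label='Office Supplies')
plt.legend()
plt.show()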