PART II
First of all, we are going to import all of the required libraries, then load the data, and then perform visualizations and analysis on it.
The Python code through which we produced all of these visualizations and analyses is shown below. The notebook file is also attached; double-click it to open or save it (the file extension is .ipynb).
anjaliassignmnet.ipynb
IMPORTING LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
READING DATA FROM FILES
df=pd.read_excel('Sample - Superstore.xls')
df.head()
CHECKING ROWS AND COLUMNS OF DATA
df.shape
COLUMN NAMES OF TABLE
df.columns
CHECKING DATA TYPES OF COLUMNS
df.dtypes
CHECKING NULL VALUES
df.isnull().sum()
Dropping the Row ID column and assigning the result back to df
df=df.drop('Row ID',axis=1)
df.head()
NOTE : Clearly the data is for the US only, so we can drop the
'Country' column as we don't need any analysis to be done based on it.
df['Country'].value_counts()
#dropping Country column
df=df.drop('Country',axis=1)
df.head()
WE CAN FURTHER ANALYZE THE DATA IN 3 DIFFERENT WAYS
1. PRODUCT LEVEL ANALYSIS
2. CUSTOMER LEVEL ANALYSIS
3. ORDER LEVEL ANALYSIS
df['Category'].unique()
#number of products in each category
df['Category'].value_counts()
#number of sub-categories the products are divided into
df['Sub-Category'].nunique()
#number of products in each sub-category
df['Sub-Category'].value_counts()
FIRST VISUALIZATION
plt.figure(figsize=(12,10))
df['Sub-Category'].value_counts().plot.pie(autopct="%1.1f%%")
plt.show()
NOTE: The highest profit is earned in Copiers, while the selling prices for
Chairs and Phones are extremely high compared to other products.
Another interesting fact: people don't prefer to buy Tables and
Bookcases from the Superstore, hence these departments run at a loss.
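The pie chart above only shows order counts per sub-category, so as a quick sanity check of the profit and sales claims, a minimal sketch (assuming the same df as above; subcat_summary is an illustrative name) could aggregate both measures per sub-category:
#sketch: total sales and profit per sub-category, to back up the note above
subcat_summary = df.groupby('Sub-Category')[['Sales','Profit']].sum()
subcat_summary.sort_values('Profit', ascending=False).head()
subcat_summary.sort_values('Sales', ascending=False).head()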
SECOND
df.groupby('Sub-Category')[['Profit','Sales']].agg(['sum']).plot.bar()
plt.title('Total Profit and Sales per Sub-Category')
# plt.legend('Profit')
# plt.legend('Sales')
plt.show()
THIRD
DISTRIBUTION OF TOP 10 PRODUCTS.
plt.figure(figsize=(12,10))
df['Product Name'].value_counts().head(10).plot.pie(autopct="%1.1f%%")
NOTE : People residing in the Western part of the US tend to order more
from the Superstore.
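This observation can be verified directly with a region-wise order count; a minimal sketch, assuming the same df as above:
#sketch: number of orders placed per region
df['Region'].value_counts()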
FOURTH :
Count of Sub-Category region wise
plt.figure(figsize=(15,8))
sns.countplot(x="Sub-Category", hue="Region", data=df)
plt.show()
To understand the data better, let's create some new
columns such as Cost and Profit %.
df['Cost']=df['Sales']-df['Profit']
df['Cost'].head()
df['Profit %']=(df['Profit']/df['Cost'])*100
#Profit Percentage of first 5 product names
df.iloc[[0,1,2,3,4],[14,20]]
#Products with high Profit Percentage
df.sort_values(['Profit %','Product Name'],ascending=False).groupby('Profit %').head(5)
LET'S LOOK AT THE DATA AT THE CUSTOMER LEVEL
df['Customer ID'].nunique()
#Top 10 customers who order frequently
df_top10=df['Customer Name'].value_counts().head(10)
df_top10
fig=plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
sns.countplot(x='Segment', data=df, ax=ax)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.15, p.get_height()+1))
plt.show()
#Top 20 Customers who benefitted the store
sortedTop20 = df.sort_values(['Profit'], ascending=False).head(20)
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
p = sns.barplot(x='Customer Name', y='Profit', hue='State', palette='Set1', data=sortedTop20, ax=ax)
ax.set_title("Top 20 profitable Customers")
ax.set_xticklabels(p.get_xticklabels(), rotation=75)
plt.tight_layout()
plt.show()
Let's do some analysis with the order details of the data.
#number of unique orders
df['Order ID'].nunique()
#Calculating the time taken for an order to ship and converting the no. of days to int format
df['Shipment Duration']=(pd.to_datetime(df['Ship Date'])-pd.to_datetime(df['Order Date'])).dt.days
df['Shipment Duration']
df.iloc[:,[0,3,21]]
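As an optional follow-up (not part of the original notebook), the new column can also be summarized per ship mode to see how shipment speed differs:
#sketch: average shipment duration (in days) per ship mode
df.groupby('Ship Mode')['Shipment Duration'].mean()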
Let's find out some more details about each customer, such as the total
number of products purchased, the products they purchase, the first purchase date,
the last purchase date, and the locations from which the customer placed orders.
#creating function and appending customer and order info to it
def agg_customer(x):
    d = []
    d.append(x['Order ID'].count())
    d.append(x['Sales'].sum())
    d.append(x['Profit %'].mean())
    d.append(pd.to_datetime(x['Order Date']).min())
    d.append(pd.to_datetime(x['Order Date']).max())
    d.append(x['Product Name'].unique())
    d.append(x['City'].unique())
    return pd.Series(d, index=['#Purchases','Total_Sales','Average Profit % gained','First_Purchase_Date','Latest_Purchase_Date','Products Purchased','Location_Count'])
#grouping based on Customer ID and applying the function we created above
df_agg = df.groupby('Customer ID').apply(agg_customer)
df_agg
#extracting the year of order
df['order year']=df['Order Date'].dt.year
df['order year'].head()
#Calculating the Profit % gained in each Sub-Category per year
fig=plt.figure(figsize=(16,8))
ax = fig.add_subplot(111)
sns.barplot(x='order year', y='Profit %', hue='Sub-Category', palette='Paired', data=df, ax=ax)
for o in ax.patches:
    ax.annotate('{:.0f}'.format(o.get_height()), (o.get_x()+0.15, o.get_height()+1))
plt.show()
NOTE : The sales of the store have increased every year, resulting
in a high profit margin by the end of 2017.
#Sales per year
df.groupby('order year')[['Sales','Profit %']].agg(['sum']).plot.bar()
plt.title('Year wise Total Sales & % of profit gained')
Predictive forecasting model
The complete code is available in the file below; double-click the icon to open it (the file extension is .ipynb).
MODELING.ipynb
IMPORTING LIBRARIES
import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import pandas as pd
import statsmodels.api as sm
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'
There are several categories in the Superstore sales data; we start with
time series analysis and forecasting for furniture sales.
df = pd.read_excel('Sample - Superstore.xls')
furniture = df.loc[df['Category'] == 'Furniture']
We have a good four years of furniture sales data.
furniture['Order Date'].min(), furniture['Order Date'].max()
Data Preprocessing
This step includes removing columns we do not need, checking for missing
values, aggregating sales by date, and so on.
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
'Quantity', 'Discount', 'Profit']
furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')
furniture.isnull().sum()
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()
Indexing with Time Series Data
furniture = furniture.set_index('Order Date')
furniture.index
Our current datetime data can be tricky to work with; therefore, we will
use the average daily sales value for each month instead, using the start
of each month as the timestamp.
y = furniture['Sales'].resample('MS').mean()
y['2016':]
Visualizing Furniture Sales Time Series Data
y.plot(figsize=(15, 6))
plt.show()
Some distinguishable patterns appear when we plot the data. The time
series has a seasonal pattern: sales are always low at the
beginning of the year and high at the end of the year. There is also an
upward trend within any single year, with a couple of low months in the
middle of the year. We can additionally visualize our data using a method called
time-series decomposition, which allows us to decompose the time series
into three distinct components: trend, seasonality, and noise.
from pylab import rcParams
rcParams['figure.figsize'] = 18, 8
decomposition = sm.tsa.seasonal_decompose(y, model='additive')
fig = decomposition.plot()
plt.show()
The plot above clearly shows that furniture sales are unstable,
along with their obvious seasonality.
Time series forecasting with ARIMA
We are going to apply one of the most commonly used methods for
time-series forecasting, known as ARIMA, which stands for
Autoregressive Integrated Moving Average. ARIMA models are
denoted with the notation ARIMA(p, d, q), where p is the autoregressive
order, d is the degree of differencing (which handles the trend), and q is the
moving-average order; a seasonal term (P, D, Q, s) is added on top of this
to capture the yearly seasonality:
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
This step is parameter selection for our furniture sales ARIMA time
series model. Our goal here is to use a "grid search" to find the optimal
set of parameters that yields the best performance for our model.
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue
The above output suggests that SARIMAX(1, 1, 1)x(1, 1, 0, 12) yields the
lowest AIC value of 297.78. Therefore we should consider this to be the
optimal option.
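Since the loop above only prints every AIC value, a small variant (a sketch; best_aic and best_params are illustrative names) can record the lowest AIC explicitly instead of reading it off the printed output:
#sketch: track the parameter combination with the lowest AIC
best_aic, best_params = np.inf, None
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y, order=param, seasonal_order=param_seasonal,
                                            enforce_stationarity=False, enforce_invertibility=False)
            res = mod.fit()
            if res.aic < best_aic:
                best_aic, best_params = res.aic, (param, param_seasonal)
        except:
            continue
print('Best SARIMAX{}x{} - AIC:{}'.format(best_params[0], best_params[1], best_aic))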
Fitting the ARIMA model
mod = sm.tsa.statespace.SARIMAX(y,
order=(1, 1, 1),
seasonal_order=(1, 1, 0, 12),
enforce_stationarity=False,
enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])
results.plot_diagnostics(figsize=(16, 8))
plt.show()
NOTE : Our model diagnostics suggest that the model residuals
are nearly normally distributed.
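As a numeric companion to the diagnostic plots (a sketch, not in the original notebook), the residuals can also be checked with a normality test from scipy:
#sketch: D'Agostino-Pearson normality test on the model residuals
from scipy import stats
resid = results.resid.dropna()
print(stats.normaltest(resid))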
Validating forecasts
To help us understand the accuracy of our forecasts, we
compare predicted sales to the real sales of the time series, and we
set the forecasts to run from 2017-01-01 to the end of the data.
pred = results.get_prediction(start=pd.to_datetime('2017-01-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = y['2014':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Furniture Sales')
plt.legend()
plt.show()
NOTE : The line plot shows the observed values compared
to the one-step-ahead forecast predictions. Overall, our forecasts align
with the true values very well, showing an upward trend starting
at the beginning of the year and capturing the seasonality
toward the end of the year.
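A simple numeric check of this (a sketch, not in the original write-up) is the mean squared error between the predicted and observed values over the validation window:
#sketch: quantify forecast accuracy on the 2017 validation window
y_forecasted = pred.predicted_mean
y_truth = y['2017-01-01':]
mse = ((y_forecasted - y_truth) ** 2).mean()
print('Mean Squared Error: {}'.format(round(mse, 2)))
print('Root Mean Squared Error: {}'.format(round(np.sqrt(mse), 2)))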
Producing and visualizing forecasts
pred_uc = results.get_forecast(steps=100)
pred_ci = pred_uc.conf_int()
ax = y.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Furniture Sales')
plt.legend()
plt.show()
NOTE : Our model clearly captured furniture sales seasonality.
As we forecast further out into the future, it is natural for us to
become less confident in our values. This is reflected by the
confidence intervals generated by our model, which grow
larger as we move further out into the future.
The above time series analysis for furniture makes us curious
about other categories and how they compare with each
other over time. Therefore, we are going to compare the time
series of furniture and office supplies.
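As a starting point for that comparison, the office supplies series can be prepared the same way as the furniture series; a minimal sketch following the same preprocessing steps (variable names are illustrative):
#sketch: build the Office Supplies monthly series and plot it against furniture
office = df.loc[df['Category'] == 'Office Supplies']
office = office.groupby('Order Date')['Sales'].sum().reset_index().set_index('Order Date')
y_office = office['Sales'].resample('MS').mean()
y.plot(label='Furniture', figsize=(14, 7))
y_office.plot(label='Office Supplies')
plt.legend()
plt.show()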