Syllabus
Introduction to Single Variable : Distributions and Variables - Numerical Summaries of
Level and Spread - Scaling and Standardizing - Inequality - Smoothing Time Series.
 
Univariate Analysis
Contents
Introduction to Single Variable
Storing and Importing Data using Python
Numerical Summaries of Level and Spread
Scaling and Standardizing
Time Series and Smoothing Time Series
  
Two Marks Questions with Answers
Data Exploration and Visualization (3-2) Univariate Analysis
3.1 Introduction to Single Variable
* Exploratory data analysis is cross-classified in two different ways: each method is either graphical or non-graphical, and each method is either univariate, bivariate or multivariate.
* Univariate analysis is the simplest analysis of statistical data. The term univariate analysis refers to the analysis of one variable. The prefix "uni" means "one." The purpose of univariate analysis is to understand the distribution of values for a single variable. Univariate analysis explores each variable in a data set separately.
* In other words, in univariate analysis the data has only one variable. It doesn't deal with causes or relationships and its major purpose is to describe data; it takes data, summarizes that data and finds patterns in the data.
* Some ways one can describe patterns found in univariate data include central tendency (mean, mode and median) and dispersion : range, variance, maximum, minimum, quartiles (including the interquartile range) and standard deviation.
* Univariate analysis works by examining the effects of a single variable on a set of data. For example, a frequency distribution table is a form of univariate analysis as frequency is the only variable being measured. Alternative variables may be age, height, weight, etc.; however, as soon as a second variable is introduced it becomes bivariate analysis. With three or more variables, it becomes multivariate analysis.
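The central tendency and dispersion measures named above can be computed with Python's standard `statistics` module. This is a minimal sketch; the `heights` list is made-up illustrative data, not taken from the text.

```python
import statistics

# Hypothetical sample of a single variable (heights in cm); illustrative only
heights = [150, 152, 155, 155, 158, 160, 163, 165, 170, 172]

# Central tendency
mean_h = statistics.mean(heights)
median_h = statistics.median(heights)
mode_h = statistics.mode(heights)

# Dispersion
range_h = max(heights) - min(heights)
variance_h = statistics.variance(heights)         # sample variance (divides by n - 1)
std_h = statistics.stdev(heights)                 # sample standard deviation
q1, q2, q3 = statistics.quantiles(heights, n=4)   # the three quartile cut points
iqr_h = q3 - q1                                   # interquartile range

print(mean_h, median_h, mode_h, range_h, iqr_h)
```

Only one variable is involved, so every number here describes that variable alone; nothing is said about relationships to other variables.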
3.1.1 Univariate Statistics
* Univariate analysis can be performed in a statistical setting. Two types of statistics can be used for analysis, namely descriptive and inferential.
Descriptive statistics
* As the name suggests, descriptive statistics are used to describe data. The statistics used here are commonly referred to as summary statistics.
* Descriptive statistics can be used for calculating things like missing value proportions, upper and lower limits for outliers, level of variance through the coefficient of variation, etc.
Inferential statistics
* Often, the data one is dealing with is a subset (sample) of the complete data (population). Thus, the common question here is -
o Can the findings of the sample be extrapolated to the population ? That is, is the sample representative of the population or has the population changed ? Such questions are answered using specific hypothesis tests designed to deal with such univariate data-based problems.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledge
* Hypothesis tests help to answer crucial questions about the data and their relation with the population from where they are drawn. Several hypothesis or univariate testing mechanisms come in handy here, such as -
1. Z Test - Used for numerical (quantitative) data where the sample size is greater than 30 and the population's standard deviation is known.
2. One-Sample t-Test - Used for numerical (quantitative) data where the sample size is less than 30 or the population's standard deviation is unknown.
3. Chi-Square Test - Used with nominal categorical data.
4. Kolmogorov-Smirnov Test - Used with ordinal categorical data.
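As an illustration of the first test in the list, a two-sided one-sample Z test can be sketched with the standard library alone. The function name `one_sample_z_test`, the price data, and the assumed population mean and standard deviation are all hypothetical values chosen for this example.

```python
import math
from statistics import NormalDist, mean

def one_sample_z_test(sample, pop_mean, pop_std):
    """Two-sided one-sample Z test; returns (z statistic, p-value)."""
    n = len(sample)
    # Standard error of the mean uses the known population standard deviation
    z = (mean(sample) - pop_mean) / (pop_std / math.sqrt(n))
    # Probability of a value at least this extreme under the standard normal
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 36 observed prices; assumed population mean 40 and std 3
prices = [41, 42, 40, 43, 41, 42] * 6
z, p = one_sample_z_test(prices, pop_mean=40, pop_std=3)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests that the sample mean differs from the assumed population mean by more than sampling variation alone would explain.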
The common methods for performing univariate analysis are :
1. Summary statistics
2. Frequency distributions
3. Charts
4. Univariate tables.
1. Summary statistics
The most common way to perform univariate analysis is to use summary statistics to describe a variable. There are three kinds of summary statistics :
1. Measures of central tendency - These values describe where the dataset's center or middle value is located. The mean, mode and median are examples.
2. Dispersion measures - These numbers describe how evenly distributed the values are in the dataset. The range, standard deviation and variance are some examples.
3. Measure of shape - The shape of the data distribution can explain a great deal about the data as the shape can help in identifying the type of distribution followed by the data. Each of these distributions has specific properties that can be used to one's advantage. By analyzing the shapes, one will know if the data is symmetrical, non-symmetrical, left or right-skewed, is suffering from positive or negative kurtosis, among other things.
2. Frequency distributions
* A frequency distribution describes how frequently different values occur in a dataset. This acts as another way to perform univariate analysis.
3. Charts
Another method for performing univariate analysis is to create charts that show the distribution of values for a specific variable.
Various types of graphs can be used to understand data. The standard types of graphs include -
1. Histogram : A histogram displays the frequency of each value or group of values (bins) in numerical data. This helps in understanding how the values are distributed.
2. Boxplot : A boxplot provides several important pieces of information such as the minimum, maximum, median, 1st and 3rd quartiles. It is beneficial in identifying outliers in the data.
3. Density curve : The density curve helps in understanding the shape of the data's distribution. It helps answer questions such as whether the data is bimodal, normally distributed, skewed, etc.
4. Bar chart : Bar charts, mainly frequency bar charts, are univariate charts used to find the frequency of the different categories of categorical data.
5. Pie chart : Frequency pie charts convey similar information to bar charts. The difference is that they have a circular formation with each slice indicating the share of each category in the data.
4. Univariate tables
Tables help in univariate analysis and are typically used with categorical data or numerical data with limited cardinality. Different types of tables include :
1. Frequency tables : Each unique value and its respective frequency in the data is shown through a table. Thus, it summarizes the frequency the way a histogram, frequency bar or pie chart does but in a tabular manner.
2. Grouped tables : Rather than finding the count of each unique value, the values are binned or grouped and the frequency of each group is reflected in the table. It is typically used for numerical data with high cardinality.
3. Percentage (Proportion) tables : Rather than showing the frequency of the unique values (or groups), such a table shows their proportion in the data (in percentage).
4. Cumulative proportion tables : It is similar to the proportion table, with the difference being that the proportion is shown cumulatively. It is typically used with binned data having a distinct order (or with categorical ordinal data).
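The frequency, percentage and cumulative tables described above can be built in a few lines. The sketch below uses `collections.Counter` on a made-up list of categories; the data and column labels are assumptions for illustration. Because these categories have no natural order, the cumulative column here simply follows descending frequency.

```python
from collections import Counter

# Hypothetical categorical observations (illustrative only)
sports = ["Cricket", "Tennis", "Cricket", "Badminton", "Cricket", "Tennis"]

counts = Counter(sports)
total = sum(counts.values())

cumulative = 0.0
table = []
# most_common() orders the categories by descending frequency
for category, freq in counts.most_common():
    pct = 100 * freq / total
    cumulative += pct
    table.append((category, freq, round(pct, 1), round(cumulative, 1)))

print(f"{'Category':<10} {'Freq':>4} {'%':>6} {'Cum %':>6}")
for category, freq, pct, cum in table:
    print(f"{category:<10} {freq:>4} {pct:>6} {cum:>6}")
```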
 
3.1.2 Variable and Distribution in Univariate Analysis
* A variable in univariate analysis is a condition or subset that data falls into. A variable can be thought of as a "category." For example, the analysis might work on a variable "height" or it might work on "weight". Univariate analysis can be carried out on any of the individual variables in the dataset to gain a better understanding of its distribution of values.
Univariate data examples
* The salaries of employees in a specific industry; the variable in this example is the employees' salaries.
* The heights of ten students in a class are measured; the variable here is the students' heights.
* A veterinarian wants to weigh 20 cats; the variable, in this case, is the weight of the cats.
o Finding the average height of a country's men from a sample.
o Calculating how reliable a batsman is by calculating the variance of their runs.
o Finding which country most frequently wins Olympic gold medals by creating a frequency bar chart or frequency table.
o Understanding the income distribution of a county by analyzing the distribution's shape. A right-skewed distribution can indicate an unequal society.
o Checking if the price of sugar has risen statistically significantly from the generally accepted price by using sample survey data. Hypothesis tests such as the Z or t-test solve such questions.
o Assessing the predictive capability of a variable by calculating the coefficient of variation.
Distribution and variables
Types of variables : Variables can be one of two types : Categorical or numerical.
Categorical data
* Categorical data classify items into groups. This type of data can be further broken down into nominal, ordinal and binary values.
o Ordinal values have a set order. An example here could be a ranking of low to high.
o Nominal values have no set order. Examples include the superhero's gender and alignment.
o Binary data has only two values. This could be represented as true / false or 1/0.
* A common way to summarize categorical variables is with a frequency table,
* Columns holding categorical data : Gender, Married, BankCustomer, Indus
Ethnicity, PriorDefault, Employed, DrivingLicense, Citizen, Approved.
Numerical data
* Numerical data are values that one can perform mathematical operations on. They are further broken down into continuous and discrete data types.
o Discrete variables have to be an integer. An example is the number of superheroes.
o Continuous variables can be any value. Examples here include height and weight.
* Numerical data can be visualized with a histogram. Histograms are a great first analysis of continuous data. Four main aspects to consider here are shape, center, spread and outliers.
o Shape is the overall appearance of the histogram. It can be symmetric, skewed, uniform or have multiple peaks.
o Center refers to the mean or median.
o Spread refers to the range or how far the data reaches.
o Outliers are data points that fall far from the bulk of the data.
* Columns holding numerical and continuous data : Age, Debt, YearsEmployed, CreditScore, Income.
3.2 Storing and Importing Data using Python
* There are various methods to import data in Python. One of the ways to import data is using the Pandas library.
* In its simplest form, data can be stored in a CSV file. CSV stands for "Comma Separated Values." It is the simplest form of storing data in tabular form as plain text.
* The CSV file structure is very simple: the first line of a CSV file is the header and contains the names of the fields/features separated by commas. After the header, each line of the file is a data set value/observation/record. The values of a record are separated by commas.
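The header-plus-records structure can be seen directly by parsing a small CSV string with Python's built-in `csv` module. This is only a sketch: the two records are sample rows and `io.StringIO` stands in for a file on disk, so no real path is assumed.

```python
import csv
import io

# Small CSV text; the first line is the header, each later line is one record
csv_text = "Name,City,Plays\nLucky,Pune,Cricket\nAniket,Mumbai,Cyclist\n"

# io.StringIO lets the text be read like a file, so no path on disk is needed
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

print(reader.fieldnames)   # field names taken from the header line
for record in records:
    print(record["Name"], record["City"], record["Plays"])
```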
Steps to import using Pandas
1. Get the correct and full file path
* Firstly, capture the full path where the CSV file is stored.
* For example, a CSV file is stored under the following path
(ES Piresk Taata\mytoamicev Bae RS
  
* File name - It should be made sure that the file name specified in the code matches with the actual file name.
* File extension - The file extension should always be '.csv' when importing CSV files.
Example program - 1
# Python code to import the file
import pandas as pd
# read the csv file (put 'r' before the path string to address any special
# characters in the path, such as \)
Sspaneadecey D\ Tech: det=\uy ean oe
print(df)
Example program - 1 Output
Name     City        plays
Lucky    Pune        Cricket
Aniket   Mumbai      Cyclist
Lahu     Kolhapur    Tennis
Shital   Amarawati   Badminton
Nashik Cricket
 
‘Monit Ratnagiri Tennis
Jaya “pnagar Badminton’
“Dev © paghangari Cyclist
“anita Sasi acon:
Badminton!
 
 
 
/ Rashmi ©) Dhule eat
3.3 Numerical Summaries of Level and Spread
* The features of the distribution for a quantitative variable that are of primary interest are the center of the distribution, the amount of dispersion in the distribution and the shape of the distribution. A numerical summary is a number used to describe a specific characteristic about a data set.
Below are some of the useful numerical summaries :
o Center : Mean, median, mode
o Quantiles : Percentiles, five number summaries
o Spread : Standard deviation, variance, interquartile range
o Outliers
o Shape : Skewness, kurtosis
o Concordance : Correlation, quantile-quantile plots.
Mean
* This is the point of balance, describing the most typical value for normally distributed data. The qualifier "normally distributed" matters because the mean is highly influenced by outliers.
* The mean adds up all the data values and divides by the total number of values, as follows :
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
* The 'x-bar' is used to represent the sample mean (the mean of a sample of data). The Σ (sigma) implies the addition of all values from 'i=1' until 'i=n' ('n' is the number of data values). The result is then divided by 'n'.
Median
* This is the "middle data point", where half of the data is below the median and half is above the median. It's the 50th percentile of the data. It's also mostly used with skewed data because outliers won't have a big effect on the median.
* There are two formulas to compute the median. The choice of which formula to use depends on whether n (the number of data points in the sample, or sample size) is even or odd. Here $x_{(k)}$ denotes the k-th smallest value.
$$\text{Median} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$
When n is even, there is no "middle" data point, so the middle two values are averaged.
$$\text{Median} = x_{((n+1)/2)}$$
When n is odd, the middle data point is the median.
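The odd and even rules can be written out directly. The following is a small sketch (the function name `median` and the sample values are chosen for illustration); the result can be cross-checked against `statistics.median` from the standard library.

```python
import statistics

def median(values):
    """Median using the odd/even rules: middle value, or mean of the two middle values."""
    x = sorted(values)   # order statistics x(1) <= ... <= x(n)
    n = len(x)
    if n == 0:
        raise ValueError("median of empty data")
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]           # x((n+1)/2), shifted to a 0-based index
    return (x[n // 2 - 1] + x[n // 2]) / 2   # average of x(n/2) and x(n/2 + 1)

odd_sample = [7, 1, 5]       # illustrative values
even_sample = [7, 1, 5, 3]
print(median(odd_sample), median(even_sample))
```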
Mode
* The mode returns the most commonly occurring data value.
Percentile
* The percent of data that is equal to or less than a given data point. It's useful for describing where a data point stands within the data set. If the percentile is close to zero, then the observation is one of the smallest. If the percentile is close to 100, then the data point is one of the largest in the data set.
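Read literally, the percentile of a data point is the share of observations that are less than or equal to it. A minimal sketch of that definition follows; the helper name `percentile_rank` and the scores are hypothetical.

```python
def percentile_rank(data, value):
    """Percent of observations that are less than or equal to `value`."""
    if not data:
        raise ValueError("data must not be empty")
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)

scores = [35, 42, 50, 58, 61, 67, 70, 74, 88, 95]   # illustrative values
print(percentile_rank(scores, 35))   # smallest observation
print(percentile_rank(scores, 95))   # largest observation
```

With only ten observations the smallest value sits at the 10th percentile rather than near zero; the "close to zero" reading applies to larger samples.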
Quartiles (Five-number summary)
* Quartiles measure the center and are also great for describing the spread of the data. They are highly useful for skewed data. There are four quartiles and they compose the five-number summary (combined with the minimum). The five-number summary is composed of :
1. Minimum
2. 25th percentile (lower quartile)
3. 50th percentile (median)
4. 75th percentile (upper quartile)
5. 100th percentile (maximum)
Standard deviation
* Standard deviation is extensively used in statistics and data science. It measures the amount of variation or dispersion of a data set, calculating how spread out the data are from the mean. Small values mean the data is consistent and close to the mean. Larger values indicate the data is highly variable.
* Deviation : The idea is to use the mean as a reference point from which everything varies. A deviation is defined as the distance an observation lies from the reference point. This distance is obtained by subtracting the mean (x-bar) from the data point (x_i).
$$\text{Deviation}_i = x_i - \bar{x}$$
* Calculating the standard deviation : The average of all the deviations will always turn out to be zero, so one can square each deviation and sum up the results. Then, one can divide it by 'n - 1' (called degrees of freedom). Further, square root the final result to undo the squaring of the deviations.
$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$
* The standard deviation is a representation of all deviations in the data. It's never negative and it's zero only if all the values are the same.
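The calculation can be followed step by step on a small sample. The sketch below uses made-up values in a list called `runs` and compares the hand-computed result with `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

runs = [42, 55, 38, 61, 54]   # illustrative values

n = len(runs)
mean_runs = sum(runs) / n

deviations = [x - mean_runs for x in runs]   # x_i minus x-bar
squared = [d ** 2 for d in deviations]       # squaring removes the signs

variance = sum(squared) / (n - 1)   # divide by the degrees of freedom
std_dev = math.sqrt(variance)       # square root undoes the squaring

print(sum(deviations))   # the raw deviations cancel out
print(variance, std_dev)
```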
Variance
* Variance is almost the same calculation as the standard deviation, but it stays in squared units. So, taking the square root of the variance gives the standard deviation. Note that it's represented by 's-squared', while the standard deviation is represented by 's'.
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
Range
* The difference between the maximum and minimum values. Useful for some basic exploratory analysis, but not as powerful as the standard deviation.
$$\text{Range} = x_{\max} - x_{\min}$$
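Proportion
* It's often referred to as "percentage". Defines the percent of observations in the data set that satisfy some requirements.
$$p = \frac{\sum_{i=1}^{n} x_i}{n}$$
where $x_i$ is 1 if observation i satisfies the requirement and 0 otherwise.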
Correlation
* Defines the strength and direction of the association between two quantitative variables. It ranges between -1 and 1. Positive correlations mean that one variable increases as the other variable increases. Negative correlations mean that one variable decreases as the other increases. When the correlation is zero, there is no linear association at all. The closer the result is to one of the extremes, the stronger the association between the two variables.
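The text does not name a specific coefficient; a common choice is the Pearson correlation, sketched below under that assumption. The function name `pearson_r` and the three data series are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Sum of products of deviations, scaled by the two deviation norms
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

hours = [1, 2, 3, 4, 5]          # illustrative values
score_up = [2, 4, 6, 8, 10]      # rises as hours rise
score_down = [10, 8, 6, 4, 2]    # falls as hours rise
print(pearson_r(hours, score_up), pearson_r(hours, score_down))
```

The two results sit at the extremes of the range because each score series is an exact linear function of `hours`.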
Example program - 2
import numpy as np
import pandas as pd
# build a small DataFrame; the last AvgBill value is missing (NaN)
df = pd.DataFrame([('Indian Cinema', 'Restaurant', 289.0),
                   ('RamKrishna', 'Restaurant', 224.0),
                   ('Zingo', 'Juice bar', 80.5),
                   ('The Place', 'Play Club', np.nan)],
                  columns=('name', 'type', 'AvgBill'))
print(df)
# mean() and median() skip NaN values by default
print('AvgBill Mean = ', df['AvgBill'].mean())
print('AvgBill Median = ', df['AvgBill'].median())
Example program - 2 Output
            name        type  AvgBill
0  Indian Cinema  Restaurant    289.0
1     RamKrishna  Restaurant    224.0
2          Zingo   Juice bar     80.5
3      The Place   Play Club      NaN
AvgBill Mean =  197.83333333333334
AvgBill Median =  224.0
Example program - 3
import pandas as pd
# Create pandas DataFrame
data = pd.DataFrame({'col1': [5, 2, 7, 3, 4, 4, 2, 3, 2, 1, 2, 5],
                     'col2': ['y', 'x', 'x', 'z', 'x', 'y', 'y', 'x', 'z', 'x', 'z', 'x'],
                     'group': ['A', 'C', 'B', 'B', 'A', 'C', 'A', 'A', 'C', 'B', 'B', 'A']})
print(data)
# mode() returns a Series because several values can tie for most frequent
print('Col1 Mode = ', data['col1'].mode())
Example program - 3 Output
    col1 col2 group
0      5    y     A
1      2    x     C
2      7    x     B
3      3    z     B
4      4    x     A
5      4    y     C
6      2    y     A
7      3    x     A
8      2    z     C
9      1    x     B
10     2    z     B
11     5    x     A
Col1 Mode =  0    2
dtype: int64
Example program - 4
# calculate a 5-number summary
from numpy import percentile
from numpy.random import rand
# generate data sample (unseeded, so the printed values differ between runs)
data = rand(1000)
# calculate quartiles
quartiles = percentile(data, [25, 50, 75])
# calculate min/max
data_min, data_max = data.min(), data.max()
# print 5-number summary
print('Min: %.3f' % data_min)
print('Q1: %.3f' % quartiles[0])
print('Median: %.3f' % quartiles[1])
print('Q3: %.3f' % quartiles[2])
print('Max: %.3f' % data_max)
Example program - 4 Output
Min: 0.001
Q1: 0.269
Median: 0.509
Q3: 0.762
Max: 0.999
  
    
   
3.4 Scaling and Standardizing
* Feature scaling (also known as data normalization) is the method used to standardize the range of features of data. Since the range of values of data may vary widely, it becomes a necessary step in data preprocessing while exploring and visualizing data.
* Scaling of data may be useful and/or necessary under certain circumstances (e.g. when variables span different ranges). There are several different versions of scaling, the most important of which are listed below. Scaling procedures may be applied to the full data matrix or to parts of the matrix only (e.g. column-wise).
 
Range scaling
* Range scaling transforms the values to another range which usually includes both a shift and a change of the scale (magnification or reduction). In scaling (also called min-max scaling), one transforms the data such that the features are within a specific range, e.g. [0, 1].
* Scaling is important in algorithms such as Support Vector Machines (SVM) and k-nearest neighbors (KNN) where the distance between the data points is important. For example, in a dataset containing prices of products, without scaling SVM might treat 1 INR as equivalent to 1 Euro though 1 Euro = 90 INR.
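* The data samples are transformed according to the following equation :
$$y = \frac{R_{\max} - R_{\min}}{D_{\max} - D_{\min}}\,x + \frac{R_{\min} D_{\max} - R_{\max} D_{\min}}{D_{\max} - D_{\min}}$$
where $D_{\min}$ and $D_{\max}$ are the minimum and maximum of the original data and $R_{\min}$ and $R_{\max}$ are the bounds of the target range.
Fig. 3.4.1 Range scaling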
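Range scaling is a linear map with a slope of (R_max - R_min) / (D_max - D_min) and an intercept of (R_min * D_max - R_max * D_min) / (D_max - D_min), where D is the range of the data and R the target range. The sketch below implements that map in plain Python; the function name `range_scale` and the price values are assumptions made for this example.

```python
def range_scale(data, r_min=0.0, r_max=1.0):
    """Linearly map data from [min(data), max(data)] to [r_min, r_max]."""
    d_min, d_max = min(data), max(data)
    if d_max == d_min:
        raise ValueError("data has zero range and cannot be scaled")
    slope = (r_max - r_min) / (d_max - d_min)
    intercept = (r_min * d_max - r_max * d_min) / (d_max - d_min)
    return [slope * x + intercept for x in data]

prices = [10, 20, 25, 50]           # illustrative values
print(range_scale(prices))          # mapped to [0, 1]
print(range_scale(prices, -1, 1))   # mapped to [-1, 1]
```

With the default target range this reduces to the familiar min-max form (x - min) / (max - min).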
 
Example program - 5
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import minmax_scale, scale

# set seed for reproducibility
np.random.seed(0)
# generate random data points from an exponential distribution
x = np.random.exponential(size=1000)
# min-max scaling
scaled_data = minmax_scale(x)
# equivalent manual computation:
# scaled_data = (x - x.min()) / (x.max() - x.min())
# plot both together to compare
# (histplot with kde=True is used because distplot is deprecated in recent seaborn releases)
fig, ax = plt.subplots(1, 2)
sns.histplot(x, ax=ax[0], kde=True, stat="density")
ax[0].set_title("Original data")
sns.histplot(scaled_data, ax=ax[1], kde=True, stat="density", color='r')
ax[1].set_title("Scaled data")
plt.show()
Example program - 5 Output
Fig. 3.4.2 Original and scaled data (panels : Original data, Scaled data)
Mean centering
* Subtracting the mean of the data is often called "mean centering". It shifts the data so that it is centered on zero. The mean of the transformed data thereafter equals zero :
$$Y = X - \mu$$
Standardization and Normalization
* Standardization (sometimes also called autoscaling, z-transformation or z-score normalization) is the scaling procedure which results in a zero mean and unit variance of any descriptor variable. For every data value the mean μ has to be subtracted and the result has to be divided by the standard deviation σ (note that the order of these two operations must not be reversed) :
$$Y = \frac{X - \mu}{\sigma}$$
where X is the original feature vector, μ is the mean of that feature vector and σ is its standard deviation.
* The z-score comes from statistics, defined as,
$$z = \frac{x - \mu}{\sigma}$$
where μ is the mean and σ is the standard deviation.
* By subtracting the mean from the distribution, it is essentially being shifted towards left or right by an amount equal to the mean, i.e. if there is a distribution of mean 100 and one subtracts mean 100 from every value, then one is shifting the distribution left by 100 without changing its shape. Thus, the new mean will be 0. When it is divided by the standard deviation σ, the spread of the distribution is rescaled. The new standard deviation of this standardized distribution is 1, which one can verify by putting the new mean, μ = 0, in the z-score equation.
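The shift-then-rescale argument can be checked numerically. The sketch below uses five made-up values with mean 100 and the population standard deviation (`statistics.pstdev`); the choice of population rather than sample standard deviation is an assumption of this example.

```python
import statistics

values = [90, 95, 100, 105, 110]   # illustrative values with mean 100

mu = statistics.mean(values)
sigma = statistics.pstdev(values)   # population standard deviation

centered = [x - mu for x in values]             # shift only: the mean becomes 0
z_scores = [(x - mu) / sigma for x in values]   # shift, then rescale

print(statistics.mean(centered))
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```

Subtracting the mean moves the center to zero and leaves the spread unchanged; dividing by σ then brings the standard deviation to one.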
 
* The point of normalization is to change the observations so that they can be described as a normal distribution. Normal distribution (Gaussian distribution), also known as the bell curve, is a specific statistical distribution where roughly equal numbers of observations fall above and below the mean, the mean and the median are the same and there are more observations closer to the mean.
* Standardization (also called z-score normalization) transforms the data such that the resulting distribution has a mean of 0 and a standard deviation of 1.
* There is a need to normalize the data if it is going to be used in machine learning or data analysis techniques that assume that data is normally distributed, e.g. t-tests, ANOVAs, linear regression, Linear Discriminant Analysis (LDA) and Gaussian Naive Bayes.
Example program - 6
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import minmax_scale, scale

# generate random data points from an exponential distribution
x = np.random.exponential(size=1000)
# standardization (zero mean, unit variance)
standardized_data = scale(x)
# plot both together to compare
# (histplot with kde=True is used because distplot is deprecated in recent seaborn releases)
fig, ax = plt.subplots(1, 2)
sns.histplot(x, ax=ax[0], kde=True, stat="density", color='g')
ax[0].set_title("Original data")
sns.histplot(standardized_data, ax=ax[1], kde=True, stat="density", color='r')
ax[1].set_title("Standardized data")
plt.show()
Example program - 6 Output
Fig. 3.4.3 Original and Standardized data (panels : Original data, Standardized data)
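Example program - 7
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import minmax_scale, scale

# generate random data points from an exponential distribution
x = np.random.exponential(size=1000)
# normalization (mean normalization: center on the mean, divide by the range)
normalized_data = (x - x.mean()) / (x.max() - x.min())
# plot both together to compare
# (histplot with kde=True is used because distplot is deprecated in recent seaborn releases)
fig, ax = plt.subplots(1, 2)
sns.histplot(x, ax=ax[0], kde=True, stat="density", color='y')
ax[0].set_title("Original data")
sns.histplot(normalized_data, ax=ax[1], kde=True, stat="density")
ax[1].set_title("Normalized data")
plt.show()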
 
  
     
    
Example program - 7 Output
Fig. 3.4.4 Original and Normalized data (panels : Original data, Normalized data)
Example program - 8
import pandas as pd
import matplotlib.pyplot as plt

Survey = pd.DataFrame({'No': [1000, 2000, 3000],
                       'Yes': [400, 500, 600]})
df = pd.DataFrame(Survey)
print(df)
df.plot(kind='bar', color='red')

# Method 1: column-wise arithmetic on the whole DataFrame
df_normalized = (df - df.mean()) / df.std()
print('Normalized Data - Method 1')
print(df_normalized)
df_normalized.plot(kind='bar', color='blue')

# Method 2: apply the same formula to each column
normalized_df = df.apply(lambda x: (x - x.mean()) / x.std(), axis=0)
print('Normalized Data - Method 2')
print(normalized_df)
normalized_df.plot(kind='bar', color='green')
# display the three bar charts
plt.show()
Example program - 8 Output
Fig. 3.4.5 Normalized data
Normalized Data - Method 1
Fig. 3.4.7 Normalized data
3.5 Time Series and Smoothing Time Series
* A time series is a collection of observations of well-defined data items obtained through repeated measurements over time. Time series data, also referred to as time-stamped data, is a sequence of data points indexed in time order. These data points typically consist of successive measurements made from the same source over a fixed time interval and are used to track change over time.
* A time series is a data sequence ordered (or indexed) by time. It is discrete and the interval between each point is constant.
* Below are examples of time series data :
o Electrical impulse activity in the brain
o Rainfall measurements
o Stock prices
o Number of sunspots
o Annual retail sales
o Monthly subscribers
o Heartbeats per minute
o Sound frequency of audio.
3.5.1 Types of Time Series
* Time series can be classified into two different types : Stock and flow.
* A stock series is a measure of certain attributes at a point in time and can be thought of as "stocktakes". For example, the monthly labour force survey is a stock measure because it takes stock of whether a person was employed in the reference week.
* Flow series are series which are a measure of activity over a given period. For example, surveys of retail trade activity. Manufacturing is also a flow measure because a certain amount is produced each day and then these amounts are summed to give a total value for production for a given reporting period.
* The main difference between a stock and a flow series is that flow series can contain effects related to the calendar (trading day effects). Both types of series can still be seasonally adjusted using the same seasonal adjustment process.
3.5.2 Properties of Time Series
* An observed time series can be decomposed into three components : The trend (long term direction), the seasonal (systematic, calendar related movements) and the irregular (unsystematic, short term fluctuations).
o Trend (deterministic) - A long-term increase or decrease in the data. This can be seen as a slope (it doesn't have to be linear) roughly going through the data.
o Seasonality (deterministic) - A time series is said to be seasonal when it is affected by seasonal factors (hour of day, week, month, year, etc.). Seasonality can be observed as regular cyclical patterns of fixed frequency.
o Cyclicity (deterministic) - A cycle occurs when the data exhibits rises and falls that are not of a fixed frequency. These fluctuations are usually due to economic conditions and are often related to the "business cycle". The duration of these fluctuations is usually at least 2 years.
* Irregular components / remainder (stationary process) / residuals - Each time series can be decomposed in two parts, namely :
o A forecast, made up of one or several forecasted values
o Residuals : They are the difference between an observation and its predicted value at each time step.
Value of series at time t = Predicted value at time t + Residual at time t
3.5.3 Decomposition of a Time Series
* In order to remove the deterministic components, one can decompose the time series into separate stationary and deterministic components.
* Each time series can be thought of as a mix between several parts :
o A trend (upward or downward movement of the curve in the long term)
o A seasonal component
o Residuals.
Fig. 3.5.1 Time series properties (panels : Seasonality and Cyclical; Trend; Trend and Seasonality; No Deterministic Components)
* Stationarity is the property of exhibiting constant statistical properties (mean, variance, autocorrelation). If the mean of a time series increases over time, then it's not stationary.
* The general mathematical representation of the decomposition approach :
$$Y_t = f(T_t, S_t, E_t)$$
where $Y_t$ is the time series value (actual data) at period t;
$T_t$ is a deterministic trend-cycle or general movement component;
$S_t$ is a deterministic seasonal component;
$E_t$ is the irregular (remainder or residual) (stationary) component.
* The exact functional form of f(·) depends on the decomposition method used.
3.5.4 Trend Stationary Time Series
* A common approach is to assume that the equation has an additive form :
$$Y_t = T_t + S_t + E_t$$
Trend, seasonal and irregular components are simply added together to give the observed series.
* Alternatively, the multiplicative decomposition has the form :
$$Y_t = T_t \cdot S_t \cdot E_t$$
Trend, seasonal and irregular components are multiplied together to give the observed series.
* In both additive and multiplicative cases the series $Y_t$ is called a Trend Stationary (TS) series.
* This definition means that after removing the deterministic part from a TS series, what remains is a stationary series. If the historical data ends at time T and the process is additive, one can forecast the deterministic part h steps ahead by taking $\hat{T}_{T+h} + \hat{S}_{T+h}$, provided one knows the analytic expression for both trend and seasonal parts and the remainder is white noise (WN).
* Time series can also be described by another model, the Difference Stationary (DS) model.
* An additive model is appropriate if the magnitude of the seasonal fluctuations does not vary with the level of the time series. The multiplicative model is appropriate if the seasonal fluctuations increase or decrease proportionally with increases and decreases in the level of the series.
* Multiplicative decomposition is more prevalent with economic series because most seasonal economic series do have seasonal variations which increase with the level of the series.
* Rather than choosing either an additive or multiplicative decomposition, one could transform the data beforehand. For example, taking logarithms turns a multiplicative relationship into an additive one.
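One transformation often used for this purpose is the logarithm, since the log of a product is the sum of the logs. The sketch below checks this on four periods of made-up trend, seasonal and irregular values; all numbers are hypothetical.

```python
import math

# Hypothetical components for four periods (illustrative only)
trend = [100.0, 110.0, 120.0, 130.0]
seasonal = [0.8, 1.2, 0.8, 1.2]
irregular = [1.01, 0.99, 1.02, 0.98]

# Multiplicative series: Y_t = T_t * S_t * E_t
y = [t * s * e for t, s, e in zip(trend, seasonal, irregular)]

# After taking logs the components add: log Y_t = log T_t + log S_t + log E_t
log_y = [math.log(v) for v in y]
log_sum = [math.log(t) + math.log(s) + math.log(e)
           for t, s, e in zip(trend, seasonal, irregular)]

print(all(math.isclose(a, b) for a, b in zip(log_y, log_sum)))
```

An additive decomposition can then be applied to the logged series, which requires all values to be positive.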
Basic steps in decomposition
1. Estimate the trend. Two approaches :
o Using a smoothing procedure;
o Specifying a regression equation for the trend;
 
2. De-trending the series :
o For an additive decomposition, this is done by subtracting the trend estimates from the series;
o For a multiplicative decomposition, this is done by dividing the series by the estimated trend values.
3. Estimating the seasonal factors from the detrended series :
o Calculate the mean (or median) values of the detrended series for each specific period (for example, for monthly data, to estimate the seasonal effect of January, average the detrended values for all January values in the series, etc.);
o Alternatively, the seasonal effects could also be estimated along with the trend by specifying a regression equation.
o The number of seasonal factors is equal to the frequency of the series (e.g. monthly data = 12 seasonal factors, quarterly data = 4, etc.).
4. The seasonal effects should be normalized :
o For an additive model, seasonal effects are adjusted so that the average of the d seasonal components is 0 (this is equivalent to their sum being equal to 0);
o For a multiplicative model, the d seasonal effects are adjusted so that they average to 1 (this is equivalent to their sum being equal to d);
5. Calculate the irregular component (i.e. the residuals) :
o For an additive model $\hat{E}_t = Y_t - \hat{T}_t - \hat{S}_t$
o For a multiplicative model $\hat{E}_t = \dfrac{Y_t}{\hat{T}_t \cdot \hat{S}_t}$
* Analyze the residual component. Whichever method was used to decompose the series, the aim is to produce stationary residuals.
6. Choose a model to fit the stationary residuals (e.g. see ARMA models).
7. Forecasting can be achieved by forecasting the residuals and combining with the forecasts of the trend and seasonal components.
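Steps 1 to 5 of the list above can be sketched in plain Python for the additive case. This is a minimal illustration rather than a library routine: the function names `centered_ma` and `decompose_additive` and the toy quarterly series are assumptions made for the example, the trend is estimated with a centered moving average (one of the two approaches named in step 1), and the series is assumed to start at the first season of a cycle.

```python
from statistics import mean

def centered_ma(y, d):
    """Centered moving average for an even period d (None where undefined)."""
    half = d // 2
    trend = [None] * len(y)
    for t in range(half, len(y) - half):
        window = y[t - half:t + half + 1]   # d + 1 values around t
        # End points get half weight so that the d + 1 weights sum to 1.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / d
    return trend

def decompose_additive(y, d):
    trend = centered_ma(y, d)                                   # step 1
    detrended = [None if tr is None else v - tr                 # step 2
                 for v, tr in zip(y, trend)]
    seasonal = [mean([x for x in detrended[k::d] if x is not None])
                for k in range(d)]                              # step 3
    w = mean(seasonal)                                          # step 4
    seasonal = [s - w for s in seasonal]
    resid = [None if tr is None else v - tr - seasonal[t % d]   # step 5
             for t, (v, tr) in enumerate(zip(y, trend))]
    return trend, seasonal, resid

# Toy quarterly series: linear trend plus a repeating pattern that sums to 0.
pattern = [3.0, -1.0, -4.0, 2.0]
y = [10.0 + 2.0 * t + pattern[t % 4] for t in range(24)]
trend, seasonal, resid = decompose_additive(y, d=4)
```

Because the toy series contains no noise, the recovered seasonal effects match `pattern` and the residuals are zero wherever the trend is defined; with real data the residuals would then be checked for stationarity as in the remaining steps.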
Estimating the trend, $T_t$
* There are various ways to estimate the trend $T_t$ at time t, but a relatively simple procedure which does not assume any specific form of $T_t$ is to calculate a moving average centered on t.
* A moving average is an average of a specific number of time series values around each value of t in the time series, with the exception of the first few and last few terms (in Python, a procedure of this kind is available, for example, as the seasonal_decompose function of the statsmodels library). This method smooths the time series. The estimation depends on the seasonality of the time series :
o If the time series has no seasonal component;
o If the time series contains a seasonal component;
* Smoothing is usually done to help to better see patterns (like the trend) in the time series by smoothing out the irregular roughness to see a clearer signal. For seasonal data, one might smooth out the seasonality so that one can identify the trend.
Estimating $T_t$ if the time series has no seasonal component
* In order to estimate the trend, one can take any odd number of terms l. For example, if l = 3, the trend of an additive model can be estimated as,
$$\hat{T}_t = \frac{Y_{t-1} + Y_t + Y_{t+1}}{3} \quad \text{(two-sided averaging)}$$
$$\hat{T}_t = \frac{Y_{t-2} + Y_{t-1} + Y_t}{3} \quad \text{(one-sided averaging)}$$
* In this case, the average is calculated either,
o Centered around t - One element to the left (past) and one element to the right (future),
o Or alternatively - Two elements to the left of t (past values at t - 1 and t - 2).
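The two averaging schemes with l = 3 can be written out directly. The sketch below is only an illustration: the function names and the five-point toy series are assumptions, and positions where the window does not fit are marked with `None`.

```python
def two_sided_ma3(y):
    """Average of y[t-1], y[t], y[t+1]; undefined (None) at both ends."""
    inner = [(y[t - 1] + y[t] + y[t + 1]) / 3 for t in range(1, len(y) - 1)]
    return [None] + inner + [None]

def one_sided_ma3(y):
    """Average of y[t-2], y[t-1], y[t]; undefined (None) for the first two points."""
    return [None, None] + [(y[t - 2] + y[t - 1] + y[t]) / 3
                           for t in range(2, len(y))]

y = [1.0, 2.0, 6.0, 4.0, 8.0]     # toy data; at least 3 points are assumed
centered = two_sided_ma3(y)        # [None, 3.0, 4.0, 6.0, None]
trailing = one_sided_ma3(y)        # [None, None, 3.0, 4.0, 6.0]
```

The same three averages appear in both results, but the one-sided version assigns each of them to the last point of its window, so it uses only past values and lags behind the centered version by one step.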
Estimating $T_t$ if the time series contains a seasonal component
* If the time series contains a seasonal component and one wants to average it out, the length of the moving average must be equal to the seasonal frequency (for a monthly series, one would take l = 12). However, there is a slight hurdle. Suppose the time series begins in January (t = 1) and one averages up to December (t = 12). This average corresponds to time t = 6.5 (between June and July). When estimating seasonal effects, a moving average at integer times is needed. This can be achieved by averaging the average of January to December and the average of February (t = 2) up to January (t = 13). This average of the two moving averages corresponds to t = 7 and the process is called centering.
 
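Thus, the trend at time t can be estimated by the centered moving average,
$$\hat{T}_t = \frac{\dfrac{Y_{t-6} + \dots + Y_{t+5}}{12} + \dfrac{Y_{t-5} + \dots + Y_{t+6}}{12}}{2} = \frac{\tfrac{1}{2} Y_{t-6} + Y_{t-5} + \dots + Y_{t+5} + \tfrac{1}{2} Y_{t+6}}{12}$$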
 
 
 
* By using the seasonal frequency for the coefficients in the moving average, the procedure generalizes to any seasonal frequency (i.e. quarterly, weekly, etc. series), provided the condition that the coefficients sum up to unity is still met.
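The coefficients of the centered moving average can be generated for any seasonal period, which makes the "sum up to unity" condition easy to check. The helper names `centered_weights` and `trend_at` are hypothetical, and the sketch assumes that an even period uses d + 1 terms with half weight on the two end points, while an odd period uses a plain d-term average.

```python
def centered_weights(d):
    """Weights of a moving average that averages out a seasonal period d."""
    if d % 2 == 1:
        return [1.0 / d] * d                  # odd period: plain average
    # Even period: d + 1 terms, the two end points get half weight.
    return [0.5 / d] + [1.0 / d] * (d - 1) + [0.5 / d]

def trend_at(y, t, d):
    """Centered moving-average estimate at 0-based index t (window must fit)."""
    w = centered_weights(d)
    half = len(w) // 2
    window = y[t - half:t + half + 1]
    return sum(wi * yi for wi, yi in zip(w, window))

monthly = centered_weights(12)     # 13 weights: 1/24, 1/12, ..., 1/12, 1/24
quarterly = centered_weights(4)    # 5 weights: 1/8, 1/4, 1/4, 1/4, 1/8

# Thirteen monthly values Y_1..Y_13; the estimate refers to t = 7 (index 6).
y13 = [float(v) for v in range(1, 14)]
trend_july = trend_at(y13, 6, 12)
```

For the purely linear toy series the centered estimate reproduces the value at t = 7, which is the behaviour one expects from a symmetric set of weights that sums to one.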
Estimating the seasonal component, $S_t$
* An estimate of $S_t$ at time t can be obtained by subtracting $\hat{T}_t$ :
$$\hat{S}_t = Y_t - \hat{T}_t$$
* By averaging these estimates of the monthly effects for each month (January, February etc.), one obtains a single estimate of the effect for each month. That is, if the seasonality period is d, then :
$$S_t = S_{t+d}$$
* Seasonal factors can be thought of as expected variations from trend throughout a seasonal period, so one would expect them to cancel each other out over that period - i.e. they should add up to zero.
$$\sum_{t=1}^{d} S_t = 0$$
* It should be noted that this applies to the additive decomposition.
Estimating the seasonal component, $S_t$
* If the estimated (average) seasonal factors $\hat{S}_t$ do not add up to zero, then one can correct them by dividing the sum of the seasonal estimates by the seasonality period and adjusting each seasonal factor. For example, if the seasonal period is d, then
1. Calculate the total sum : $\sum_{t=1}^{d} \hat{S}_t$
2. Calculate the value $w = \dfrac{1}{d} \sum_{t=1}^{d} \hat{S}_t$
3. Adjust each period $\hat{S}_t = \hat{S}_t - w$
* Now, the seasonal components add up to zero :
$$\sum_{t=1}^{d} \hat{S}_t = 0$$
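The three-step correction is short enough to write out in full. In the sketch below the function names and the toy seasonal effects are assumptions; the multiplicative variant is included only for comparison and follows the rule stated in the basic steps (effects should average to 1).

```python
def normalize_additive(seasonal):
    """Shift the d seasonal effects so that they sum to 0."""
    w = sum(seasonal) / len(seasonal)      # steps 1 and 2: total sum divided by d
    return [s - w for s in seasonal]       # step 3: adjust each period

def normalize_multiplicative(seasonal):
    """Rescale the d seasonal effects so that they average to 1."""
    m = sum(seasonal) / len(seasonal)
    return [s / m for s in seasonal]

raw = [2.0, -1.0, 0.5, 0.5]                # toy quarterly effects, sum = 2.0
adjusted = normalize_additive(raw)         # w = 0.5
```

With these toy values w = 0.5, so the adjusted effects are 1.5, -1.5, 0.0 and 0.0, which add up to zero as required.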
 
* It is common to present economic indicators, such as unemployment percentages, as seasonally adjusted series. This highlights any trend that might otherwise be masked by seasonal variation (for example, at the end of the academic year, when school and university graduates are seeking work). If the seasonal effect is additive, a seasonally adjusted series is given by $Y_t - \hat{S}_t$.
* The described moving-average procedure usually describes the time series in question quite successfully; however, it does not allow one to forecast it.
* To decide upon the mathematical form of a trend, one must first draw the plot of the time series.
* If the behavior of the series is rather ‘regular’, one can choose a parametric trend - usually it is a low order polynomial in t, exponential, inverse or similar functions.
* In any case, the smoothing method is acceptable if the residuals $\hat{\epsilon}_t = Y_t - \hat{T}_t - \hat{S}_t$ constitute a stationary process.
* If there are a few competing trend specifications, the best one can be chosen by AIC, BIC or similar criteria.
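* An alternative approach is to create models for all but some $T_0$ end points and then choose the model whose forecast fits the original data best. To select the model, one can use such characteristics as :
Root Mean Square Error, $RMSE = \sqrt{\dfrac{1}{T_0} \sum_{t=T-T_0+1}^{T} \hat{\epsilon}_t^2}$
Mean Absolute Percentage Error, $MAPE = \dfrac{100}{T_0} \sum_{t=T-T_0+1}^{T} \left| \dfrac{\hat{\epsilon}_t}{Y_t} \right|$
and similar statistics. Here $\hat{\epsilon}_t$ denotes the forecast error at time t and the sums run over the $T_0$ held-out end points.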
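The hold-out comparison can be sketched as follows. The function names, the three held-out values and the two sets of forecasts are made up for illustration, and the forecast error is taken to be the actual value minus the forecast.

```python
import math

def rmse(actual, forecast):
    """Root mean square error over the held-out points."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mape(actual, forecast):
    """Mean absolute percentage error in percent; assumes no actual value is 0."""
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 / len(ratios) * sum(ratios)

held_out = [100.0, 200.0, 400.0]          # the last T0 = 3 observations
forecasts = {                             # hypothetical competing models
    "model_a": [110.0, 190.0, 400.0],
    "model_b": [120.0, 220.0, 360.0],
}
best = min(forecasts, key=lambda name: rmse(held_out, forecasts[name]))
```

Here `model_a` is selected because its errors on the held-out points are smaller; ranking by `mape` instead of `rmse` only requires changing the function used in the `key` argument.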
3.5.5 Transforms used for Stationarizing Data
* Detrending - One can remove the underlying trend in the series. This can be done in several ways, depending on the nature of data.
* Indexed data : Data measured in currencies are linked to a price index or related to inflation. Dividing the series by this index (that is, deflating) element-wise is therefore the solution to de-trend the data.
* Non-indexed data : It is necessary to estimate if the trend is constant, linear or exponential. The first two cases are easy; for the last one it is necessary to estimate a growth rate (inflation or deflation) and apply the same method as for indexed data.
* Differencing - Seasonal or cyclical patterns can be removed by subtracting periodic values. If the data is 12-month seasonal, subtracting from the series its value 12 periods earlier (a 12-lag difference) will give a “flatter” series.
* Logging - In the case where the compound rate in the trend is not due to a price index (i.e. the series is not measured in a currency), logging can help linearize a series with an exponential trend (recall that log(exp(x)) = x). It does not remove an eventual trend whatsoever, unlike deflation.
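Differencing and logging can both be tried on small synthetic series. Everything in the sketch below (the helper `difference`, the 12-step pattern, the growth rate of 0.05) is an assumption chosen so that the effect is easy to see.

```python
import math

def difference(y, lag=1):
    """Return y[t] - y[t - lag] for every t where both terms exist."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

# Toy monthly series: linear trend plus a pattern that repeats every 12 steps.
pattern = [5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 3, 4]
monthly = [2 * t + pattern[t % 12] for t in range(48)]
flat = difference(monthly, lag=12)     # the seasonal pattern cancels out

# Toy series with an exponential trend: logging makes it linear in t.
growth = [100.0 * math.exp(0.05 * t) for t in range(20)]
logged = [math.log(v) for v in growth]
log_steps = difference(logged)         # roughly constant, close to 0.05
```

The 12-lag difference leaves a constant series, while the logged series still rises by about 0.05 per step: logging has turned the exponential trend into a linear one without removing it, in line with the remark above.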
3.5.6 Checking Stationarity
Plotting rolling statistics
* Plotting rolling means and variances is a good first way to visually inspect the series. If the rolling statistics exhibit a clear trend (upwards or downwards) and show varying variance (increasing or decreasing amplitude), then one might conclude that the series is very likely not to be stationary.
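Rolling statistics are simple to compute before plotting them. The sketch below uses only the standard library; the helper `rolling`, the window of 12 and the toy series (whose level and amplitude both grow) are assumptions for illustration, and the plotting step itself is left out.

```python
from statistics import mean, pvariance

def rolling(y, window, stat):
    """Apply stat to each run of `window` consecutive values of y."""
    return [stat(y[i - window + 1:i + 1]) for i in range(window - 1, len(y))]

# Toy series whose level and amplitude both grow over time.
y = [0.1 * t + (0.05 * t) * (-1) ** t for t in range(60)]
roll_mean = rolling(y, 12, mean)
roll_var = rolling(y, 12, pvariance)
```

For this series both `roll_mean` and `roll_var` end far above where they start, which is the visual signature of non-stationarity described above; for a stationary series both curves would stay roughly level.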
Augmented Dickey-Fuller test
* This test is used to assess whether or not a time series is stationary. Without going into too much detail about hypothesis testing, one should know that its null hypothesis is that the series has a unit root (i.e. is non-stationary), and that the test gives a result called a “test statistic”, based on which one can say, with different levels (or percentages) of confidence, whether the time series is stationary or not.
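To show where such a test statistic comes from, the sketch below computes the basic Dickey-Fuller statistic with a constant and without the extra lagged differences that make the test "augmented". It is a simplified stand-in, not the full test: the function name, the seeds and the sample size are assumptions, and -2.86 is the approximate large-sample 5 % critical value for the constant-only case. Full implementations are available in libraries, for example `adfuller` in statsmodels.

```python
import math
import random

def dickey_fuller_stat(y):
    """t-statistic of rho in: y[t] - y[t-1] = c + rho * y[t-1] + error."""
    x = y[:-1]
    dy = [cur - prev for prev, cur in zip(y[:-1], y[1:])]
    n = len(dy)
    x_bar, dy_bar = sum(x) / n, sum(dy) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (di - dy_bar) for xi, di in zip(x, dy))
    rho = sxy / sxx                                   # OLS slope
    c = dy_bar - rho * x_bar                          # OLS intercept
    sse = sum((di - c - rho * xi) ** 2 for xi, di in zip(x, dy))
    se_rho = math.sqrt(sse / (n - 2) / sxx)           # standard error of rho
    return rho / se_rho

random.seed(0)
noise = [random.gauss(0, 1) for _ in range(300)]      # stationary toy series
walk, level = [], 0.0
for e in noise:                                       # random walk: has a unit root
    level += e
    walk.append(level)

CRITICAL_5PCT = -2.86    # approximate 5 % value, constant-only case
stat_noise = dickey_fuller_stat(noise)
stat_walk = dickey_fuller_stat(walk)
```

The null hypothesis of a unit root is rejected when the statistic falls below the critical value. The white-noise series produces a strongly negative statistic, while the random walk built from the same shocks gives a value much closer to zero.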
KPSS
* The KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test takes as its null hypothesis that the series is trend stationary. In other words, if the p-value of the test statistic is below the chosen significance threshold (for example 5 %), one can reject this hypothesis and conclude that the series is not trend-stationary. A p-value higher than the threshold means the hypothesis cannot be rejected, so the series is treated as trend-stationary.
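The KPSS statistic itself can be sketched in a few lines: the series is detrended by a straight-line fit, the residuals are accumulated into partial sums, and the sum of their squares is scaled by a long-run variance estimate. The function name, the rule-of-thumb bandwidth and the toy series are assumptions; 0.146 is the published 5 % critical value for the trend case, and the null hypothesis is rejected when the statistic exceeds it. Library versions such as `kpss` in statsmodels also report a p-value.

```python
import random

def kpss_trend_stat(y, lags=None):
    """KPSS statistic for the null hypothesis of trend stationarity."""
    n = len(y)
    t_bar = (n - 1) / 2
    y_bar = sum(y) / n
    stt = sum((t - t_bar) ** 2 for t in range(n))
    slope = sum((t - t_bar) * (v - y_bar) for t, v in enumerate(y)) / stt
    resid = [v - y_bar - slope * (t - t_bar) for t, v in enumerate(y)]
    if lags is None:
        lags = int(4 * (n / 100) ** 0.25)    # rule-of-thumb bandwidth (assumption)
    partial, ssq = 0.0, 0.0
    for e in resid:                          # sum of squared partial sums
        partial += e
        ssq += partial * partial
    lrv = sum(e * e for e in resid) / n      # long-run variance, Bartlett weights
    for j in range(1, lags + 1):
        gamma = sum(resid[i] * resid[i - j] for i in range(j, n)) / n
        lrv += 2 * (1 - j / (lags + 1)) * gamma
    return ssq / (n * n * lrv)

CRITICAL_5PCT = 0.146    # 5 % critical value, trend case

random.seed(1)
trend_plus_noise = [0.5 * t + random.gauss(0, 1) for t in range(200)]
curved = [0.01 * t * t for t in range(200)]   # not stationary around a line

stat_ok = kpss_trend_stat(trend_plus_noise)
stat_curved = kpss_trend_stat(curved)
```

The curved series leaves highly persistent residuals after the line fit, so its statistic lies well above the critical value and trend stationarity is rejected, whereas the noisy linear series yields a much smaller statistic.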
Autocorrelation plots (ACF & PACF)
* An autocorrelation (ACF) plot represents the autocorrelation of the series with lags of itself. A partial autocorrelation (PACF) plot represents the amount of correlation between a series and a lag of itself that is not explained by correlations at all lower-order lags. Usually, one would want no correlation between the series and lags of itself. Graphically speaking, one would like all the spikes to fall in the blue region (the shaded confidence band of the plot).
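The numbers behind an ACF plot can be computed by hand. The sketch below covers the sample autocorrelation only (not the PACF); the function name and the toy series are assumptions, and the band of 1.96 divided by the square root of n is the usual approximate 95 % band for a series with no autocorrelation.

```python
import math

def acf(y, nlags):
    """Sample autocorrelation of y at lags 0..nlags."""
    n = len(y)
    y_bar = sum(y) / n
    dev = [v - y_bar for v in y]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(nlags + 1)]

y = [1.0, -1.0] * 10                   # toy series that flips sign every step
rho = acf(y, nlags=3)
band = 1.96 / math.sqrt(len(y))        # approximate 95 % band
outside = [k for k in range(1, len(rho)) if abs(rho[k]) > band]
```

For this strongly patterned series every lag from 1 to 3 falls outside the band, so the spikes would stick out of the shaded region; for an uncorrelated series most lags would stay inside it.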
Choosing a model
* Exponential smoothing methods are appropriate for non-stationary data (i.e. data with a trend and seasonal data). ARIMA (Autoregressive Integrated Moving Average) models should be used on stationary data only. One should therefore remove the trend of the data (via deflating or logging) and then look at the differenced series.
 
3.5.7 Smoothing Methods
* The smoothing technique is a family of time-series forecasting algorithms which utilize weighted averages of previous observations to predict or forecast a new value. This technique is more efficient when time-series data is moving slowly over time. It combines error, trend and seasonal components into the computation of the smoothing parameters.
* Smoothing methods work as weighted averages. Forecasts are weighted averages of past observations. The weights can be uniform (this is a moving average) or follow an exponential decay - this means giving more weight to recent observations and less weight to old observations. More advanced methods include other parts in the forecast, like seasonal components and trend components.
1. Simple exponential smoothing
* Simple Exponential Smoothing (SES) is one of the simplest of the exponential smoothing algorithms. SES is a method of time series forecasting used with univariate data with no trend and no seasonal pattern. It needs a single parameter called alpha (α), also known as the smoothing factor. Alpha controls the rate at which the influence of past observations decreases exponentially. The parameter is often set to a value between 0 and 1.
o The simple exponential smoothing formula is given by,
$$s_t = \alpha x_t + (1 - \alpha) s_{t-1} = s_{t-1} + \alpha (x_t - s_{t-1})$$
where,
$s_t$ = Smoothed statistic (simple weighted average of current observation $x_t$)
$s_{t-1}$ = Previous smoothed statistic
$\alpha$ = Smoothing factor of data; 0 < α < 1
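The simple exponential smoothing recursion translates directly into a loop. In this sketch the function name and the toy data are assumptions, and so is the initialisation: the first smoothed value is set equal to the first observation, which is one common choice among several.

```python
def simple_exp_smoothing(x, alpha):
    """Return the smoothed series, starting from s_0 = x[0] (an assumption)."""
    s = [x[0]]
    for value in x[1:]:
        # s_t = alpha * x_t + (1 - alpha) * s_{t-1}
        s.append(alpha * value + (1 - alpha) * s[-1])
    return s

x = [10.0, 20.0, 30.0]
smoothed = simple_exp_smoothing(x, alpha=0.5)   # [10.0, 15.0, 22.5]
forecast = smoothed[-1]                          # flat one-step-ahead forecast
```

A larger alpha makes the smoothed series follow the data more closely (alpha = 1 reproduces the observations), while a smaller alpha gives more weight to the past and produces a smoother curve.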
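When the series also has a trend, a trend term $b_t$ is added and the smoothing equations become,
$$s_t = \alpha x_t + (1 - \alpha)(s_{t-1} + b_{t-1})$$
$$b_t = \beta (s_t - s_{t-1}) + (1 - \beta) b_{t-1}$$
where,
$b_t$ = Best estimate of the trend at time t
$\beta$ = Trend smoothing factor; 0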