Syllabus
Introduction to Single Variable : Distributions and Variables - Numerical Summaries of
Level and Spread - Scaling and Standardizing - Inequality - Smoothing Time Series.
Univariate Analysis
Contents
Introduction to Single Variable
Storing and Importing Data using Python
Numerical Summaries of Level and Spread
Scaling and Standardizing
Time Series and Smoothing Time Series
Two Marks Questions with Answers
3.1 Introduction to Single Variable
• Exploratory data analysis is cross-classified in two different ways : each method is either graphical or non-graphical and, further, each method is either univariate, bivariate or multivariate.
• Univariate analysis is the simplest analysis of statistical data. The term univariate analysis refers to the analysis of one variable. The prefix "uni" means "one." The purpose of univariate analysis is to understand the distribution of values for a single variable. Univariate analysis explores each variable in a data set separately.
• In other words, in univariate analysis the data has only one variable. It doesn't deal with causes or relationships and its major purpose is to describe data; it takes data, summarizes that data and finds patterns in the data.
• Some ways one can describe patterns found in univariate data include central tendency (mean, mode and median) and dispersion : range, variance, maximum, minimum, quartiles (including the interquartile range) and standard deviation.
• Univariate analysis works by examining the effects of a singular variable on a set of data. For example, a frequency distribution table is a form of univariate analysis as frequency is the only variable being measured. Alternative variables may be age, height, weight, etc. However, it is important to note that as soon as a secondary variable is introduced it becomes bivariate analysis; with three or more variables, it becomes multivariate analysis.
3.1.1 Univariate Statistics
• Univariate analysis can be performed in a statistical setting. Two types of statistics can be used for analysis, namely descriptive and inferential.
Descriptive statistics
• As the name suggests, descriptive statistics are used to describe data. The statistics used here are commonly referred to as summary statistics.
• Descriptive statistics can be used for calculating things like missing value proportions, upper and lower limits for outliers, the level of variance through the coefficient of variance, etc.
Inferential statistics
• Often, the data one is dealing with is a subset (sample) of the complete data (population). Thus, the common question here is :
o Can the findings of the sample be extrapolated to the population ? That is, is the sample representative of the population, or has the population changed ? Such questions are answered using specific hypothesis tests designed to deal with such univariate data-based problems.
• Hypothesis tests help to answer crucial questions about the data and their relation with the population from where they were drawn. Several hypothesis (univariate) testing mechanisms come in handy here, such as :
1. Z Test - Used for numerical (quantitative) data where the sample size is greater than or equal to 30 and the population's standard deviation is known.
2. One-Sample t-Test - Used for numerical (quantitative) data where the sample size is less than 30 or the population's standard deviation is unknown (a minimal sketch of such a test is shown after this list).
3. Chi-Square Test - Used with ordinal categorical data.
4. Kolmogorov-Smirnov Test - Used with nominal categorical data.
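The sketch below illustrates such a univariate hypothesis test using SciPy's one-sample t-test; the sample values and the hypothesized population mean of 50 are hypothetical and only for illustration.

# Minimal sketch of a one-sample t-test (hypothetical data)
import numpy as np
from scipy import stats

sample = np.array([52.1, 48.3, 50.9, 47.5, 53.2, 49.8, 51.4, 46.9, 50.2, 48.7])

# H0 : the population mean equals 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print('t-statistic =', t_stat)
print('p-value =', p_value)

# reject H0 at the 5 % significance level if the p-value is below 0.05
if p_value < 0.05:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')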
Below are common methods for performing univariate analysis :
1. Summary statistics
2. Frequency distributions
3. Charts
4. Univariate tables.
1. Summary statistics
• The most common way to perform univariate analysis is to use summary statistics to describe a variable. The main kinds of summary statistics are :
1. Measures of central tendency - These values describe where the dataset's center or middle value is located. The mean, mode and median are examples.
2. Dispersion measures - These numbers describe how evenly distributed the values are in the dataset. The range, standard deviation and variance are some examples.
3. Measure of shape - The shape of the data distribution can explain a great deal about the data, as the shape can help in identifying the type of distribution followed by the data. Each of these distributions has specific properties that can be used to one's advantage. By analyzing the shape, one will know if the data is symmetrical or non-symmetrical, left or right-skewed, or suffering from positive or negative kurtosis, among other things.
2. Frequency distributions
• A frequency distribution describes how frequently different values occur in a dataset. This acts as another way to perform univariate analysis.
3. Charts
• Another method for performing univariate analysis is to create charts that show the distribution of values for a specific variable.
• Various types of graphs can be used to understand data. The standard types of graphs include the following (a combined code sketch follows the list) :
1. Histograms : A histogram displays the frequency of each value or group of values (bins) in numerical data. This helps in understanding how the values are distributed.
2. Boxplot : A boxplot provides several important pieces of information such as the minimum, maximum, median, 1st and 3rd quartiles. It is beneficial in identifying outliers in the data.
3. Density curve : The density curve helps in understanding the shape of the data’s
distribution. It helps answer questions such as if the data is bimodal, normally
distributed, skewed, etc.
4. Bar chart : A bar chart, mainly the frequency bar chart, is a univariate chart used to find the frequency of the different categories of categorical data.
5. Pie chart : Frequency Pie charts convey similar information to bar charts. The
difference is that they have a circular formation with each slice indicating the share
of each category in the data.
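As a minimal sketch of these chart types, the snippet below draws a histogram, boxplot, density curve and frequency bar chart with matplotlib and seaborn; the 'Age' and 'Plays' values are hypothetical sample data, not from the text.

# Minimal sketch of the univariate chart types (hypothetical data)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({'Age': [23, 25, 31, 35, 35, 41, 47, 52, 58, 64],
                   'Plays': ['Cricket', 'Tennis', 'Cricket', 'Badminton', 'Cricket',
                             'Tennis', 'Cricket', 'Badminton', 'Tennis', 'Cricket']})

fig, ax = plt.subplots(2, 2, figsize=(10, 8))
ax[0, 0].hist(df['Age'], bins=5)                           # histogram of a numerical variable
ax[0, 0].set_title('Histogram')
ax[0, 1].boxplot(df['Age'])                                # boxplot : median, quartiles, outliers
ax[0, 1].set_title('Boxplot')
sns.kdeplot(df['Age'], ax=ax[1, 0])                        # density curve of the distribution
ax[1, 0].set_title('Density curve')
df['Plays'].value_counts().plot(kind='bar', ax=ax[1, 1])   # frequency bar chart for categories
ax[1, 1].set_title('Bar chart')
plt.tight_layout()
plt.show()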
4. Univariate tables
Tables help in univariate analysis and are typically used with categorical data or
numerical data with limited cardinality. Different types of tables include :
1. Frequency tables : Each unique value and its respective frequency in the data is
shown through a table. Thus, it summarizes the frequency the way a histogram,
frequency bar or pie chart does but in a tabular manner.
2. Grouped tables : Rather than finding the count of each unique value, the values are binned or grouped and the frequency of each group is reflected in the table. It is typically used for numerical data with high cardinality.
"3, Percentage (Proportion) tables : Rather than showing the frequency of the unique
values (or groups), such a table shows their proportion in the data (in percentage).
4. Cumulative proportion tables : It is similar to the proportion table, with the
difference being that the proportion is shown cumulatively. It is typically used with
binned data having a distinct order (or with categorical ordinal data).
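A minimal sketch of these table types with pandas is shown below; the 'plays' and 'age' values are hypothetical sample data.

# Minimal sketch of univariate tables (hypothetical data)
import pandas as pd

plays = pd.Series(['Cricket', 'Tennis', 'Cricket', 'Badminton', 'Cricket',
                   'Tennis', 'Badminton', 'Cricket'])

# 1. Frequency table
print(plays.value_counts())

# 2. Grouped table for numerical data (binned frequencies)
age = pd.Series([21, 25, 31, 35, 38, 41, 47, 52, 58, 64])
print(pd.cut(age, bins=[20, 30, 40, 50, 60, 70]).value_counts().sort_index())

# 3. Percentage (proportion) table
print(plays.value_counts(normalize=True) * 100)

# 4. Cumulative proportion table
print(plays.value_counts(normalize=True).cumsum())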
3.1.2 Variable and Distribution in Univariate Analysis
• A variable in univariate analysis is a condition or subset that data falls into. A variable can be thought of as a "category." For example, the analysis might work on a variable "height" or it might work on "weight". Univariate analysis can be carried out on any of the individual variables in the dataset to gain a better understanding of its distribution of values.
Univariate data examples
o The salaries of employees in a specific industry; the variable in this example is the employees' salaries.
o The heights of ten students in a class are measured; the variable here is the students' heights.
o A veterinarian wants to weigh 20 cats; the variable, in this case, is the weight of the cats.
o Finding the average height of a country's men from a sample.
o Calculating how reliable a batsman is by computing the variance of their runs.
o Finding which country most frequently wins the Olympic gold medal by creating a frequency bar chart or frequency table.
o Understanding the income distribution of a country by analyzing the distribution's shape. A right-skewed distribution can indicate an unequal society.
o Checking if the price of sugar has risen statistically significantly from the generally accepted price by using sample survey data. Hypothesis tests such as the Z or t-test solve such questions.
o Assessing the predictive capability of a variable by calculating the coefficient of variance.
Distribution and variables
Types of variables : Variables can be one of two types : Categorical or numerical.
Categorical data
• Categorical data classify items into groups. This type of data can be further broken down into nominal, ordinal and binary values.
o Ordinal values have a set order. An example here could be a ranking of low to high.
o Nominal values have no set order. Examples include the superhero's gender and alignment.
o Binary data has only two values. This could be represented as true / false or 1 / 0.
• A common way to summarize categorical variables is with a frequency table.
• Columns holding categorical data : Gender, Married, BankCustomer, Industry, Ethnicity, PriorDefault, Employed, DrivingLicense, Citizen, Approved.
Numerical data
• Numerical data are values that one can perform mathematical operations on. They are further broken down into continuous and discrete data types.
o Discrete variables have to be an integer. An example is the number of superheroes.
o Continuous variables can be any value. Examples here include height and weight.
• Numerical data can be visualized with a histogram. Histograms are a great first analysis of continuous data. Four main aspects to consider here are shape, center, spread and outliers.
o Shape is the overall appearance of the histogram. It can be symmetric, skewed, uniform or have multiple peaks.
o Center refers to the mean or median.
o Spread refers to the range or how far the data reaches.
o Outliers are data points that fall far from the bulk of the data.
• Columns holding numerical and continuous data : Age, Debt, YearsEmployed, CreditScore, Income.
3.2 Storing and Importing Data using Python
• There are various methods to import data in Python. One of the ways to import data is using the Pandas library.
• In its simplest form, data can be stored in a CSV file. CSV stands for "Comma Separated Values." It is the simplest form of storing data in tabular form as plain text.
• The CSV file structure is very simple : the first line of a CSV file is the header and contains the names of the fields/features separated by a comma. After the header, each line of the file is a data set value/observation/record. The values of a record are separated by a comma.
Steps to import using Pandas
1. Get the correct and full file path
• Firstly, capture the full path where the CSV file is stored.
• For example, a CSV file is stored under the following path :
D:\Tech\data\myteam.csv
2. File name - It should be made sure that the file name specified in the code matches the actual file name.
3. File extension - The file extension should always be '.csv' when importing CSV files.
Example program - 1
# Python code to import the file
import pandas as pd
# read the csv file (put 'r' before the path string to address any special
# characters in the path, such as '\')
df = pd.read_csv(r'D:\Tech\data\myteam.csv')
print(df)
Example program 1 - Output
  Name     City        Plays
  Lucky    Pune        Cricket
  Aniket   Mumbai      Cyclist
  Lahu     Kolhapur    Tennis
  Shital   Amarawati   Badminton
  …        Nashik      Cricket
  Monit    Ratnagiri   Tennis
  Jaya     …           Badminton
  Dev      …           Cyclist
  Anita    …           Badminton
  Rashmi   Dhule       …
3.3 Numerical Summaries of Level and Spread
• A numerical summary is a number used to describe a specific characteristic about a data set. For a quantitative variable, the features of the distribution that are of primary interest are the level of the distribution, the amount of dispersion in the distribution and the shape of the distribution.
• Below are some of the useful numerical summaries (a quick way to obtain many of them at once is shown after the list) :
o Center : Mean, median, mode
o Quantiles : Percentiles, five-number summary
o Spread : Standard deviation, variance, interquartile range
o Outliers
o Shape : Skewness, kurtosis
o Concordance : Correlation, quantile-quantile plots.
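As a hedged sketch, pandas can produce several of these summaries in a single call; the income values below are hypothetical.

# Minimal sketch of numerical summaries with pandas (hypothetical data)
import pandas as pd

income = pd.Series([32000, 41000, 38000, 55000, 47000, 61000, 39000, 44000])
print(income.describe())            # count, mean, std, min, quartiles, max
print('Skewness =', income.skew())  # shape : skewness
print('Kurtosis =', income.kurt())  # shape : excess kurtosis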
Mean
• This is the point of balance, describing the most typical value for normally distributed data. Because it uses every value, the mean is highly influenced by outliers.
• The mean adds up all the data values and divides by the total number of values, as follows :
x̄ = (x_1 + x_2 + ... + x_n) / n = (1/n) Σ (i = 1 to n) x_i
• The 'x-bar' is used to represent the sample mean (the mean of a sample of data). Σ (sigma) implies the addition of all values from 'i = 1' until 'i = n' ('n' is the number of data values). The result is then divided by 'n'.
Median
• This is the "middle data point", where half of the data is below the median and half is above the median. It's the 50th percentile of the data. It's also mostly used with skewed data because outliers won't have a big effect on the median.
• There are two formulas to compute the median. The choice of which formula to use depends on whether n (the number of data points in the sample, or sample size) is even or odd.
Median = (x_(n/2) + x_(n/2 + 1)) / 2
When n is even, there is no "middle" data point, so the middle two values are averaged.
Median = x_((n + 1)/2)
When n is odd, the middle data point is the median.
Mode
• The mode returns the most commonly occurring data value.
Percentile
• The percentile describes the percent of data that is equal to or less than a given data point. It's useful for describing where a data point stands within the data set. If the percentile is close to zero, then the observation is one of the smallest. If the percentile is close to 100, then the data point is one of the largest in the data set.
Quartiles (Five-number summary)
• Quartiles describe the center and are also great for describing the spread of the data. They are highly useful with skewed data. There are four quartiles and, combined with the minimum, they compose the five-number summary. The five-number summary is composed of :
1. Minimum
2. 25th percentile (lower quartile)
3. 50th percentile (median)
4. 75th percentile (upper quartile)
5. 100th percentile (maximum)
Standard deviation
Standard deviation is extensively used in statistics and data science. It measures the amount
of variation or dispersion of a data set, calculating how spread out the data are from the
mean. Small values mean the data is consistent and close to the mean. Larger values
indicate the data is highly variable.
• Deviation : The idea is to use the mean as a reference point from which everything varies. A deviation is defined as the distance an observation lies from the reference point. This distance is obtained by subtracting the data point (x_i) from the mean (x-bar).
• Calculating the standard deviation : The average of all the deviations will always turn out to be zero, so one can square each deviation and sum up the results. Then, one can divide it by 'n − 1' (called the degrees of freedom). Further, one takes the square root of the final result to undo the squaring of the deviations.
• The standard deviation is a representation of all deviations in the data. It's never negative and it's zero only if all the values are the same.
Variance
• Variance is almost the same calculation as the standard deviation, but it stays in squared units. So, if one takes the square root of the variance, one gets the standard deviation. Note that it's represented by 's-squared', while the standard deviation is represented by 's'.
s² = Σ (i = 1 to n) (x_i − x̄)² / (n − 1)
Range
• The difference between the maximum and minimum values. Useful for some basic exploratory analysis, but not as powerful as the standard deviation.
Range = x_max − x_min
Proportion
• It's often referred to as a "percentage". It defines the percent of observations in the data set that satisfy some requirement :
p = x / n
where x is the number of observations satisfying the requirement and n is the total number of observations.
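A minimal sketch computing the spread measures above with pandas; the sample values are hypothetical. Note that pandas' std() and var() divide by n − 1 by default.

# Minimal sketch of spread measures (hypothetical data)
import pandas as pd

x = pd.Series([4, 8, 6, 5, 3, 7, 9, 5])

print('Standard deviation =', x.std())     # sample standard deviation (divides by n - 1)
print('Variance =', x.var())                # sample variance
print('Range =', x.max() - x.min())
print('Proportion > 5 =', (x > 5).mean())   # share of observations satisfying a condition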
Correlation
• Defines the strength and direction of the association between two quantitative variables. It ranges between −1 and 1. Positive correlations mean that one variable increases as the other variable increases. Negative correlations mean that one variable decreases as the other increases. When the correlation is zero, there is no correlation at all. The closer the result is to one of the extremes, the stronger the association between the two variables.
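A minimal sketch of the correlation coefficient with pandas; the height/weight values are hypothetical.

# Minimal sketch of correlation (hypothetical data)
import pandas as pd

df = pd.DataFrame({'height': [150, 155, 160, 165, 170, 175, 180],
                   'weight': [50, 54, 59, 64, 68, 74, 80]})

print(df['height'].corr(df['weight']))   # Pearson correlation, between -1 and 1
print(df.corr())                         # full correlation matrix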
Example program - 2
import pandas as pd
import numpy as np
df = pd.DataFrame([('Indian Cinema', 'Restaurant', 289.0),
                   ('RamKrishna', 'Restaurant', 224.0),
                   ('Zingo', 'Juice bar', 80.5),
                   ('The Place', 'Play Club', np.nan)],
                  columns=('name', 'type', 'AvgBill'))
print(df)
print('AvgBill Mean = ', df['AvgBill'].mean())
print('AvgBill Median = ', df['AvgBill'].median())
ple program - 2 Output
Exam!
roane type AvgBill
| jedian Cinema Restaurant 289.0
pomKrishna Restaurant — 2240
ingo Juice bar 80.5
‘the Place Play Club NaN
lavgBill Mean = 197.89393333333934
[avgBill Median = 224.0
Example program - 3
import pandas as pd
# Create pandas DataFrame.
data = pd.DataFrame({'col1': [5, 2, 7, 3, 4, 4, 2, 3, 2, 1, 2, 5],
                     'col2': ['y', 'x', 'x', 'z', 'x', 'y', 'y', 'z', 'x', 'z', 'z', 'x'],
                     'group': ['A', 'C', 'B', 'B', 'A', 'C', 'A', 'A', 'C', 'B', 'B', 'A']})
print(data)
print('Col1 Mode =', data['col1'].mode())
Example program - 3 Output
    col1 col2 group
0      5    y     A
1      2    x     C
2      7    x     B
3      3    z     B
4      4    x     A
5      4    y     C
6      2    y     A
7      3    z     A
8      2    x     C
9      1    z     B
10     2    z     B
11     5    x     A
Col1 Mode = 0    2
dtype: int64
Example program - 4
# calculate a 5-number summary
from numpy import percentile
from numpy.random import rand
# generate data sample
data = rand(1000)
# calculate quartiles
quartiles = percentile(data, [25, 50, 75])
# calculate min/max
data_min, data_max = data.min(), data.max()
# print 5-number summary
print('Min: %.3f' % data_min)
print('Q1: %.3f' % quartiles[0])
print('Median: %.3f' % quartiles[1])
print('Q3: %.3f' % quartiles[2])
print('Max: %.3f' % data_max)
Example program - 4 Output
‘Min: 0.001
Q1: 0.269
“Median : 0.509
03: 0.762
‘Max: 0.999
3.4 Scaling and Standardizing
• Feature scaling (also known as data normalization) is the method used to standardize the range of features of data. Since the range of values of data may vary widely, it becomes a necessary step in data preprocessing while exploring and visualizing data.
• Scaling of data may be useful and/or necessary under certain circumstances (e.g. when variables span different ranges). There are several different versions of scaling, the most important of which are listed below. Scaling procedures may be applied to the full data matrix or to parts of the matrix only (e.g. column-wise).
Range scaling
• Range scaling transforms the values to another range, which usually includes both a shift and a change of scale (magnification or reduction). In range scaling (also called min-max scaling), one transforms the data such that the features are within a specific range, e.g. [0, 1].
• Scaling is important in algorithms such as Support Vector Machines (SVM) and k-nearest neighbors (KNN) where the distance between data points is important. For example, in a dataset containing prices of products, without scaling, SVM might treat 1 INR as equivalent to 1 Euro even though 1 Euro ≈ 90 INR.
• The data samples are transformed according to the following equation :
Y = R_min + (X − D_min) × (R_max − R_min) / (D_max − D_min)
where D_min and D_max are the minimum and maximum of the original data and R_min and R_max are the lower and upper bounds of the target range.
Fig. 3.4.1 Range scaling
Example program - 5
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import minmax_scale, scale
# set seed for reproducibility
np.random.seed(0)
# generate random data points from an exponential distribution
x = np.random.exponential(size=1000)
# min-max scaling
scaled_data = minmax_scale(x)
# scaled_data = (x - x.min()) / (x.max() - x.min())
# plot both together to compare
f, ax = plt.subplots(1, 2)
sns.distplot(x, ax=ax[0], color='g')
ax[0].set_title("Original data")
sns.distplot(scaled_data, ax=ax[1], color='r')
ax[1].set_title("Scaled data")
plt.show()
Example program - 5 Output
Fig. 3.4.2 Original and scaled data
Mean centering
* Subtracting the mean of the data is often called "mean centering". It results in a shift of the
data towards the mean. The mean of the transformed data thereafter equals zero :
Y = X − μ
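A minimal sketch of mean centering with NumPy; the data values are hypothetical.

# Minimal sketch of mean centering (hypothetical data)
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
centered = x - x.mean()    # shift the data so the mean becomes zero
print(centered)            # [-4. -2.  0.  2.  4.]
print(centered.mean())     # 0.0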
Standardization and Normalization
• Standardization (sometimes also called autoscaling, z-transformation or z-score normalization) is the scaling procedure which results in a zero mean and unit variance of any descriptor variable. For every data value, the mean μ has to be subtracted and the result has to be divided by the standard deviation σ (note that the order of these two operations must not be reversed) :
Y = (X − μ) / σ
where X is the original feature vector, μ is the mean of that feature vector and σ is its standard deviation.
• The z-score comes from statistics, defined as
z = (x − μ) / σ
where μ is the mean.
• By subtracting the mean from the distribution, the distribution is essentially shifted towards the left or right by an amount equal to the mean, i.e. if a distribution has a mean of 100 and one subtracts 100 from every value, then one shifts the distribution left by 100 without changing its shape. Thus, the new mean will be 0. When it is divided by the standard deviation σ, the shape of the distribution is changed. The new standard deviation of this standardized distribution is 1, which one can get by putting the new mean μ = 0 in the z-score equation.
• The point of normalization is to change the observations so that they can be described as a normal distribution. The normal distribution (Gaussian distribution), also known as the bell curve, is a specific statistical distribution where roughly equal numbers of observations fall above and below the mean, the mean and the median are the same and there are more observations closer to the mean.
• Standardization (also called z-score normalization) transforms the data such that the resulting distribution has a mean of 0 and a standard deviation of 1.
• There is a need to normalize the data if the data is going to be used in machine learning or data analysis techniques that assume that the data is normally distributed, e.g. t-tests, ANOVAs, linear regression, Linear Discriminant Analysis (LDA) and Gaussian Naive Bayes.
Example program - 6
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import minmax_scale, scale
x = np.random.exponential(size=1000)
# standardization
standardized_data = scale(x)
# plot both together to compare
f, ax = plt.subplots(1, 2)
sns.distplot(x, ax=ax[0], color='g')
ax[0].set_title("Original data")
sns.distplot(standardized_data, ax=ax[1], color='r')
ax[1].set_title("Standardized data")
plt.show()
Example program - 6 Output
Fig. 3.4.3 Original and Standardized data
Example program - 7
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import minmax_scale, scale
x = np.random.exponential(size=1000)
# normalization
normalized_data = (x - x.mean()) / (x.max() - x.min())
# plot both together to compare
f, ax = plt.subplots(1, 2)
sns.distplot(x, ax=ax[0], color='y')
ax[0].set_title("Original data")
sns.distplot(normalized_data, ax=ax[1])
ax[1].set_title("Normalized data")
plt.show()
Example program - 7 Output
Fig. 3.4.4 Original and Normalized data
Example program - 8
import pandas as pd
Survey = {'No': [1000, 2000, 3000],
          'Yes': [400, 500, 600]}
df = pd.DataFrame(Survey)
print(df)
df.plot(kind='bar', color='red')
df_normalized = (df - df.mean()) / df.std()
print('Normalized Data - Method 1')
print(df_normalized)
df_normalized.plot(kind='bar', color='blue')
normalized_df = df.apply(lambda x: (x - x.mean()) / x.std(), axis=0)
print('Normalized Data - Method 2')
print(normalized_df)
normalized_df.plot(kind='bar', color='green')
Example program - 8 Output
Fig. 3.4.5 Normalized data
Normalized Data - Method 1
Fig. 3.4.7 Normalized data
3.5 Time Series and Smoothing Time Series
• A time series is a collection of observations of well-defined data items obtained through repeated measurements over time. Time series data, also referred to as time-stamped data, is a sequence of data points indexed in time order. These data points typically consist of successive measurements made from the same source over a fixed time interval and are used to track change over time.
• A time series is a data sequence ordered (or indexed) by time. It is discrete and the interval
between each point is constant.
• Below are some examples of time series data :
o Electrical impulse activity in the brain
© Rainfall measurements
© Stock prices
© Number of sunspots
© Annual retail sales
©, Monthly subscribers
© Heartbeats per minute
o Sound frequency of audio.
3.5.1 Types of Time Series
© Time series can be classified into two different types : Stock and flow.
• A stock series is a measure of certain attributes at a point in time and can be thought of as "stocktakes". For example, the monthly labour force survey is a stock measure because it takes stock of whether a person was employed in the reference week.
© Flow series are series which are a measure of activity over a given period. For example,
surveys of retail trade activity. Manufacturing is also a flow measure because a certain
amount is produced each day and then these amounts are summed to give a total value for
production for a given reporting period.
© The main difference between a stock and a flow series is that flow series can contain effects
related to the calendar (trading day effects). Both types of series can still be seasonally
adjusted using the same seasonal adjustment process.
3.5.2 Properties of Time Series
* An observed time series can be decomposed into three components : The trend (long term
direction), the seasonal (systematic, calendar related movements) and the irregular
(unsystematic, short term fluctuations).
© Trend (deterministic) - A long-term increase or decrease in the data. This can be seen as
a slope (it doesn’t have to be linear) roughly going through the data.
o Seasonality (deterministic) - A time series is said to be seasonal when it is affected by seasonal factors (hour of day, week, month, year, etc.). Seasonality can be observed with
nice cyclical patterns of fixed frequency.
o Cyclicity (deterministic) - A cycle occurs when the data exhibits rises and falls that are
not of a fixed frequency. These fluctuations are usually due to economic conditions and
are often related to the “business cycle”. The duration of these fluctuations is usually at
least 2 years.
• Irregular component / remainder (stationary process) / Residuals - Each time series can be decomposed into two parts, namely :
© A forecast, made up of one or several forecasted values
© Residuals : They are the difference between an observation and its predicted value at
each time step.
Value of series at time t = Predicted value at time t + Residual at time t
3.5.3 Decomposition of a Time Series
• In order to remove the deterministic components, one can decompose the time series; the idea is to separate the stationary and deterministic components.
• Each time series can be thought of as a mix of several parts :
o A trend (upward or downward movement of the curve over the long term)
o A seasonal component
o Residuals.
Fig. 3.5.1 Time series properties (panels : Trend; Seasonality and cyclical; Trend and seasonality; No deterministic components)
• Stationarity is the property of exhibiting constant statistical properties (mean, variance, autocorrelation). If the mean of a time series increases over time, then it's not stationary.
• The general mathematical representation of the decomposition approach is :
Y_t = f(T_t, S_t, E_t)
where Y_t is the time series value (actual data) at period t;
T_t is a deterministic trend-cycle or general movement component;
S_t is a deterministic seasonal component;
E_t is the irregular (remainder or residual, stationary) component.
• The exact functional form of f(·) depends on the decomposition method used.
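As a hedged sketch, the classical decomposition described here can be carried out with statsmodels' seasonal_decompose; the synthetic monthly series below (a linear trend plus a 12-month seasonal pattern plus noise) is hypothetical.

# Minimal sketch of classical decomposition with statsmodels (hypothetical data)
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range('2015-01-01', periods=72, freq='M')
t = np.arange(72)
y = pd.Series(10 + 0.5 * t + 5 * np.sin(2 * np.pi * t / 12)
              + np.random.normal(scale=1.0, size=72), index=idx)

# additive decomposition into trend, seasonal and residual components
result = seasonal_decompose(y, model='additive', period=12)
print(result.trend.head(8))
print(result.seasonal.head(8))
print(result.resid.head(8))
result.plot()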
3.5.4 Trend Stationary Time Series
A common approach is to assume that the equation has an additive form :
Y_t = T_t + S_t + E_t
• Trend, seasonal and irregular components are simply added together to give the observed
series.
© Alternatively, the multiplicative decomposition has the form :
Y_t = T_t × S_t × E_t
Trend, seasonal and irregular components are multiplied together to give the observed
series.
• In both the additive and multiplicative cases the series Y_t is called a Trend Stationary (TS) series.
• This definition means that after removing the deterministic part from a TS series, what remains is a stationary series. If the historical data ends at time T and the process is additive, one can forecast the deterministic part by taking T_(T+h) + S_(T+h), provided one knows the analytic expression for both the trend and seasonal parts and the remainder is white noise (WN).
• Time series can also be described by another model, the Difference Stationary (DS) model.
• An additive model is appropriate if the magnitude of the seasonal fluctuations does not vary with the level of the time series. The multiplicative model is appropriate if the seasonal fluctuations increase or decrease proportionally with increases and decreases in the level of the series.
Multiplicative decomposition is more prevalent with economic series because most
seasonal economic series do have seasonal variations which increase with the level of the
series.
• Rather than choosing either an additive or multiplicative decomposition, one could transform the data beforehand.
Basic steps in decomposition (1)
1. Estimate the trend. Two approaches :
© Using a smoothing procedure;
© Specifying a regression equation for the trend;
2. De-trending the series :
o For an additive decomposition, this is done by subtracting the trend estimates from the series;
o For a multiplicative decomposition, this is done by dividing the series by the estimated trend values.
3. Estimating the seasonal factors from the detrended series :
‘© Calculate the mean (or median) values of the detrended series for each specific period
(for example, for monthly data - To estimate the seasonal effect of January - average the
detrended values for all January values in the series etc);
° Alternatively, the seasonal effects could also be estimated along with the trend by
specifying a regression equation.
o The number of seasonal factors is equal to the frequency of the series (e.g. monthly data
= 12 seasonal factors, quarterly data = 4, etc.),
4, The seasonal effects should be normalized :
o For an additive model, seasonal effects are adjusted so that the average of d seasonal
components is 0 (this is equivalent to their sum being equal to 0);
o For a multiplicative model, the d seasonal effects are adjusted so that they average to 1
(this is equivalent to their sum being equal to d);
5. Calculate the irregular component (i.e. the residuals) :
o For an additive model : E_t = Y_t − T_t − S_t
o For a multiplicative model : E_t = Y_t / (T_t × S_t)
• Analyze the residual component. Whichever method was used to decompose the series, the aim is to produce stationary residuals.
6. Choose a model to fit the stationary residuals (e.g. see ARMA models).
7. Forecasting can be achieved by forecasting the residuals and combining with the forecasts
of the trend and seasonal components.
Estimating the trend, T_t
• There are various ways to estimate the trend T_t at time t, but a relatively simple procedure which does not assume any specific form of T_t is to calculate a moving average centered on t.
• A moving average is an average of a specific number of time series values around each value of t in the time series, with the exception of the first few and last few terms (this procedure is available in Python with the decompose function). This method smoothes the time series. The estimation depends on the seasonality of the time series :
o If the time series has no seasonal component;
o If the time series contains a seasonal component.
• Smoothing is usually done to help to better see patterns (like the trend) in the time series by smoothing out the irregular roughness to get a clearer signal. For seasonal data, one might smooth out the seasonality so that one can identify the trend.
Estimating T_t if the time series has no seasonal component
• In order to estimate the trend, one can take any odd number of terms, for example l = 3, and estimate, for an additive model :
T_t = (Y_(t−1) + Y_t + Y_(t+1)) / 3   (two-sided averaging)
T_t = (Y_(t−2) + Y_(t−1) + Y_t) / 3   (one-sided averaging)
• In this case, the average is calculated either :
o Centered around t - one element to the left (past) and one element to the right (future),
o Or alternatively - two elements to the left of t (past values at t − 1 and t − 2).
Estimating T_t if the time series contains a seasonal component
If the time series contains a seasonal component and one wants to average it out, the length
of the moving average must be equal to the seasonal frequency (for monthly series, one would take l = 12). However, there is a slight hurdle. Suppose the time series begins in January (t = 1) and one can average up to December (t = 12). This average corresponds to
a time t = 6.5 (time between June and July). While estimating seasonal effects, there is a
need of moving average at integer times. This can be achieved by averaging the average of
January to December and the average of February (t = 2) up to January (t = 13). This
average of the two moving averages corresponds to t = 7 and the process is called
centering.
Thus, the centered moving average for monthly data is
T_t = [ (Y_(t−6) + Y_(t−5) + ... + Y_(t+5)) / 12 + (Y_(t−5) + Y_(t−4) + ... + Y_(t+6)) / 12 ] / 2
    = (0.5 Y_(t−6) + Y_(t−5) + ... + Y_(t+5) + 0.5 Y_(t+6)) / 12
• By using the seasonal frequency for the coefficients in the moving average, the procedure generalizes for any seasonal frequency (i.e. quarterly, weekly, etc. series), provided the condition that the coefficients sum up to unity is still met.
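A minimal sketch of trend estimation with centered moving averages using pandas' rolling(); the synthetic monthly series is hypothetical, and the 2 × 12 average is formed by averaging two consecutive 12-term means as described above.

# Minimal sketch of moving-average trend estimation (hypothetical data)
import numpy as np
import pandas as pd

t = np.arange(48)
y = pd.Series(20 + 0.3 * t + 4 * np.sin(2 * np.pi * t / 12))

# simple 3-point centered moving average (no seasonal component)
trend_3 = y.rolling(window=3, center=True).mean()

# 2 x 12 centered moving average for monthly (seasonal) data
trend_12 = y.rolling(window=12, center=True).mean()
trend_2x12 = trend_12.rolling(window=2, center=True).mean()
print(trend_2x12.dropna().head())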
Estimating the seasonal component, S_t
• An estimate of S_t at time t can be obtained by subtracting the trend estimate T_t :
S_t = Y_t − T_t
• By averaging these estimates of the monthly effects for each month (January, February, etc.), one obtains a single estimate of the effect for each month. That is, if the seasonality period is d, then the seasonal estimates repeat with period d :
S_t = S_(t+d)
Seasonal factors can be thought of as expected variations from trend throughout a seasonal
period, so one would expect them to cancel each other out over that period - i.e. they
should add up to zero.
Σ (t = 1 to d) S_t = 0
• It should be noted that this applies to the additive decomposition.
Estimating the seasonal component, S_t
• If the estimated (average) seasonal factors S_t do not add up to zero, then one can correct them by dividing the sum of the seasonal estimates by the seasonality period and adjusting each seasonal factor. For example, if the seasonal period is d, then :
1. Calculate the total sum : Σ (t = 1 to d) S_t
2. Calculate the value w = ( Σ (t = 1 to d) S_t ) / d
3. Adjust each period : S_t = S_t − w
• Now, the seasonal components add up to zero :
Σ (t = 1 to d) S_t = 0
• It is common to present economic indicators such as unemployment percentages as seasonally adjusted series. This highlights any trend that might otherwise be masked by seasonal variation (for example, attributable to the end of the academic year, when school and university graduates are seeking work). If the seasonal effect is additive, a seasonally adjusted series is given by Y_t − S_t.
• The described moving-average procedure usually describes the time series in question quite successfully; however, it does not allow one to forecast it.
* To decide upon the mathematical form of a trend, one must first draw the plot of the time
series.
* If the behavior of the series is rather ‘regular’, one can choose a parametric trend - usually
it is a low order polynomial in t, exponential, inverse or similar functions.
• In any case, the smoothing method is acceptable if the residuals E_t = Y_t − T_t − S_t constitute a stationary process.
• If there are a few competing trend specifications, the best one can be chosen by AIC, BIC or similar criteria.
• An alternative approach is to create models for all but some T_0 end points and then choose the model whose forecast fits the original data best. To select the model, one can use such characteristics as the
Root Mean Square Error : RMSE = sqrt( (1/T_0) Σ (t = T−T_0+1 to T) E_t² )
Mean Absolute Percentage Error : MAPE = (100/T_0) Σ (t = T−T_0+1 to T) | E_t / Y_t |
where E_t is the difference between the actual value Y_t and its forecast at time t, and similar statistics.
3.5.5 Transforms used for Stationarizing Data
• Detrending - One can remove the underlying trend in the series. This can be done in several ways, depending on the nature of the data.
• Indexed data : Data measured in currencies are linked to a price index or related to inflation. Dividing the series by this index (that is, deflating) element-wise is therefore the solution to de-trend the data.
• Non-indexed data : It is necessary to estimate whether the trend is constant, linear or exponential. The first two cases are easy; for the last one it is necessary to estimate a growth rate (inflation or deflation) and apply the same method as for indexed data.
• Differencing - Seasonal or cyclical patterns can be removed by subtracting periodic values. If the data is 12-month seasonal, subtracting the series with a 12-lag difference will give a "flatter" series.
• Logging - In the case where the compounded rate in the trend is not due to a price index (i.e. the series is not measured in a currency), logging can help linearize a series with an exponential trend (recall that log(exp(x)) = x). It does not remove an eventual trend whatsoever, unlike deflation.
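A minimal sketch of the differencing and logging transforms with pandas; the synthetic series below (an exponential trend with a 12-month seasonal pattern) is hypothetical.

# Minimal sketch of detrending transforms (hypothetical data)
import numpy as np
import pandas as pd

t = np.arange(60)
y = pd.Series(np.exp(0.02 * t) * (100 + 10 * np.sin(2 * np.pi * t / 12)))

log_y = np.log(y)              # logging linearizes an exponential trend
seasonal_diff = y.diff(12)     # 12-lag difference removes a 12-month seasonal pattern
first_diff = log_y.diff(1)     # first difference of the logged series removes the trend
print(seasonal_diff.dropna().head())
print(first_diff.dropna().head())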
3.5.6 Checking Stationarity
Plotting rolling statistics
« Plotting rolling means and variances is a first good way to visually inspect the series. If the
rolling statistics exhibit a clear trend (upwards or downwards) and show varying variance
(increasing or decreasing amplitude), then one might conclude that the series is very likely
not to be stationary.
Augmented Dickey-Fuller test
• This test is used to assess whether or not a time series is stationary. Without getting into too much detail about hypothesis testing, one should know that this test will give a result
called a “test-statistic”, based on which one can say, with different levels (or percentage) of
confidence, if the time-series is stationary or not.
KPSS
The KPSS (Kwiatkowski-Phillips-Schmidt-Shin) tests for the null hypothesis that the series
is trend stationary. In other words, if the p-value of the test statistic is below the X %
confidence threshold, this means one can reject this hypothesis and that the series is not
trend-stationary with X % confidence. A p-value higher than the threshold will lead to accepting this hypothesis and concluding that the series is trend-stationary.
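A hedged sketch of both stationarity tests using statsmodels; the random-walk series is hypothetical.

# Minimal sketch of the ADF and KPSS stationarity tests (hypothetical data)
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

np.random.seed(0)
y = np.cumsum(np.random.normal(size=200))   # a random walk (non-stationary)

adf_stat, adf_pvalue = adfuller(y)[:2]
print('ADF statistic = %.3f, p-value = %.3f' % (adf_stat, adf_pvalue))

kpss_stat, kpss_pvalue = kpss(y, regression='ct', nlags='auto')[:2]
print('KPSS statistic = %.3f, p-value = %.3f' % (kpss_stat, kpss_pvalue))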
Autocorrelation plots (ACF & PACF)
* An autocorrelation (ACF) plot represents the autocorrelation of the series with lags of itself.
A partial autocorrelation (PACF) plot represents the amount of correlation between a series
and a lag of itself that is not explained by correlations at all lower-order lags. Usually, one would want no correlation between the series and lags of itself. Graphically speaking, one would like all the spikes to fall in the blue region.
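A minimal sketch of the autocorrelation plots with statsmodels; the white-noise series is hypothetical, so its spikes should stay inside the confidence band.

# Minimal sketch of ACF and PACF plots (hypothetical data)
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

np.random.seed(1)
y = np.random.normal(size=200)   # white noise : no significant autocorrelation expected

fig, ax = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(y, lags=24, ax=ax[0])
plot_pacf(y, lags=24, ax=ax[1])
plt.show()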
Choosing a model
• Exponential smoothing methods are appropriate for non-stationary data (i.e. data with a trend and seasonal data). ARIMA (Autoregressive Integrated Moving Average) models should be used on stationary data only. One should therefore remove the trend of the data (e.g. by deflating or logging) and then look at the differenced series.
3.5.7 Smoothing Methods
• The smoothing technique is a family of time-series forecasting algorithms, which utilizes the weighted averages of previous observations to predict or forecast a new value. This
technique is more efficient when time-series data is moving slowly over time. It harmonizes
errors, trends and seasonal components into computing smoothing parameters.
• Smoothing methods work as weighted averages. Forecasts are weighted averages of past observations. The weights can be uniform (this is a moving average) or following an
exponential decay - This means giving more weight to recent observations and less weight
to old observations. More advanced methods include other parts in the forecast, like
seasonal components and trend components.
1. Simple exponential smoothing
* Simple Exponential Smoothing (SES) is one of the minimal models of the exponential
smoothing algorithms. SES is the method of time series forecasting used with univariate
data with no trend and no seasonal pattern. It needs a single parameter called alpha (α), also known as the smoothing factor. Alpha controls the rate at which the influence of past observations decreases exponentially. The parameter is often set to a value between 0 and 1. This method can be used to predict series that do not have a trend or seasonality.
• The simple exponential smoothing formula is given by :
s_t = α x_t + (1 − α) s_(t−1) = s_(t−1) + α (x_t − s_(t−1))
where
s_t = smoothed statistic (simple weighted average of the current observation x_t)
s_(t−1) = previous smoothed statistic
α = smoothing factor of the data; 0 < α < 1
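A hedged sketch of simple exponential smoothing with statsmodels' SimpleExpSmoothing; the demand series and the choice α = 0.4 are hypothetical.

# Minimal sketch of simple exponential smoothing (hypothetical data)
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

demand = pd.Series([28, 31, 29, 33, 32, 30, 34, 33, 35, 34])

# fit with a fixed smoothing factor alpha = 0.4
model = SimpleExpSmoothing(demand).fit(smoothing_level=0.4, optimized=False)
print(model.fittedvalues)    # smoothed statistic s_t at each step
print(model.forecast(3))     # flat forecast for the next 3 periods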
2. Double exponential smoothing
• Double exponential smoothing (Holt's method) adds a trend component :
s_t = α x_t + (1 − α) (s_(t−1) + b_(t−1))
b_t = β (s_t − s_(t−1)) + (1 − β) b_(t−1)
where,
b_t = best estimate of the trend at time t
β = trend smoothing factor; 0 < β < 1