
DSBA Codebook

Preface

Data Science is the art and science of solving real-world problems and making data-driven decisions. It is an amalgamation of three aspects, and a good data scientist has expertise in all three of them. These are:
1) Mathematical/statistical understanding
2) Coding/technology understanding
3) Domain knowledge
A lack of expertise in any of these should not become an impediment in your Data Science journey. With consistent effort, you can become fairly proficient in coding over a period of time. This Codebook is intended to help you become comfortable with the finer nuances of Python and can be used as a handy reference for anything related to data science code throughout the program journey and beyond.

In this document we have followed this structure:

- A brief description of the topic
- Followed by a code example.

Please keep in mind that there is no one right way to write code to achieve an intended outcome; there can be multiple ways of doing things in Python. The examples presented in this document use just one of the possible approaches to perform the analysis. Please explore different ways of doing the same thing on your own.

Contents

PREFACE
TEXT MINING
    Important Libraries
TIME SERIES FORECASTING
Text Mining
Text Analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical features of a fixed size rather than raw text documents of variable length.
source: scikit-learn
Most of the data in the real world is unstructured text, and the process of mining this unstructured data, or pre-processing the text to extract useful insights, is called Text Mining.
A few terms used in Text Mining:
1. Bag of Words: A simplification of text that disregards grammar and word order.
2. Corpus: A large collection of text documents.
3. Stop Words: Common words that are not useful for deriving meaningful insights, e.g. articles and prepositions.
4. Stemming: Reducing different variations of a word to a common root word, e.g. "chopped" and "chopping" are reduced to "chop".
5. Term Document Matrix (TDM): A matrix containing the number of occurrences of each term in each document.
6. Document Term Matrix (DTM): The transpose of the TDM.
7. Term Frequency (TF): The normalized count of a term occurring in each document.
8. Inverse Document Frequency (IDF): Put very simply, IDF penalizes terms that occur in almost every document, e.g. "a", "an", "the".
9. Lexicon: A list of words.
10. Bigrams: Collections of words taken two at a time (see the short example after this list).
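
As a quick illustration of the last term, the snippet below extracts bigrams from a toy sentence using nltk.util.ngrams; the sentence and variable names are placeholders chosen only for this sketch.

from nltk.util import ngrams

# a toy sentence used only for illustration
sentence = "text mining turns unstructured text into structured data"
tokens = sentence.split()            # simple whitespace tokenization for this sketch

# collect all pairs of consecutive words (bigrams)
bigrams = list(ngrams(tokens, 2))
print(bigrams[:3])                   # e.g. [('text', 'mining'), ('mining', 'turns'), ('turns', 'unstructured')]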

To analyse the text data, we can start by removing the stop words and then go on to stem the remaining words. Removing punctuation is also usually a good idea.

from nltk.corpus import stopwords

# 'words' is assumed to be the variable containing the tokenized text data
filtered_words = [word for word in words if word not in stopwords.words('english')]

For stemming purposes, we can use the various stemmers available in Python. The following is an example of the Porter Stemmer:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
sentence = "Programers program with programing languages"
words = word_tokenize(sentence)

for w in words:
    print(w, " : ", ps.stem(w))

Programers : program
program : program
with : with
programing : program
languages : language

You will notice that for stemming we have tokenized the data. A tokenizer is simply a function that breaks a string into a list of
words. In the following code snippet, we are removing the punctuation marks from a document of text.

import nltk
nltk.download('punkt')

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
result = tokenizer.tokenize('hey! how are you ? buddy')
print(result)

['hey', 'how', 'are', 'you', 'buddy']

To convert a collection of text documents into a matrix of token counts, sklearn provides the built-in CountVectorizer function. Do refer to the sklearn documentation to understand more about the function.

Source: scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one.',
    'Is this the first document?',
]

X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())   # called get_feature_names_out() in newer scikit-learn versions

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Important Libraries

Now, let us try to understand the functionality of TF-IDF.

TF * IDF gives us a value for each word that tells us how significant that word is in the corpus.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
processed_features = vectorizer.fit_transform(processed_features).toarray()

Two things to check:

(i) The variable processed_features should be defined before running this code (it is created in the regular-expression example below).
(ii) Each element of processed_features should be a single document, i.e. one sample.

Do look at the documentation of the TfidfVectorizer function in sklearn to learn more about the parameters that can be passed to this function.
Source: scikit-learn
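
As a minimal, self-contained sketch of how the resulting TF-IDF scores can be inspected (the corpus below is a toy example, and the default vectorizer settings are used so that the tiny corpus is not filtered away):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'cats and dogs are friends',
]

vectorizer = TfidfVectorizer()        # default settings, suitable only for this toy corpus
tfidf = vectorizer.fit_transform(corpus)

# one row per document, one column per term, values are the TF-IDF scores
scores = pd.DataFrame(tfidf.toarray(),
                      columns=vectorizer.get_feature_names_out())   # get_feature_names() in older scikit-learn versions
print(scores.round(2))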
After building the TF-IDF matrix, you have successfully converted the unstructured text into structured numeric data, and the data can now be used for various Unsupervised or Supervised Learning problems.

Along with nltk (Natural Language Toolkit) and sklearn, we can also use regular expressions (the re module) to clean our text data.

The following are just a few examples of regular expressions:

import re

processed_features = []

for sentence in range(0, len(features)):   # the unstructured text has been saved in the variable 'features'

    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', str(features[sentence]))

    # Remove all single characters
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Remove single characters from the start
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

    # Substituting multiple spaces with a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)

    # Removing prefixed 'b'
    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    # Converting to lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)

The regular expression library is imported at the top of the snippet with 'import re'; make sure this import has been run before executing the substitutions.

To plot a Word Cloud, refer to the following code snippet. The biggest words in the Word Cloud are the words that occur most often. Do remember to remove the stop words and stem the words before building the word cloud, as this helps to correctly identify the words that occur most frequently. Sometimes word clouds are plotted without stemming the words as well.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

stop_words = set(stopwords.words('english'))   # initialise stop words from the English language

filtered_sentence = []                         # empty list

for i in processed_features:                   # iterating through each sentence in processed_features
    word_tokens = word_tokenize(i)             # converting each sentence to tokens
    for w in word_tokens:                      # in each token list, removing English stop words
        if w not in stop_words:
            filtered_sentence.append(w)        # appending non-stop words to the filtered_sentence list

comment_words = ' '                            # empty string
stop_words = set(STOPWORDS)                    # stop words from the wordcloud library

for words in filtered_sentence:
    comment_words = comment_words + words + ' '   # converting the word list back into a single string

wordcloud = WordCloud(width=1000, height=1000,    # word cloud image creation
                      background_color='white',
                      stopwords=stop_words,
                      min_font_size=10).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

The above code snippets show just some of the ways to preprocess text data. They are by no means the only ways to process unstructured text data.

Time Series Forecasting

In this particular course, we deal with data that has time stamps associated with it. We are going to see various tests and techniques for predicting future data based on past data.

First, let us see the syntax for Exponential Smoothing and understand how to code it in Python:

1. Simple Exponential Smoothing (SES)

from statsmodels.tsa import holtwinters as hw


build = hw.SimpleExpSmoothing(‘name of the time series’).fit() #for building the model

predict = build.forecast(steps=’for how long do you want to predict using this model’) #for predicting using the model built

Note: You can also select the value of ‘alpha’ manually while invoking the ‘.fit()’ function.
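
A minimal runnable sketch of SES, assuming a small toy monthly series created purely for illustration (the series name 'sales' and the alpha value are arbitrary choices for this example):

import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# toy monthly series used only for illustration
sales = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119],
                  index=pd.date_range('2020-01-31', periods=10, freq='M'))

ses_model = SimpleExpSmoothing(sales).fit(smoothing_level=0.2)   # alpha chosen manually
print(ses_model.forecast(steps=4))                               # forecast the next 4 periods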

2. Holt’s Exponential Smoothing


from statsmodels.tsa.holtwinters import Holt
build = Holt(‘name of the time series’).fit() #for building the model

predict = build.forecast(steps=’for how long do you want to predict using this model’) #for predicting using the model built

Note: You can also select the value of ‘alpha’ and ‘beta’ manually while invoking the ‘.fit()’ function.
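
A similar sketch for Holt's method, reusing the toy 'sales' series from the SES sketch above, with alpha and beta set manually:

from statsmodels.tsa.holtwinters import Holt

# 'smoothing_trend' (beta) was called 'smoothing_slope' in older statsmodels versions
holt_model = Holt(sales).fit(smoothing_level=0.2, smoothing_trend=0.1)
print(holt_model.forecast(steps=4))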

3. Holt-Winters Exponential Smoothing

from statsmodels.tsa.holtwinters import ExponentialSmoothing

build = ExponentialSmoothing(‘name of the time series’, trend=’additive’, seasonal=’additive’, seasonal_periods=’length of one seasonal cycle’).fit()
# for building the model. Here you have to mention the type of trend and seasonal components as you see fit for exponential
# smoothing. We have chosen additive, but do set these according to the data at hand. The seasonal_periods argument can be
# omitted if the seasonal period can be inferred from the frequency of the series' index.

predict = build.forecast(steps=’for how long do you want to predict using this model’) #for predicting using the model built

Note: You can also select the values of ‘alpha’, ‘beta’ and ‘gamma’ manually while invoking the ‘.fit()’ function.
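
A runnable sketch with a toy monthly series that has a trend and yearly seasonality (the series, its length and the additive choices are all assumptions made only for this example):

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# toy monthly series with an upward trend and yearly seasonality (36 months = 3 full cycles)
idx = pd.date_range('2018-01-31', periods=36, freq='M')
sales_monthly = pd.Series(100 + np.arange(36) + 10 * np.sin(2 * np.pi * np.arange(36) / 12), index=idx)

hw_model = ExponentialSmoothing(sales_monthly,
                                trend='additive',
                                seasonal='additive',
                                seasonal_periods=12).fit()
print(hw_model.forecast(steps=12))   # forecast one year ahead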

Statsmodels link for Exponential Smoothing: https://www.statsmodels.org/stable/examples/notebooks/generated/exponential_smoothing.html

Now let us check how to code the ARIMA family of models in Python, with the appropriate syntax:

4. To plot Auto Correlation Function (ACF) and Partial Auto Correlation Functions (PACF):

a. ACF Plot

from statsmodels.graphics.tsaplots import plot_acf


plot_acf(‘name of the time series’, ax=plt.gca())
plt.show()

b. PACF Plot
from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(‘name of the time series’, ax=plt.gca())
plt.show()

Note: You can also pass different parameters to plot_acf and plot_pacf to get the desired output.
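
A minimal sketch, reusing the toy sales_monthly series assumed in the Holt-Winters example above:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(sales_monthly, lags=24, ax=axes[0])    # autocorrelation up to 24 lags
plot_pacf(sales_monthly, lags=17, ax=axes[1])   # partial autocorrelation (lags must be less than half the sample size)
plt.show()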

5. To calculate the Moving Average (MA):

‘name of the time series’.rolling(window=’order of the moving average’).mean()
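
For example, a sketch of a 3-period trailing moving average on the same assumed toy series:

ma_3 = sales_monthly.rolling(window=3).mean()   # each value is the mean of the last 3 observations
print(ma_3.head())                              # the first two values are NaN (not enough history yet)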

6. To check for the stationarity of the series:

a. Augmented Dickey Fuller (ADF) Test

from statsmodels.tsa.stattools import adfuller


adfuller(‘name of the time series’)

Statsmodels link for the Augmented Dickey-Fuller Test: https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.adfuller.html
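
A sketch of how the ADF output is typically read, again on the assumed toy series; the p-value is the second element of the returned tuple:

from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, used_lags, n_obs, critical_values, _ = adfuller(sales_monthly)
print('ADF test statistic:', adf_stat)
print('p-value           :', p_value)
# a p-value above 0.05 means we cannot reject the null hypothesis that the series is non-stationary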

b. Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test


from statsmodels.tsa.stattools import kpss
kpss(‘name of the time series’)

Statsmodels link for the KPSS Test: https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.kpss.html
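
A similar sketch for the KPSS test; keep in mind that its null hypothesis is the opposite of the ADF test's:

from statsmodels.tsa.stattools import kpss

kpss_stat, p_value, n_lags, critical_values = kpss(sales_monthly)
print('KPSS statistic:', kpss_stat)
print('p-value       :', p_value)
# here the null hypothesis is that the series IS stationary, so a p-value below 0.05 suggests non-stationarity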

7. To fit a Seasonal Auto Regressive Integrated Moving Average (SARIMA) model using an auto-arima style grid search, which returns the model with the minimum Akaike Information Criterion (AIC) on the training data:

import itertools
import numpy as np
import statsmodels.api as sm

# Define the p, d and q parameters to take the values 0 and 1
p = d = q = range(0, 2)

# Generate all different combinations of p, d and q triplets
pdq = list(itertools.product(p, d, q))

# Generate all different combinations of seasonal p, d and q triplets
seasonal_pdq = [(x[0], x[1], x[2], ‘value of the seasonality in the SARIMA model’) for x in list(itertools.product(p, d, q))]

# Initializing the looping parameters
best_aic = np.inf
best_pdq = None
best_seasonal_pdq = None
temp_model = None

# Loop to perform the grid search
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            temp_model = sm.tsa.statespace.SARIMAX(‘name of the time series’,
                                                   order=param,
                                                   seasonal_order=param_seasonal,
                                                   enforce_stationarity=True)
            results = temp_model.fit()
            if results.aic < best_aic:
                best_aic = results.aic
                best_pdq = param
                best_seasonal_pdq = param_seasonal
        except:
            # print("Unexpected error:", sys.exc_info()[0])
            continue

print("Best SARIMA{}{} model - AIC:{}".format(best_pdq, best_seasonal_pdq, best_aic))

Now that we have the best parameters for the SARIMA model, let us build it.

import statsmodels.api as sm

best_model = sm.tsa.statespace.SARIMAX(‘name of the time series’,
                                       order=(p, d, q),              # the (p, d, q) values obtained from the loop above
                                       seasonal_order=(P, D, Q, m))  # the (P, D, Q) values from the loop above; m is the seasonal period

best_results = best_model.fit()   # building the model

best_results.forecast()           # predicting using the model built

# To check the diagnostics of the model built
best_results.plot_diagnostics(lags=‘desired lags to be specified’)
plt.show()

Note: You can also use different parameters in the SARIMAX function to get ARIMA, SARIMA or SARIMAX models as desired. Do refer to the SARIMAX documentation in the statespace submodule of the statsmodels library.

https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html
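
Putting it together, a minimal end-to-end sketch on the assumed toy sales_monthly series, with the orders fixed by hand purely for illustration (in practice they would come from the grid search above; on such a short series the fit may emit convergence warnings):

import statsmodels.api as sm
import matplotlib.pyplot as plt

model = sm.tsa.statespace.SARIMAX(sales_monthly,
                                  order=(1, 1, 1),
                                  seasonal_order=(1, 1, 1, 12),
                                  enforce_stationarity=True)
results = model.fit()

print(results.forecast(steps=12))                  # point forecasts for the next 12 months
results.plot_diagnostics(lags=12, figsize=(10, 8)) # residual diagnostics of the fitted model
plt.show()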

