Preprocessing Harvested Public Data Stream
From Twitter
Romulo L. Olalia, Jr.
Doctor in Information Technology
Graduate Programs, Technological Institute of the Philippines
Quezon City, Philippines
+63925-5286651
romulo.olalia@sancarloscollege.edu.ph
ABSTRACT
This paper introduces an algorithm for preprocessing harvested public data from Twitter for sentiment analysis. Different preprocessing techniques employed by different researchers were mapped and analyzed. Based on this analysis, an algorithm was created that can handle most preprocessing requirements and that can be used by future researchers on the topic of sentiment analysis, particularly on microblogs.

Keywords
Public data, twitter, microblog, preprocessing technique, preprocessing algorithm, sentiment analysis

1. INTRODUCTION
There is an increasing amount of information that people are sharing on the web (IDC, 2014). This information comes in the form of blogs, reviews, posts, social networks, forums, tweets, and others. This has resulted in an environment wherein everyone can express their thoughts, opinions, and sentiments regarding people, products, events, and beliefs, especially on microblogs (Caitlin, 2012). However, only a tiny portion of this colossal amount of public data has been explored for analytic value (Gantz and Reinsel, 2012). But before this information may even be analyzed, important steps in preparing it must be done, starting from the harvesting process at its source until it becomes trained data.

This paper aims to explore different processes in mining information, particularly on Twitter, using available APIs and third-party products. After harvesting public data, preprocessing of these data is introduced. Initially, the attributes of tweets are determined to give the researcher enough understanding of the processes to be done. Known researches tackling sentiment analysis are then reviewed and analyzed. Preprocessing procedures are determined and mapped so as to come up with consolidated preprocessing procedures that will be ideal for doing sentiment analysis on Twitter. Based on the result, an algorithm was created that can handle most preprocessing requirements. This algorithm is based on a consolidated summary of pre-processing techniques from popular and respected researches on sentiment analysis. It will be useful for researchers or developers who would like to explore natural language processing, and may also serve as a framework or guide for future researchers undertaking studies pertaining to sentiment analysis, particularly on Twitter.

Actual sentiment analysis will not be done in this study; rather, the focus is on the critical steps in preparing the data that will be used for sentiment analysis.

2. THEORETICAL FRAMEWORK
Twitter is the 8th most popular site on the web according to Alexa (Alexa, 2016). Microblogs are short posts or messages that people use to communicate. Around 500 million tweets are posted every day, with content ranging from politics, announcements, jokes, and opinions to news and more (Internet Live Stats, 2016). These tweets can be harvested and processed, giving the researcher an ample amount of information for domain requirements such as sentiment analysis.

Mining sentiments from natural language is challenging because it requires "deep understanding of explicit and implicit, regular and irregular, and syntactical and semantic language rules" (Cambria et al., 2013). While it is true that tweets contain only a maximum of 140 characters, people have a rather unusual way of writing them. People use acronyms, make spelling mistakes, and use emoticons and other symbols to express feelings (Agarwal et al., 2011) (Davies and Ghahramani, 2011) (Sharma and Vjas, 2011).

There are a number of researches that state the harvesting and preprocessing techniques they used in analyzing public data such as tweets. However, these processes are specifically selected to support the algorithms or approaches used in their respective
study; thus, a thorough study of the preprocessing techniques themselves has rarely been conducted.

In most studies, harvesting of tweets went through the Twitter APIs: Firehose, public streaming, and streaming (Pak and Paroubek, 2010) (Go et al., 2011) (Narr et al., 2012) (Davies and Ghahramani, 2011). These APIs offer varying features; regardless, they offer researchers a great way to harvest the needed corpus of tweets. Narr et al. (2012) and Argueta and Chen (2014) used a human-annotated multilingual sentiment dataset from tweets in four languages (English, German, French, and Portuguese). Their study attempted to identify the sentiments of tweets without depending on the language used, instead using emoticons as noisy labels. Moreover, Mohammad et al. (2013) used a combination of human-annotated training data and automatically generated data, which they found more effective than the latter alone. Agarwal et al. (2011) used only a human-annotated corpus because of its accuracy.

Different researchers collected data sets from sources other than Twitter. Kouloumpis et al. (2012) got their data from three sources: the Edinburgh Twitter corpus, emoticons from twittersentiment.appspot.com, and annotated data from iSieve Corp., while Go et al. (2011) created a custom web application to retrieve data. A publicly available data set from Choudhury was used by Wakade et al. (2011). Taboada et al. (2016) utilized Amazon Mechanical Turk and SO-CAL to retrieve their needed data. Mohammad et al. (2013) used Roget's Thesaurus to populate their training data by looking at the synonyms of words, including the Sentiment140 Lexicon.

The collected corpora also vary in volume. The amount of information retrieved in the works reviewed in this paper ranges from 11,875 (Agarwal, 2011) up to 800 million (Narr et al., 2012) tweets or similar information.

In order to effectively process and prepare a set of data from Twitter for sentiment analysis, it is expedient to know the attributes and characteristics of these tweets. Understanding these attributes gives hints on the appropriate preprocessing techniques that need to be applied. These attributes include:

Length The maximum length is 140 characters. The average calculated length is 14 words or 78 characters (Go et al., 2011), and a tweet is normally 1 sentence (Pak and Paroubek, 2010).

Language Model Tweets include words that cannot be found in the dictionary, slang, and/or misspellings. Language used in Twitter "is often informal and differs from traditional grammar" (Han and Baldwin, 2011) and is more casual (Go et al., 2011).

Retweeted Tweets can be retweeted by users. The letters "RT" signify that a given tweet has been retweeted.

Contains Emoticons and Special Characters Emoticons are normally found in tweets and suggest sentiments. The research of Pak and Paroubek (2010) assumes that emoticons suggest the overall sentiment of the entire message.

Contains Intensifiers Abbreviations such as WTF, BRB, and LOL are included in tweets, as are character repetitions such as "Happppy".

Contains User Names Users often direct posts to other users using the symbol @ before the user name.

Contains HTML Markup Sometimes, messages include markup tags such as <br>.

Others Other content includes doubled words, numbers, URLs, and varying cases (upper case and lower case).

In their study, Go et al. (2011) treated emoticons as "noisy labels", while Argueta and Chen (2014) used sentiment-bearing hashtags in the same way. Emoticons were stripped from their training data because they had a negative impact on the accuracy of their employed algorithm and would go against the objective of their study. Messages containing both positive and negative emoticons were removed by Kouloumpis et al. (2012) in their study. However, Pak and Paroubek (2010) and Davies and Ghahramani (2011) relied on emoticons in defining their training data, as did Narr et al. (2012).

Stop words were not removed in the study of Maas et al. (2011) because they argued that these words are indicative of sentiments. Stemming was not applied either, because their model "learns similar representations for words of the same stem when the data suggest it". Additionally, non-words such as ! or emoticons were not removed.

Not much filtering was done in the work of Mohammad et al. (2013) either. Aside from building n-grams from the word to the character level, they retained elongated words like "sooo", emoticons, punctuation, stopwords, usernames, URLs, and capital letters, but performed tokenization.
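The tweet attributes enumerated above (retweet markers, user names, HTML markup, URLs, hashtags, elongated words, numbers) can be detected mechanically. As a rough sketch, and one not taken from any of the reviewed studies, simple regular expressions suffice for most of them; the patterns below are illustrative assumptions:

```python
import re

# Illustrative patterns for the tweet attributes discussed above.
# These are simplified assumptions, not patterns from the reviewed studies.
ATTRIBUTE_PATTERNS = {
    "retweet": re.compile(r"\bRT\b"),          # retweet marker
    "user_name": re.compile(r"@\w+"),          # @-mentions
    "html_markup": re.compile(r"<[^>]+>"),     # tags such as <br>
    "url": re.compile(r"https?://\S+"),        # links
    "hashtag": re.compile(r"#\w+"),            # hash tags
    "elongation": re.compile(r"(\w)\1{2,}"),   # e.g. "Happppy"
    "number": re.compile(r"\d+"),              # digits
}

def tweet_attributes(tweet):
    """Return the set of attribute names present in a tweet."""
    return {name for name, pat in ATTRIBUTE_PATTERNS.items() if pat.search(tweet)}

sample = "RT @user Happppy 2016! <br> see http://example.com #Change"
print(sorted(tweet_attributes(sample)))
# -> ['elongation', 'hashtag', 'html_markup', 'number', 'retweet', 'url', 'user_name']
```

Checks like these can guide which preprocessing steps a given corpus actually needs before any are applied wholesale.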
There are many pre-processing techniques for harvested public data. The researches reviewed in this paper offer a variety of such techniques with varying levels of difficulty and customization. While some of them are widely used, some are novel. Hemalatha, et al. (2012) created their own algorithm (SES) for processing data in preparation for the calculation of sentiment.

The paper of Kannan and Gurusamy (2014) discusses pre-processing techniques and their corresponding issues for text mining. These include tokenization, stop word removal, and stemming. The challenges in data preparation enumerated in the study of Kharde and Sonawane (2016) show a direct relationship between these challenges and the tweet attributes. Bao, et al. (2014) evaluated the effects of these attributes on sentiment analysis. Their study shows that sentiment classification accuracy rises when URL removal, negation transformation, and normalization of repeated characters are employed, and decreases when stemming and lemmatization are applied. Moreover, another pre-processing technique was carried out by Davies and Ghahramani (2011) which proves to have a point: in their study, they removed spam accounts, corporate accounts, and bots. They argued that the number of these accounts exceeds that of legitimate ones, which will dramatically affect the sentiment analysis. Research shows that a user with more than 1000 followers, or who follows more than 1000 users, may indicate such an account. Also, 29% of the content in tweets was either rumor or fake (Gupta, et al., 2013).

In the succeeding section of this paper, a mapping of the preprocessing techniques done by different researchers is presented.

3. OPERATIONAL FRAMEWORK

Collecting a dataset from Twitter can be done in two ways: the Search API and the Streaming API. The Search API provides the ability to perform specific, low-throughput queries. It is limited to searching only the preceding 5 days and to 10 tweets per minute (Davies and Ghahramani, 2011). The Streaming API allows access to an unlimited, live stream of tweets. Between the two, using the Streaming API will ensure researchers of recent tweets and quick collection of a substantial number of tweets.

To determine the most effective combination of pre-processing techniques that can be utilized in sentiment analysis, it is expedient to map researchers' employed pre-processing approaches, as shown in Table 1. This gives us an idea of how these researchers carried out their studies and brought out impressive results, even though some researchers do not openly state their processes.

The researches reviewed in this paper use combinations of the following:
a. Removal of URLs
b. Removal of user names
c. Removal of emoticons
d. Removal of HTML mark-up
e. Removal of numbers
f. Removal of stop words
g. Removal of retweets
h. Conversion to lowercase / uppercase
i. Substitution of repeated text
j. Tokenization
k. Porter stemming / lemmatization
l. Spam account removal

In Table 1, a blank entry does not mean that the process was or was not done; it only means that the research did not openly state the use of that preprocessing technique. Blank entries are therefore ignored for simplicity of interpretation.

Columns that contain only checks (√), such as removal of HTML markup tags, removal of numbers, removal of retweets, removal of spam accounts, conversion of tweets to all lowercase or uppercase, tokenization, and stemming, can be interpreted as generally accepted regardless of the number of checks, since no researcher in the list specifically opposed them. Other columns display a combination of marks and need further investigation as to why.

Mohammad, et al. and Agarwal, et al. did not remove URLs but rather replaced them with the prefix http:// and with the constant tag ||U||, respectively. This was done only to normalize the tweets and does not directly affect the sentiment analysis. Additionally, Agarwal also replaced user names with ||U||. These cases can be easily dismissed, and removal of URLs and user names can be included among the preprocessing techniques.

The table also shows that some columns have a combination of √s and Xs, which is worth noting. These are the removal of emoticons, removal of stop words, removal of hashtags, and substitution of repeated text.

Looking through these researches, these preprocessing techniques were intentionally not carried out because the authors believe that these features indicate sentiment. Emoticons, although not all, can indicate sentiment: an emoticon such as :) denotes happiness, while :( may indicate sadness.
Table 1. Mapping of pre-processing procedures done by different researchers for sentiment analysis

Columns, in order: Removal of URL | User Name | Emoticons | Numbers | HTML Mark-Up | Stop Words | Hash Tag | Retweets | Spam Account | Conversion to Lowercase / Uppercase | Substitution of repeated text | Tokenization | Stemming / Lemmatizing

Pak and Paroubek:       √ √ √ √ √ √
Kouloumpis, et al.:     √ √ X √ √ √ √ √
Go, et al.:             √ √ √ √ √
Narr, et al.:           √ √ X √ √ √
Maas:                   X X
Mohammad, et al.:       X X √ X √ X √
Agarwal, et al.:        X X X √ √
Sharma and Vjas:        √ √ √ √ √ √ √
Davies and Ghahramani:  √ √ √

Legend: √ – done in research | X – not done in research; blank entries (techniques not openly stated) are omitted.
It is also interesting to note that the researchers who use emoticons for sentiment analysis are those who focused their studies on multilingual or language-independent domains.

Maas did not remove stop words (e.g., "not") because he argued that stop words may also indicate sentiment: the word "happy" with a "not" before it negates the sentiment of being happy. Hashtags such as #terrible also indicate emotions, which is why Mohammad, et al. did not remove them in their research. Moreover, they argued that there is a certain degree of difference between words such as "angry" and "angryyyyyyyy", which is why elongated words were not removed from their study either. For many problems, it makes sense to remove punctuation; on the other hand, it is possible that "!!!" or ":-(" could carry sentiment and should be treated as words.

A general rule suggests that if a certain attribute of a Twitter message will be used, that attribute should not be removed. It is also necessary to think about the problem the data is meant to solve before applying a preprocessing technique. Thus, the following logic has been employed:

tweet_x = { w : w ∈ tweet_x , attr_c ∈|∉ tweet_x }

where tweet_x is the tweet corpus after normalizing x times, and attr_c is the attribute or attributes of corpus c to be removed or retained.

Following this, the algorithm below was used in the actual preprocessing of tweets. It is designed specifically for the Python language and combines actual Python code with pseudo statements.
import re  # imports needed libraries
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from replacers import RepeatReplacer  # user-defined repeated-character replacer

# Configuration flags stand in for the pseudo conditions of the original sketch
REMOVE_WHOLE_HASHTAG = False   # drop the whole #hashtag token
REMOVE_HASH_SYMBOL = True      # drop only the # symbol
REMOVE_STOPWORDS = True
REMOVE_EMOTICONS = True
REMOVE_PUNCTUATION = True
REPLACE_REPEATED = True
APPLY_STEMMING = True
APPLY_LEMMATIZATION = True

def is_acceptable(tweet, followers, following):
    # Tweets are filtered by keyword upstream; this gate removes
    # probable spam accounts and retweets before tweets.txt is written
    return followers < 1000 and following < 1000 and 'RT' not in tweet

f = open('tweets.txt')
raw = f.read()
ptwt = re.sub('[0-9]', '', raw)  # remove numbers
ptwt = ptwt.lower()              # convert to lowercase

if REMOVE_WHOLE_HASHTAG:
    ptwt = [w for w in ptwt.split() if not re.search('#.+', w)]  # remove hashtags
elif REMOVE_HASH_SYMBOL:
    ptwt = re.sub('#', '', ptwt).split()  # remove the # symbol only
else:
    ptwt = ptwt.split()

ptwt = [w for w in ptwt if not re.search('<.+>', w)]       # remove HTML tags
ptwt = [w for w in ptwt if not re.search('@.+', w)]        # remove usernames
ptwt = [w for w in ptwt if not re.search('^(http).+', w)]  # remove URLs

if REMOVE_STOPWORDS:
    ptwt = [w for w in ptwt if w not in stopwords.words('english')]  # remove stop words

if REMOVE_EMOTICONS:
    emot = open('emoticons.txt').read().split()  # remove emoticons
    ptwt = [w for w in ptwt if w not in emot]

if REMOVE_PUNCTUATION:
    ptwt = re.sub('[^a-zA-Z ]', '', ' '.join(ptwt)).split()  # remove non-alpha & non-space

if REPLACE_REPEATED:
    replacer = RepeatReplacer()  # replace repeated characters
    ptwt = [replacer.replace(w) for w in ptwt]

if APPLY_STEMMING:
    stemmer = SnowballStemmer('english')  # stem using the English stemmer
    ptwt = [stemmer.stem(w) for w in ptwt]

if APPLY_LEMMATIZATION:
    lemmatizer = WordNetLemmatizer()  # lemmatize each token
    ptwt = [lemmatizer.lemmatize(w) for w in ptwt]

ptwt = ' '.join(ptwt)
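The listing above depends on NLTK corpora and a user-defined replacer. As a minimal, fully runnable distillation of the regular-expression steps alone, the following sketch uses only the standard library; the stop word and emoticon lists are small illustrative stand-ins for the NLTK and Wikipedia corpora, and the spam gate reuses the 1000-account heuristic quoted earlier:

```python
import re

STOP_WORDS = {"is", "it", "the", "a"}   # stand-in for NLTK's stopwords corpus
EMOTICONS = {":)", ":(", '(",)'}        # stand-in for the Wikipedia emoticon list

def is_probable_spam(followers, following):
    # Heuristic from the reviewed literature: accounts with more than 1000
    # followers or followings may be spam, corporate, or bot accounts.
    return followers > 1000 or following > 1000

def preprocess(tweet, keep_hash_symbol=False):
    """Apply the regular-expression portion of the preprocessing steps."""
    t = re.sub(r"[0-9]", "", tweet)       # remove numbers
    t = t.lower()                         # convert to lowercase
    t = re.sub(r"<[^>]+>", " ", t)        # remove HTML tags
    tokens = t.split()
    tokens = [w for w in tokens if not w.startswith("@")]     # remove usernames
    tokens = [w for w in tokens if not w.startswith("http")]  # remove URLs
    if not keep_hash_symbol:
        tokens = [re.sub("#", "", w) for w in tokens]         # remove the # symbol
    tokens = [w for w in tokens if w not in STOP_WORDS]       # remove stop words
    tokens = [w for w in tokens if w not in EMOTICONS]        # remove emoticons
    return " ".join(tokens)

print(preprocess("@Duterte President is really good. Keep it up. "
                 "#Change :) http://fb.com GOD BLESS!!! 2016"))
# -> president really good. keep up. change god bless!!!
```

Punctuation handling, stemming, lemmatization, and repeated-character replacement are deliberately left to the NLTK-based listing above, since those steps depend on external corpora.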
To see the effect of the mapping, these preprocessing techniques were applied to actual tweets. A sample tweet was gathered using Twitter's Streaming API. Using a custom web application, tweets were filtered using the keyword "Duterte". Spam accounts and retweets were also removed.

To carry out these preprocessing procedures, the Natural Language Toolkit (NLTK) was used together with Python. Regular expressions were also employed to remove characters or words that are unnecessary, such as numbers, hash symbols / hashtags, HTML tags, URLs, Twitter user names, and punctuation. In removing stop words, the Stopwords corpus from NLTK was imported and compared against the tweet, removing any occurrence of stop words. The same process was applied in removing emoticons; the emoticon corpus was taken from a list of emoticons provided by Wikipedia.

In stemming words, the Snowball stemmer built into NLTK was utilized, and the WordNet Lemmatizer was used to lemmatize the tweet. Repeated characters, as in "happyyyyy", were replaced with a valid word using a user-defined function applying a combination of regular expressions. However, a word like "happyyyyy", when treated naively, becomes "hapy", which is wrong. To address this problem, a WordNet lookup was done to recognize the word. Therefore, the WordNet corpus was imported, and replacement of characters stops once the word is located in the corpus. Table 2 shows a sample result of the test on a single tweet.
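The repeated-character replacement just described can be sketched as a small recursive replacer. Here a tiny hard-coded lexicon stands in for the WordNet lookup, an assumption made only to keep the example self-contained; the back-reference pattern removes one duplicated letter per pass and stops as soon as the word is recognized, which prevents "happyyyyy" from collapsing past "happy" into "hapy":

```python
import re

KNOWN_WORDS = {"happy", "love", "so"}   # stand-in for a WordNet lookup

_repeat = re.compile(r"(\w*)(\w)\2(\w*)")

def replace_repeats(word):
    """Collapse repeated characters until the word appears in the lexicon."""
    if word in KNOWN_WORDS:
        return word                       # stop once the word is recognized
    shorter = _repeat.sub(r"\1\2\3", word)
    if shorter == word:                   # nothing left to collapse
        return word
    return replace_repeats(shorter)

print(replace_repeats("happyyyyy"))       # -> happy
print(replace_repeats("looove"))          # -> love
```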
Table 2: Result of the test after applying the algorithm

(Original):
@Duterte President is really good. Keep it up. #Change :) http://fb.com Looove it! <br/> GOD BLESS!!! 2016 (",)

Removal of numbers:
@Duterte President is really good. Keep it up. #Change :) http://fb.com Looove it! <br/> GOD BLESS!!! (",)\n

Conversion to lowercase:
@duterte president is really good. keep it up. #change :) http://fb.com looove it! <br/> god bless!!! (",)\n

Removal of hashtag:
@duterte president is really good. keep it up. change :) http://fb.com looove it! <br/> god bless!!! (",)\n

Tokenization:
['@duterte', 'president', 'is', 'really', 'good.', 'keep', 'it', 'up.', 'change', ':)', 'http://fb.com', 'looove', 'it!', '<br/>', 'god', 'bless!!!', '(",)']

Removal of HTML tags:
['@duterte', 'president', 'is', 'really', 'good.', 'keep', 'it', 'up.', 'change', ':)', 'http://fb.com', 'looove', 'it!', 'god', 'bless!!!', '(",)']

Removal of usernames:
['president', 'is', 'really', 'good.', 'keep', 'it', 'up.', 'change', ':)', 'http://fb.com', 'looove', 'it!', 'god', 'bless!!!', '(",)']

Removal of URLs:
['president', 'is', 'really', 'good.', 'keep', 'it', 'up.', 'change', ':)', 'looove', 'it!', 'god', 'bless!!!', '(",)']

Removal of stop words:
['president', 'really', 'good.', 'keep', 'up.', 'change', ':)', 'looove', 'it!', 'god', 'bless!!!', '(",)']

Removal of emoticons:
['president', 'really', 'good.', 'keep', 'up.', 'change', 'looove', 'it!', 'god', 'bless!!!']

Removal of non-alpha & non-space:
president really good keep up change looove it god bless

Stemming and lemmatizing:
president really good keep up change looove it god bless

Replacing of repeated characters:
president really good keep up change love it god bless
The first column shows the different preprocessing techniques used to normalize the tweet. The second column is the processed tweet as it passes through each preprocessing technique. It is assumed that the tweet passes through every possible process in the algorithm except the removal of the whole hashtag. The result will vary depending on the requirements of the study and the composition of the tweet itself. Nonetheless, this shows that the algorithm is applicable for normalizing tweets for sentiment analysis.

4. CONCLUSION
Different researches have been mapped to find patterns that can help in the creation of a process that can be utilized in studying sentiment analysis. With the results of the testing, the findings show that the algorithm was effective in normalizing tweets in preparation for creating a corpus for sentiment analysis.

5. RECOMMENDATION
It is recommended that other attributes not covered in this study be explored by other researchers, such as substitution of acronyms like LOL. Domains other than microblogs can also be explored, such as movie and product reviews, politics, health, the sciences, and more. Further testing can also be done regarding the effectiveness of this algorithm. Language constraints can also be considered as a new area of study.

6. REFERENCES
A. Agarwal, et al. "Sentiment Analysis for Twitter Data". Proceedings of the Workshop on Language in Social Media. 2011.

Alexa. "The Top 500 Sites on the Web". (http://www.alexa.com/topsites) [retrieved: July 2016]

C. Argueta and Y. Chen. "Multi-Lingual Sentiment Analysis of Social Data Based on Emotion-Bearing Patterns". Proceedings of the Second Workshop on
Natural Language Processing for Social Media, pages 38-43. Dublin, Ireland. 2014.

Y. Bao, et al. "The Role of Pre-processing in Twitter Sentiment Analysis". 10th International Conference, ICIC 2014, Taiyuan, China, August 3-6, 2014.

B. Caitlin. "Embracing Twitter as a Research Tool – Social Information Research". Emerald Group Publishing, Library and Information Science, Vol. 5, p. 132. 2012.

E. Cambria, et al. "New Avenues in Opinion Mining and Sentiment Analysis". IEEE Computer Society. 2013.

A. Davies and Z. Ghahramani. "Language-independent Bayesian Sentiment Mining on Twitter". Proceedings of the 5th SNA-KDD Workshop '11. San Diego, California. August 2011.

J. Gantz and D. Reinsel. "The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East". IDC. Massachusetts, December 2012.

A. Go, et al. "Twitter Sentiment Classification using Distant Supervision". Stanford University. (http://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf) [retrieved: July 2016]

A. Gupta. "$1.00 per RT #BostonMarathon #PrayForBoston: Analyzing Fake Content on Twitter". IBM Research Lab. Delhi, India, 2013.

B. Han and T. Baldwin. "Lexical Normalisation of Short Text Messages: Makn Sens a #twitter". Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2011.

I. Hemalatha, et al. "Preprocessing the Informal Text for Efficient Sentiment Analysis". International Journal of Emerging Trends and Technology in Computer Science, Vol. 1, Issue 2, July-August 2012.

Internet Live Stats. "Twitter Usage Statistics". (http://www.internetlivestats.com/twitter-statistics/) [retrieved: July 2016]

S. Kannan and V. Gurusamy. "Preprocessing Techniques for Text Mining". Proceedings of ResearchGate Conference, October 2014.

V. Kharde and S. Sonawane. "Sentiment Analysis of Twitter Data: A Survey of Techniques". International Journal of Computer Applications, Vol. 139, No. 11, April 2016.

E. Kouloumpis, et al. "Twitter Sentiment Analysis: The Good the Bad and the OMG!". Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media. 2011.

A. Maas, et al. "Learning Word Vectors for Sentiment Analysis". Stanford University. (http://ai.stanford.edu/~ang/papers/acl11-WordVectorsSentimentAnalysis.pdf) [retrieved: July 2016]

S. Mohammad, et al. "NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets". (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.310.7022&rep=rep1&type=pdf) [retrieved: July 2016]

S. Narr, et al. "Language-Independent Twitter Sentiment Analysis". Berlin, Germany. 2012. (http://www.dai-labor.de/fileadmin/files/publications/narr-twittersentiment-KDML-LWA-2012.pdf) [retrieved: July 2016]

A. Pak and P. Paroubek. "Twitter as a Corpus for Sentiment Analysis and Opinion Mining". Cedex, France. (http://www.lrec-conf.org/proceedings/lrec2010/pdf/385_Paper.pdf) [retrieved: July 2016]

G. Potdar and R. Phursule. "Twitter Blogs Mining using Supervised Algorithm". International Journal of Computer Applications, Vol. 126, No. 5, September 2015.

J. Sharma and A. Vjas. "Twitter Sentiment Analysis". India. (http://www.cse.iitk.ac.in/users/cs365/2012/submissions/jaysha/cs365/projects/) [retrieved: July 2016]

M. Taboada, et al. "Lexicon-Based Methods for Sentiment Analysis". (http://www.aclweb.org/anthology/J/J11/J11-2001.pdf) [retrieved: July 2016]

V. Turner. "The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things". IDC. Massachusetts, April 2014.

S. Wakade, et al. "Text Mining for Sentiment Analysis of Twitter Data". Proceedings of The 2012 World Congress in Computer Science, Computer Engineering, and Applied Computing, 2012.