MODULE 4-Text Analytics.pptx

MODULE 4 : Text Analytics
CSC601.4 Analyze Text data and gain insights.

CONTENTS
● Text Mining
○ History of text mining
○ Roots of text mining
○ Overview of seven practices of text analytics
○ Application and use cases for Text mining:
■ Extracting meaning from unstructured text
■ Summarizing Text.
● Text Analysis
○ Text Analysis Steps
○ A Text Analysis Example
○ Collecting Raw Text
○ Representing Text
○ Term Frequency—Inverse Document Frequency (TFIDF)
○ Categorizing Documents by Topics
○ Determining Sentiments
○ Gaining Insights

Text Mining
● Text mining is the process of evaluating large amounts of textual data to
produce meaningful information, and of converting unstructured text data
into structured data for further analysis and visualization.
● Text mining helps to identify unnoticed facts, relationships, and assertions
in large volumes of textual data.
● Two Python libraries commonly used for basic text mining are textblob and
wordcloud.

Text Data
● Before mining text, we need to understand the text data, for example by
determining the number of words in the document.
● We first need to load the data from different sources, including text files (.txt),
PDFs (.pdf), CSV files (.csv), etc.

Example Data Sources and Formats for Text Analysis

Text Pre-Processing
● Text Pre-Processing is an important phase before applying any
algorithms on text data.
● Data cleaning means removing noise such as punctuation, extra spaces,
etc.
● The objective of text pre-processing is to clean the data so that independent
terms can be created from the data file for further analysis.
● After the textual data has been loaded into the environment, it needs to be
cleaned with measures such as transforming the text to lowercase and
removing URLs, non-English words, punctuation, whitespace, etc.
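
The cleaning measures listed above can be sketched with the Python standard library alone. This is a minimal illustration, not a complete pre-processing pipeline: the function name clean_text and the sample string are assumptions made for this example, and removing non-English words is left out because it needs a dictionary.

```python
import re
import string

def clean_text(text):
    """Lowercase the text and strip URLs, punctuation, and extra whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    # drop ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

raw = "Check   THIS out: https://example.com/page !!  Great phone, really."
print(clean_text(raw))  # check this out great phone really
```

URLs are removed before punctuation so that the URL pattern can still recognize the "://" part.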

Shallow Parsing
● Tokenization is the process of breaking down a text paragraph into smaller chunks
such as words or sentences.
● A token is a single entity that is a building block of a sentence or paragraph.
● A sentence tokenizer breaks a text paragraph into sentences, while a word tokenizer
breaks it into words.
● The process of classifying words into their parts of speech and labelling them
accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts
of speech are also known as word classes or lexical categories.
● The collection of tags used for a particular task is known as a tagset.
● The emphasis in this section is on exploiting tags, and tagging text automatically.
● A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches
a part of speech tag to each word.

Stop words
● Text may contain stop words such as is, am, are, this, a, an,
the, etc.
● These stop words are considered as noise in the text and
hence should be removed.
● Before doing analysis of text data, we should filter out the list
of tokens from these stop words.
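
Stop-word filtering can be sketched in a few lines. The stop-word set below contains only the examples named above and is an assumption for illustration; real lists are much longer.

```python
# Illustrative stop-word set built from the examples above (not a complete list)
STOP_WORDS = {"is", "am", "are", "this", "a", "an", "the"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "This is a review of the phone".split()
print(remove_stop_words(tokens))  # ['review', 'of', 'phone']
```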

Stemming and Lemmatizing
● Stemming and lemmatization address another type of noise in the text: they
reduce derivationally related forms of a word to a common root word.
● Stemming is the process of gathering words of similar origin into one word.
Stemming helps us to increase accuracy in our mined text by removing suffixes
and reducing words to their basic forms. For example, words like detection,
detected, and detecting are reduced to the common word "detect".
● Lemmatization is usually more sophisticated than stemming and also reduces
words to their base form. But a lemmatizer, unlike a stemmer, works on an
individual word with knowledge of the context. For example, the word "better"
has "good" as its lemma, but this is missed by stemming because it requires a
dictionary look-up.

Word Cloud
● A word cloud presents the words of a text visually for greater impact.
● The word cloud is created with the wordcloud library. In the word cloud, the size of
each word depends on its frequency.

Sentiment Analysis
● Sentiment Analysis is also popularly known as opinion analysis or opinion mining.
The key idea is to use techniques from text analytics, NLP, machine learning and
linguistics to extract important information or data points from unstructured text.
● Sentiment analysis is an application of natural language processing (NLP), the
field that deals with interaction between computers and humans using natural
language. Sentiment analysis provides a way to understand the attitudes and
opinions expressed in texts.
● Sentiment polarity is typically a numeric score which is assigned to both the
positive and negative aspects of a text document based on subjective parameters
like specific words and phrases expressing feelings and emotion. Neutral sentiment
typically has 0 polarity since it does not express any specific sentiment, positive
sentiment will have polarity > 0 and negative < 0.

Applications of Natural Language Processing
● With the advent of new technologies, there has been a massive growth in the
availability of text data. As a result, natural language processing has many
applications that can contribute significantly to an organization's success.
Examples: understanding customer behavior through Twitter data, developing
recommendation systems, and cluster analysis of customer data on the basis
of reviews. This section focuses on different applications of natural language
processing.

Analyzing Twitter Data
Twitter is a social networking site where people communicate in short messages
called tweets. Tweeting means posting short messages to the people who follow
you on Twitter, with the intention that the messages might be helpful for
making a decision.

Document Similarity
Document similarity is a powerful technique used to recommend products/services,
videos, movies, etc. Examples of document similarity include e-commerce
websites recommending products, Amazon Prime and Netflix recommending movies
and shows, and YouTube recommending videos. Recommendations for a
product/service can be made according to pre-defined criteria such as number of
buyers, budget, rating, popularity, manufacturer, description, etc.

Cluster Analysis
Cluster analysis can be done on text data after feature extraction has been
performed on the data using a vectorizer. This section performs cluster analysis
on the above data and groups different movies into clusters on the basis of the
information stored in the TFIDF matrix built during feature extraction.

Text Analysis Steps
A text analysis problem usually consists of three important steps: parsing,
search and retrieval, and text mining.
A text analysis problem may also consist of other subtasks, such as discourse
and segmentation.

Parsing is the process that takes unstructured text and imposes a structure for further analysis. The
unstructured text could be a plain text file, a weblog, an Extensible Markup Language (XML) file, a HyperText
Markup Language (HTML) file, or a Word document. Parsing deconstructs the provided text and renders
it in a more structured way for the subsequent steps.
Search and retrieval is the identification of the documents in a corpus that contain search items such
as specific words, phrases, topics, or entities like people or organizations. These search items are generally
called key terms. Search and retrieval originated from the field of library science and is now used
extensively by web search engines.

Text mining uses the terms and indexes produced by the prior two steps to
discover meaningful insights pertaining to domains or problems of interest.

Part-of-Speech (POS) Tagging, Lemmatization, and
Stemming
The goal of POS tagging is to build a model whose input is a sentence, such
as:
he saw a fox
and whose output is a tag sequence. Each tag marks the POS for the
corresponding word, such as:
PRP VBD DT NN
according to the Penn Treebank POS tags. Therefore, the four words are
mapped to pronoun (personal), verb (past tense), determiner, and noun
(singular), respectively.

Both lemmatization and stemming are techniques to reduce the number of dimensions and reduce
inflections or variant forms to the base form to more accurately measure the number of times each
word appears. With the use of a given dictionary, lemmatization finds the correct dictionary base form
of a word.
For example, given the sentence:
obesity causes many problems
the output of lemmatization would be:
obesity cause many problem
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Stemming
Different from lemmatization, stemming does not need a dictionary, and it usually
refers to a crude process of stripping affixes based on a set of heuristics with the
hope of correctly achieving the goal to reduce inflections or variant forms.
After the process, words are stripped to become stems. A stem is not necessarily
an actual word defined in the natural language, but it is sufficient to differentiate
itself from the stems of other words. A well-known rule-based stemming algorithm is
Porter's stemming algorithm. It defines a set of production rules to iteratively
transform words into their stems. For the sentence shown previously:
obesity causes many problems
the output of Porter's stemming algorithm is:
obes caus mani problem

http://www.infogistics.com/posdemo.htm

https://www.link.cs.cmu.edu/cgi-bin/link/construct-page-4.cgi#submit
https://textanalysisonline.com/nltk-pos-tagging

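import nltk  # word_tokenize may require the NLTK tokenizer data to be downloaded first
a = "Sample Text"  # the text whose word distribution is wanted
words = nltk.tokenize.word_tokenize(a)  # split the text into a list of word tokens
fd = nltk.FreqDist(words)  # count how often each token occurs
fd.plot()  # plot the frequency distribution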

Explanation of code:
1. Import the nltk module.
2. Write the text whose word distribution you need to find.
3. Tokenize the text into words; the resulting list serves as input to nltk.FreqDist.
4. Pass the list of words to nltk.FreqDist.
5. Plot the word frequencies using plot().

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
sentence = "Hello, You have to build a very good site and I love visiting your site."
words = word_tokenize(sentence)
ps = PorterStemmer()
for w in words:
    rootWord = ps.stem(w)
    print(rootWord)

hello
,
you
have
to
build
a
veri
good
site
and
i
love
visit
your
site
.

● The class PorterStemmer is imported from the module nltk.stem.
● Functions for tokenizing sentences and words are imported.
● A sentence is written, which is tokenized in the next step.
● The sentence is tokenized into words in this step.
● An object of PorterStemmer is created here.
● A loop is run and each word is stemmed using the object created in code line 5.
http://text-processing.com/demo/stem/

Text Analysis Example
Consider the fictitious company ACME, maker of two products: bPhone and bEbook. ACME is in
strong competition with other companies that manufacture and sell similar products. To succeed,
ACME needs to produce excellent phones and eBook readers and increase sales. One of the ways
the company does this is to monitor what is being said about ACME products in social media. In other
words, what is the buzz on its products? ACME wants to search all that is said about ACME products
in social media sites, such as Twitter and Facebook, and popular review sites, such as Amazon and
ConsumerReports. It wants to answer questions such as these.
• Are people mentioning its products?
• What is being said? Are the products seen as good or bad? If people think an ACME product is bad,
why?
For example, are they complaining about the battery life of the bPhone, or the response time
in their bEbook?
They want to monitor the social media buzz using a simple process based on the following steps:

1. Collect raw text - This corresponds to Phase 1 and Phase 2 of the Data
Analytic Lifecycle.
2. Represent text - Convert each review into a suitable document representation
with proper indices, and build a corpus based on these indexed reviews. This
step corresponds to Phases 2 and 3 of the Data Analytic Lifecycle.
3. Compute the usefulness of each word in the reviews using methods such as
TFIDF. This and the following two steps correspond to Phases 3 through 5 of
the Data Analytic Lifecycle.
4. Categorize documents by topics. This can be achieved through topic models
(such as latent Dirichlet allocation).

5. Determine sentiments of the reviews. Identify whether the reviews are positive or negative.
Many product review sites provide ratings of a product with each review. If such information is
not available, techniques like sentiment analysis can be used on the textual data to infer the
underlying sentiments. People can express many emotions. To keep the process simple,
ACME considers sentiments as positive, neutral, or negative.
6. Review the results and gain greater insights - This step corresponds to Phase 5 and 6 of the
Data Analytic Lifecycle. Marketing gathers the results from the previous steps. Find out
what exactly makes people love or hate a product. Use one or more visualization techniques
to report the findings. Test the soundness of the conclusions and operationalize the findings if
applicable.

Collecting Raw Text
The Data Science team starts by actively monitoring various websites for user-generated
content. The user-generated content being collected could be related articles from news portals and
blogs, comments on ACME's products from online shops or review sites, or social media posts that contain
the keywords bPhone or bEbook. Regardless of where the data comes from, it's likely that the team would
deal with semi-structured data such as HTML web pages, Really Simple Syndication (RSS) feeds, XML, or
JavaScript Object Notation (JSON) files. Enough structure needs to be imposed to find the part of the raw
text that the team really cares about. In the brand management example, ACME is interested in what the
reviews say about bPhone or bEbook and when the reviews are posted. Therefore, the team will actively
collect such information.

Many websites and services offer public APIs for third-party developers to
access their data.
For example, the Twitter API allows developers to choose from the Streaming
API or the REST API to retrieve public Twitter posts that contain the keywords
bPhone or bEbook.
Developers can also read tweets in real time from a specific user or tweets
posted near a specific venue. The fetched tweets are in the JSON format.

Many news portals and blogs provide data feeds that are in an open standard
format, such as RSS or XML.

Representing Text
In this data representation step, raw text is first transformed with text normalization techniques
such as tokenization and case folding.
Then it is represented in a more structured way for analysis.
Tokenization is the task of separating (also called tokenizing) words from the body of text.
Raw text is converted into collections of tokens after the tokenization, where each token is
generally a word.
A common approach is tokenizing on spaces.
state-of-art

Representing Text
Another text normalization technique is called case folding, which reduces all letters to
lowercase (or the opposite if applicable).
One needs to be cautious applying case folding to tasks such as information extraction,
sentiment analysis, and machine translation.
If implemented incorrectly, case folding may reduce or change the meaning of the text and
create additional noise.
For example, when General Motors becomes general and motors, the downstream analysis
may very likely consider them as separate words rather than the name of a company.
When the abbreviation of the World Health Organization WHO or the rock band The Who
become who, they may both be interpreted as the pronoun who.

Representing Text
If case folding must be present, one way to reduce such problems is to create a
lookup table of words not to be case folded.
The team can come up with some heuristics or rules-based strategies for the
case folding.
For example, the program can be taught to ignore words that have uppercase
in the middle of a sentence.
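
A lookup table of words that must not be case folded can be sketched as a set. The entries in PROTECTED and the function name fold_case are hypothetical and chosen only to mirror the General Motors and WHO examples.

```python
# Hypothetical lookup table of tokens that should keep their original case
PROTECTED = {"WHO", "General", "Motors"}

def fold_case(tokens, protected=PROTECTED):
    """Lowercase every token except those listed in the lookup table."""
    return [t if t in protected else t.lower() for t in tokens]

tokens = "The WHO praised General Motors".split()
print(fold_case(tokens))  # ['the', 'WHO', 'praised', 'General', 'Motors']
```

The lookup is exact-match, so the pronoun "who" in lowercase is unaffected while the abbreviation WHO survives.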

Representing Text
After normalizing the text by tokenization and case folding, it needs to be
represented in a more structured way.
A simple yet widely used approach to represent text is called bag-of-words.
Given a document, bag-of-words represents the document as a set of terms,
ignoring information such as order, context, inferences, and discourse.
Each word is considered a term or token (which is often the smallest unit for the
analysis).
In many cases, bag-of-words additionally assumes every term in the document is
independent.

Representing Text
The document then becomes a vector with one dimension for every distinct
term in the space, and the terms are unordered.
The permutation D* of a document D contains the same words exactly the
same number of times but in a different order.
Therefore, using the bag-of-words representation, document D and its
permutation D* would share the same representation.

Representing Text
Bag-of-words takes quite a naive approach, as order plays an important role in the semantics of text.
With bag-of-words, many texts with different meanings are combined into one form.
For example, the texts
"a dog bites a man"
and
"a man bites a dog"
have very different meanings, but they would share the same representation with bag-of-words.
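
The loss of word order can be checked directly. The sketch below uses collections.Counter as the bag and splits on spaces; the function name bag_of_words is an assumption for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as unordered term counts, tokenizing on spaces."""
    return Counter(text.lower().split())

d1 = bag_of_words("a dog bites a man")
d2 = bag_of_words("a man bites a dog")

print(d1["a"], d1["dog"], d1["man"])  # 2 1 1
print(d1 == d2)  # True: both sentences share one representation
```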

Representing Text
Using single words as identifiers with the bag-of-words representation, the term
frequency (TF) of each word can be calculated.
Term frequency represents the weight of each term in a document, and it is
proportional to the number of occurrences of the term in that document.

Representing Text
Besides extracting the terms, their morphological features may need to be
included.
The morphological features specify additional information about the terms,
which may include root words, affixes, part-of-speech tags, named entities, or
intonation (variations of spoken pitch).
The features from this step contribute to the downstream analysis in
classification or sentiment analysis.

Representing Text
The set of features that need to be extracted and stored highly depends on the
specific task to be performed.
If the task is to label and distinguish the part of speech, for example, the features
will include all the words in the text and their corresponding part-of-speech tags.
If the task is to annotate the named entities like names and organizations, the
features highlight such information appearing in the text.
Constructing the features is no trivial task; quite often this is done entirely
manually, and sometimes it requires domain expertise.

Representing Text
Sometimes creating features is a text analysis task all to itself.
One such example is topic modeling.
Topic modeling provides a way to quickly analyze large volumes of raw text and identify the
latent topics.
Topic modeling may not require the documents to be labeled or annotated.
It can discover topics directly from an analysis of the raw text. A topic consists of a cluster of
words that frequently occur together and that share the same theme.
Probabilistic topic modeling is a suite of algorithms that aim to parse large archives of
documents and discover and annotate the topics.

Representing Text
It is important not only to create a representation of a document but also to
create a representation of a corpus.
A corpus is a collection of documents.
A corpus could be so large that it includes all the documents in one or more
languages, or it could be smaller or limited to a specific domain, such as
technology, medicine, or law.
For a web search engine, the entire World Wide Web is the relevant corpus.
Most corpora are much smaller. The Brown Corpus

Representing Text
Many corpora focus on specific domains.
For example, the BioCreative corpora are from biology, the Switchboard corpus contains
telephone conversations, and the European Parliament Proceedings Parallel Corpus was
extracted from the proceedings of the European Parliament in 21 European languages.
Most corpora come with metadata, such as the size of the corpus and the domains from which
the text is extracted.
Some corpora (such as the Brown Corpus) include the information content of every word
appearing in the text.

Representing Text
Information content (IC) is a metric to denote the importance of a term in a corpus.
The conventional way of measuring the IC of a term is to combine the knowledge of its
hierarchical
structure from an ontology with statistics on its actual usage in text derived from a corpus.
Terms with higher IC values are considered more important than terms with lower IC values.
For example, the word necklace generally has a higher IC value than the word jewelry in an
English corpus because jewelry is more general and is likely to appear more often than
necklace.
IC can help measure the semantic similarity of terms. Such measures do not require an
annotated corpus, and they generally achieve strong correlations with human judgment.

Term Frequency-Inverse Document Frequency (TFIDF)
TFIDF is a measure widely used in information retrieval and text analysis.
Instead of using a traditional corpus as a knowledge base,
TFIDF works directly on top of the fetched documents and treats these
documents as the "corpus."
TFIDF is robust and efficient on dynamic content, because document changes
require only the update of frequency counts.

To understand how the term frequency is computed, consider a bag-of-words
vector space of 10 words:
i, love, acme, my, bebook, bphone, fantastic, slow, terrible, and terrific.
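
A term frequency vector over this 10-word space can be sketched as follows. The sample sentence is an assumption chosen for illustration, and tokenization is simply lowercasing and splitting on spaces.

```python
# The 10-word vector space listed above, in the same order
VOCAB = ["i", "love", "acme", "my", "bebook", "bphone",
         "fantastic", "slow", "terrible", "terrific"]

def term_frequency_vector(text, vocab=VOCAB):
    """Return the raw count of each vocabulary term in the text."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]

# Hypothetical sample text
print(term_frequency_vector("I love LOVE my bPhone"))
# [1, 2, 0, 1, 0, 1, 0, 0, 0, 0]
```

Each position in the vector corresponds to one vocabulary word, so "love" (position 2) gets a count of 2 after case folding.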

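Because word-frequency distributions often have a long tail, the logarithm can
be applied to the term frequency, for example:
$$TF_2(t, d) = \log\left[TF_1(t, d) + 1\right]$$
where $TF_1(t, d)$ is the raw count of term t in document d.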

Because longer documents contain more terms, they tend to have higher term
frequency values.
They also tend to contain more distinct terms.
These factors can conspire to raise the term frequency values of longer
documents and lead to undesirable bias favoring longer documents.
To address this problem, the term frequency can be normalized. For example,
the term frequency of term t in document d can be normalized based on the
number of terms in d:
$$TF_3(t, d) = \frac{TF_1(t, d)}{n}$$
where $TF_1(t, d)$ is the raw count of t in d and n is the number of terms in d.

A term frequency vector can become very high dimensional because the bag-
of-words vector space can grow substantially to include all the words in
English.
The high dimensionality makes it difficult to store and parse the text and
contributes to performance issues related to text analysis.

For the purpose of reducing dimensionality, not all the words from a given language
need to be included in the term frequency vector. In English, for example, it is
common to remove words such as
the, a, of, and, to, and other articles that are not likely to contribute to semantic
understanding.
These common words are called stop words.
Lists of stop words are available in various languages for automating the
identification of stop words. Among them is the Snowball stop word list, which
contains stop words in more than ten languages.

Term Frequency-Inverse Document Frequency
(TFIDF)
Another simple yet effective way to reduce dimensionality is to store a term and
its frequency only if the term appears at least once in a document.
Any term not existing in the term frequency vector by default will have a
frequency of 0.
Therefore, the previous term frequency vector would be simplified to what is
shown in Table

Some NLP techniques such as lemmatization and stemming can also reduce high dimensionality.
Lemmatization and stemming are two different techniques that combine various forms of a word.
With these techniques, words such as play, plays, played, and playing can be mapped to the same term.
It has been shown that the term frequency is based on the raw count of a term occurring in a
stand-alone document.
Term frequency by itself suffers a critical problem: It regards that stand-alone document
as the entire world.
The importance of a term is solely based on its presence in this particular document.

Stop words such as the, and, and a could be inappropriately considered the
most important because they have the highest frequencies in every document.
For example, the top three most frequent words in Shakespeare's Hamlet are
all stop words (the, and, and of).
Besides stop words, words that are more general in meaning tend to appear
more often, thus having higher term frequencies.
In an article about consumer telecommunications, the word phone would be
likely to receive a high term frequency.

As a result, the important keywords such as bPhone and bEbook and their
related words could appear to be less important.
Consider a search engine that responds to a search query and fetches relevant
documents.
Using term frequency alone, the search engine would not properly assess how
relevant each document is in relation to the search query.

A quick fix for the problem is to introduce an additional variable that has a
broader view of the world considering the importance of a term not only in a
single document but in a collection of documents, or in a corpus.
The additional variable should reduce the effect of the term frequency as the
term appears in more documents. That is the intention of the inverse
document frequency (IDF).

The IDF inversely corresponds to the document frequency (DF), which is
defined to be the number of documents in the corpus that contain a term.
Let a corpus D contain N documents. The document frequency of a term t in
corpus $D = \{d_1, d_2, \ldots, d_N\}$ is defined as:
$$DF(t) = \left|\{\, d \in D : t \in d \,\}\right|$$

The inverse document frequency of a term t is obtained by dividing N by the
document frequency of the term and then taking the logarithm of that quotient:
$$IDF_1(t) = \log \frac{N}{DF(t)}$$

If the term is not in the corpus, it leads to a division-by-zero. A quick fix is to
add 1 to the denominator:
$$IDF_2(t) = \log \frac{N}{DF(t) + 1}$$
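
The document frequency and the smoothed IDF can be sketched as below. Combining them into a TFIDF score as raw term frequency multiplied by IDF is the usual definition and is an assumption here, since the product is not spelled out above; the four-document corpus is also made up for illustration.

```python
import math

def document_frequency(term, corpus):
    """Number of documents in the corpus that contain the term."""
    return sum(1 for doc in corpus if term in doc)

def idf(term, corpus):
    """Smoothed inverse document frequency: log(N / (DF + 1))."""
    return math.log(len(corpus) / (document_frequency(term, corpus) + 1))

def tfidf(term, doc, corpus):
    """Assumed combination: raw term frequency times smoothed IDF."""
    return doc.count(term) * idf(term, corpus)

# Hypothetical corpus of four tokenized reviews
corpus = [
    "i love my bphone".split(),
    "my bebook is slow".split(),
    "i love acme".split(),
    "the bphone is terrific".split(),
]

doc = corpus[1]
for term in ("my", "bebook"):
    print(term, round(tfidf(term, doc, corpus), 3))
```

In this corpus "bebook" appears in one document and "my" in two, so "bebook" receives the higher score even though both occur once in the review. One side effect of adding 1 to the denominator is that a term present in every document gets a slightly negative IDF.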

Categorizing Documents by Topics
A topic consists of a cluster of words that frequently occur together and share the same theme.
The topics of a document are not as straightforward as they might initially appear. Consider these two
reviews:
1. The bPhone5x has coverage everywhere. It's much less flaky than my old bPhone4G.
2. While I love ACME's bPhone series, I've been quite disappointed by the bEbook.
The text is illegible, and it makes even my old NBook look blazingly fast.
Is the first review about bPhone5x or bPhone4G? Is the second review about bPhone, bEbook, or
NBook?
For machines, these questions can be difficult to answer.

If a review is talking about bPhone5x, the term bPhone5x and related terms
(such as phone and ACME) are likely to appear frequently.
A document typically consists of multiple themes running through the text in
different proportions:
for example, 30% on a topic related to phones, 15% on a topic related to
appearance, 10% on a topic related to shipping, 5% on a topic related to
service, and so on.

Document grouping can be achieved with clustering methods such as k-means
clustering or classification methods such as support vector machines,
k-nearest neighbors, or naive Bayes.
However, a more feasible and prevalent approach is to use topic modeling.
Topic modeling provides tools to automatically organize, search, understand,
and summarize from vast amounts of information.

Topic models are statistical models that examine words from a set of
documents, determine the themes over the text, and discover how the themes
are associated or change over time.
The process of topic modeling can be simplified to the following.
1. Uncover the hidden topical patterns within a corpus.
2. Annotate documents according to these topics.
3. Use annotations to organize, search, and summarize texts.

A topic is formally defined as a distribution over a fixed vocabulary of words.
Different topics would have different distributions over the same vocabulary.
A topic can be viewed as a cluster of words with related meanings, and each word
has a corresponding weight inside this topic.
Note that a word from the vocabulary can reside in multiple topics with different
weights.
Topic models do not necessarily require prior knowledge of the texts.
The topics can emerge solely based on analyzing the text.

The simplest topic model is latent Dirichlet allocation (LDA), a generative
probabilistic model of a corpus proposed by David M. Blei and two other
researchers.
In generative probabilistic modeling, data is treated as the result of a generative
process that includes hidden variables.
LDA assumes that there is a fixed vocabulary of words, and the number of the
latent topics is predefined and remains constant.
LDA assumes that each latent topic follows a Dirichlet distribution over the
vocabulary, and each document is represented as a random mixture of latent
topics.

The figure illustrates the intuitions behind LDA.

The left side of the figure shows four topics built from a corpus, where each topic contains a list
of the most important words from the vocabulary.
The four example topics are related to problem, policy, neural, and report.
For each document, a distribution over the topics is chosen, as shown in the histogram on the
right.
Next, a topic assignment is picked for each word in the document, and the word from the
corresponding topic (colored discs) is chosen.
In reality, only the documents (as shown in the middle of the figure) are available. The goal of
LDA is to infer the underlying topics, topic proportions, and topic assignments for every
document.
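
The generative story behind LDA can be simulated in a few lines. This sketch only runs the forward process (pick topic proportions, then a topic and a word for each position); it does not perform the inference that LDA actually solves. The two topics, their word weights, and the alpha values are hypothetical, and the topics are fixed by hand here rather than drawn from a Dirichlet distribution.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical topics: each is a distribution over a tiny fixed vocabulary
TOPICS = {
    "phone":    {"bphone": 0.5, "battery": 0.3, "screen": 0.2},
    "shipping": {"delivery": 0.6, "late": 0.3, "box": 0.1},
}

def generate_document(n_words, alpha, rng):
    """Generate a document as a random mixture of the latent topics."""
    names = list(TOPICS)
    proportions = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=proportions)[0]  # topic assignment
        dist = TOPICS[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return proportions, words

rng = random.Random(0)
proportions, words = generate_document(8, alpha=[0.5, 0.5], rng=rng)
print(proportions)
print(words)
```

Inference runs this story in reverse: given only the words, it estimates the topics, the proportions, and the per-word assignments.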

Topic models can be used in document modeling, document classification, and
collaborative filtering.
Topic models can be applied not only to textual data; they can also help
annotate images.
Just as a document can be considered a collection of topics, images can be
considered a collection of image features.

Determining Sentiments
Sentiment analysis refers to a group of tasks that use statistics and natural
language processing to mine opinions to identify and extract subjective information
from texts.
Early work on sentiment analysis focused on detecting the polarity of product
reviews from Epinions and movie reviews from the Internet Movie Database (IMDb)
at the document level.
Later work handles sentiment analysis at the sentence level. More recently, the
focus has shifted to phrase-level and short-text forms in response to the popularity
of micro-blogging services such as Twitter.

One can manually construct lists of words with positive sentiments (such as
brilliant, awesome, and spectacular) and negative sentiments (such as awful,
stupid, and hideous).
Related work has pointed out that such an approach can be expected to achieve
accuracy around 60%, and it is likely to be outperformed by examination of
corpus statistics.
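
The word-list approach can be sketched as a simple count of positive minus negative words. The two lists hold only the example words given above, and the function names and scoring rule are assumptions for illustration.

```python
# Word lists taken from the examples above; real lexicons are far larger
POSITIVE = {"brilliant", "awesome", "spectacular"}
NEGATIVE = {"awful", "stupid", "hideous"}

def polarity(text):
    """Positive word count minus negative word count."""
    tokens = text.lower().split()
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    return positives - negatives

def label(text):
    """Map the polarity score to positive, negative, or neutral."""
    score = polarity(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label("the screen is brilliant and awesome"))  # positive
print(label("an awful hideous case"))                # negative
print(label("it arrived on tuesday"))                # neutral
```

A scorer like this ignores negation ("not awesome") and sarcasm, which is one reason such lists are outperformed by classifiers trained on corpus statistics.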

Classification methods such as naive Bayes, maximum entropy (MaxEnt), and
support vector machines (SVM) are often used to extract corpus statistics for
sentiment analysis.
Related research has found out that these classifiers can score around 80%
accuracy on sentiment analysis over unstructured data.
One or more of such classifiers can be applied to unstructured data, such
as movie reviews or even tweets.

The movie review corpus by Pang et al. includes 2,000 movie reviews collected from an IMDb
archive of the rec.arts.movies.reviews newsgroup.
These movie reviews have been manually tagged into 1,000 positive reviews and 1,000
negative reviews.
Depending on the classifier, the data may need to be split into training and testing sets.
A rule of thumb for splitting data is to produce a training set much bigger
than the testing set.
For example, an 80/20 split would produce 80% of the data as the training set and 20% as the
testing set.
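
An 80/20 split can be sketched with a seeded shuffle. The 2,000 placeholder reviews below are synthetic stand-ins for a tagged corpus, and the function name train_test_split is an assumption for this example, not a reference to any particular library.

```python
import random

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle the items and cut them into training and testing sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Synthetic placeholder data: 1,000 positive and 1,000 negative reviews
reviews = [(f"review {i}", "pos" if i % 2 == 0 else "neg") for i in range(2000)]
train, test = train_test_split(reviews)
print(len(train), len(test))  # 1600 400
```

Shuffling before cutting matters when the data is ordered by label; otherwise one set could end up with only positive or only negative reviews.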

One or more classifiers are trained over the training set to learn the
characteristics or patterns residing in the data.
The sentiment tags in the testing data are hidden away from the classifiers.
After the training, classifiers are tested over the testing set to infer the
sentiment tags.
Finally, the result is compared against the original sentiment tags to evaluate
the overall performance of the classifier.

A confusion matrix is a specific table layout that allows visualization of the
performance of a model over the testing set.
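
A confusion matrix for two sentiment labels can be built by counting (actual, predicted) pairs. The six tags below are made up for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs over the testing set."""
    return Counter(zip(actual, predicted))

# Hypothetical sentiment tags for six test reviews
actual    = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg"]

cm = confusion_matrix(actual, predicted)
labels = ["pos", "neg"]
print("actual/predicted", *labels)
for a in labels:
    print(a, *[cm[(a, p)] for p in labels])
```

Rows are the actual tags and columns the predicted tags, so the diagonal holds the correctly classified reviews.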

Precision and recall are two measures commonly used to evaluate tasks
related to text analysis.
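Precision and recall are defined as:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
where TP, FP, and FN are the numbers of true positives, false positives, and
false negatives.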

Precision is defined as the percentage of documents in the results that are
relevant. If by entering keyword bPhone, the search engine returns 100
documents, and 70 of them are relevant, the precision of the search engine
result is 70/100 = 0.7.
Recall is the percentage of returned documents among all relevant documents
in the corpus. If by entering keyword bPhone, the search engine returns 100
documents, only 70 of which are relevant while failing to return 10 additional,
relevant documents, the recall is 70/ (70+ 10) = 0.875.
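
The two worked numbers above can be reproduced directly from the counts in the search example: 70 relevant documents returned, 30 irrelevant documents returned, and 10 relevant documents missed. The function and variable names are assumptions for this sketch.

```python
def precision(tp, fp):
    """Fraction of returned documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of all relevant documents that were returned."""
    return tp / (tp + fn)

tp, fp, fn = 70, 30, 10  # counts from the bPhone search example
print(precision(tp, fp))  # 0.7
print(recall(tp, fn))     # 0.875
```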

Precision and recall are important concepts, whether the task is about
information retrieval of a search engine or text analysis over a finite corpus.
A good classifier ideally should achieve both precision and recall close to 1.0.

Gaining Insights
Corresponding to the data collection phase, the Data Science team has used
bPhone as the keyword to collect more than 300 reviews from a popular
technical review website.
The 300 reviews are visualized as a word cloud after removing stop words. A
word cloud (or tag cloud) is a visual representation of textual data.
Tags are generally single words, and the importance of each word is shown
with font size or color.

IMPORTANT QUESTIONS
1. What are the main challenges of text analysis?
2. What is a corpus?
3. What are common words (such as a, and, of) called?
4. Why can't we use TF alone to measure the usefulness of the words?
5. What is a caveat of IDF? How does TFIDF address the problem?
6. Name three benefits of using the TFIDF.
7. What methods can be used for sentiment analysis?
8. What is the definition of topic in topic models?
9. Explain the trade-offs for precision and recall.
