NADAR SARASWATHI COLLEGE
OF ARTS AND SCIENCE
DATA SCIENCE & ANALYTICS
C.Murugeswari
II M.Sc Computer Science
Text Analysis:
Text analysis is the process of using computer systems to read and understand
human-written text for business insights. Text analysis software can independently classify, sort,
and extract information from text to identify patterns, relationships, sentiments, and other
actionable knowledge. You can use text analysis to efficiently and accurately process multiple
text-based sources such as emails, documents, social media content, and product reviews, like a
human would.
The steps involved in analyzing an unstructured text document are:
Language identification
Tokenization
Sentence breaking
Part of speech tagging
Chunking
Syntax parsing
Sentence chaining
1)Language identification:
The first step is to identify the language in which the text is written. Since each
language has its own rules of grammar, language identification is a prerequisite for every
other text analytics function. It's very important to know what language we will be dealing with.
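As a rough illustration of language identification, the sketch below scores a text by how many of its words appear in a small stop-word list for each language. The `STOPWORDS` lists and the `identify_language` helper are hypothetical and far too small for real use; production systems typically rely on much larger statistical profiles.

```python
# Tiny illustrative stop-word lists (assumption); real profiles are much larger.
STOPWORDS = {
    "english": {"the", "and", "is", "was", "of", "to", "in", "it"},
    "french": {"le", "la", "et", "est", "de", "un", "une", "les"},
    "spanish": {"el", "la", "y", "es", "de", "un", "una", "los"},
}

def identify_language(text):
    """Return the language whose stop-word list overlaps the text the most."""
    words = set(text.lower().split())
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(identify_language("The movie was fantastic and full of action"))
print(identify_language("Le film est une merveille et les acteurs brillent"))
```

Ties are resolved by dictionary order here, so very short texts can easily be misclassified.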
2)Tokenization:
Tokenization is language-specific: each language has its own requirements.
Most alphabetic languages use whitespace and punctuation to mark the tokens in a sentence.
Logographic languages, which are character-based (such as Chinese), use other systems.
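For a whitespace-and-punctuation language such as English, tokenization can be sketched with a single regular expression. This is a minimal example, not a full tokenizer; the `tokenize` helper is a name chosen here for illustration.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens and punctuation tokens."""
    # \w+(?:'\w+)? keeps contractions such as "didn't" in one piece;
    # [^\w\s] emits every punctuation mark as its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

print(tokenize("I didn't enjoy the film; it was too slow."))
```

The same pattern would not work for Chinese, where words are not separated by spaces.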
3)Sentence Breaking:
Once the tokens are identified, we can determine where the sentences
end. Small texts like tweets or statuses contain only a single sentence most of the
time. But longer documents require sentence breaking to separate each
statement. In some documents, each sentence is separated by a single punctuation
mark. But others contain punctuation marks that do not signal the end of a
statement, such as the period in an abbreviation.
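The difficulty with punctuation that does not end a statement can be illustrated with a small sketch. It assumes a hypothetical, deliberately short `ABBREVIATIONS` list and treats any other token ending in ".", "!" or "?" as a sentence boundary.

```python
# Hypothetical abbreviation list, kept short for illustration.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Break text into sentences, skipping periods that belong to abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith loved the film. Did you? I did!"))
```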
4)Part of speech tagging:
Part of Speech tagging (or PoS Tagging) is the process of determining the part
of speech of every token in a document, and then tagging it as such.
For example, we use PoS tagging to figure out whether a given token represents
a proper noun or a common noun, or if it's a verb, an adjective, or something else
entirely.
PoS tagging means assigning parts of speech to tokens.
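A very simple way to picture PoS tagging is a dictionary lookup with fallback rules. The sketch below assumes a hypothetical mini-lexicon (`LEXICON`) and uses Penn Treebank style tag names; real taggers are statistical and use the surrounding context, which this baseline ignores.

```python
# Hypothetical mini-lexicon; tag names follow the Penn Treebank style.
LEXICON = {
    "the": "DT", "a": "DT",
    "movie": "NN", "film": "NN", "story": "NN",
    "was": "VBD", "is": "VBZ",
    "fantastic": "JJ", "slow": "JJ", "brilliant": "JJ",
}

def pos_tag(tokens):
    """Tag each token by lexicon lookup, guessing a noun tag for unknown words."""
    tagged = []
    for token in tokens:
        if token.lower() in LEXICON:
            tag = LEXICON[token.lower()]
        elif token[0].isupper():
            tag = "NNP"  # capitalised unknown word: guess proper noun
        else:
            tag = "NN"   # any other unknown word: guess common noun
        tagged.append((token, tag))
    return tagged

print(pos_tag(["The", "movie", "was", "fantastic"]))
```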
5)Chunking:
Chunking, or light parsing, refers to a range of systems that
break a sentence into its component phrases.
Chunking is different from part of speech tagging in text analytics. PoS tagging
assigns parts of speech to tokens, whereas chunking assigns PoS-tagged tokens to
phrases.
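To show how chunking builds on PoS tags, the sketch below groups consecutive determiner, adjective and noun tokens into noun phrases. The `chunk_noun_phrases` helper and the hand-tagged input are assumptions made for illustration; it expects a list of (word, tag) pairs with Penn Treebank style tags.

```python
NP_TAGS = {"DT", "JJ", "NN", "NNP"}

def chunk_noun_phrases(tagged):
    """Group runs of determiner/adjective/noun tokens into noun phrases."""
    phrases, current = [], []
    # The sentinel pair at the end flushes the final run.
    for word, tag in tagged + [("", "END")]:
        if tag in NP_TAGS:
            current.append((word, tag))
            continue
        # Keep the run only if it actually contains a noun.
        if any(t.startswith("NN") for _, t in current):
            phrases.append(" ".join(w for w, _ in current))
        current = []
    return phrases

tagged = [("The", "DT"), ("movie", "NN"), ("had", "VBD"),
          ("a", "DT"), ("brilliant", "JJ"), ("story", "NN")]
print(chunk_noun_phrases(tagged))
```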
6)Syntax parsing:
Syntax parsing is the process of determining how a sentence is formed. It is a
critical step in sentiment analysis and other natural language processing features.
7)Sentence chaining:
The final step is sentence chaining. Sentence chaining links
individual sentences using each sentence's strength of association to an overall topic.
EXAMPLE OF TEXT ANALYSIS:
SENTIMENT ANALYSIS OF MOVIE REVIEWS
• DATA COLLECTION:
You have a dataset of movie reviews, each labeled with a sentiment (e.g., positive or
negative).
Ex:
i) "The movie was fantastic and full of action." (positive)
ii) "I didn't enjoy the film; it was too slow." (negative)
• PREPROCESSING:
Clean and prepare the text data:
lowercasing: convert all text to lowercase.
tokenization: break text into individual words.
stop-word removal: drop common words such as "the", "was", "and" and "of", as the example does.
Ex:
i) Original: "The movie was fantastic and full of action."
ii) Preprocessed: ["movie", "fantastic", "full", "action"]
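The preprocessing example above can be reproduced with a short sketch. The `STOP_WORDS` set is an assumption, chosen just large enough to match the example output; the slide does not specify which stop-word list is used.

```python
import re

# Assumed stop-word list, just enough to reproduce the example.
STOP_WORDS = {"the", "was", "and", "of", "a", "an", "it", "is", "i", "too"}

def preprocess(text):
    """Lowercase the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The movie was fantastic and full of action."))
```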
• FEATURE EXTRACTION:
Convert the text into numerical features:
Bag of words: create a matrix where each row is a review and each column is a word from
the corpus. The values are word counts or binary indicators.
Sample matrix:
Review ID   Movie   Fantastic   Full   Action   Enjoy   Experience
1           1       1           1      1        0       0
2           0       0           0      0        1       0
3           0       0           0      0        0       1
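A bag-of-words matrix like the sample can be built in a few lines. In this sketch the token lists in `docs` are assumed inputs chosen to mirror the sample matrix, and `bag_of_words` is a hypothetical helper that orders the vocabulary by first appearance.

```python
def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of tokenized documents."""
    vocab = []
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab.append(token)
    # One row per document, one column per vocabulary word.
    matrix = [[doc.count(term) for term in vocab] for doc in docs]
    return vocab, matrix

# Assumed preprocessed reviews, chosen to mirror the sample matrix.
docs = [["movie", "fantastic", "full", "action"], ["enjoy"], ["experience"]]
vocab, matrix = bag_of_words(docs)
print(vocab)
for review_id, row in enumerate(matrix, start=1):
    print(review_id, row)
```

Replacing `doc.count(term)` with `int(term in doc)` would give binary indicators instead of counts.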
• MODEL TRAINING:
Train a machine learning model using labeled data.
For instance, use a logistic regression model to classify sentiment:
i) Feature matrix (X): the matrix created from bag of words.
ii) Labels (Y): sentiment labels (positive, negative).
• PREDICTION:
Apply the trained model to new reviews to predict their sentiment.
Example prediction:
i) New review: "An excellent film with a brilliant story!"
ii) Prediction: Positive
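The training and prediction steps can be sketched with a small logistic regression written in plain Python. Everything concrete here is an assumption for illustration: the five-word vocabulary, the four training rows, the learning rate and the epoch count. In practice a library implementation would normally be used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights with stochastic gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            pred = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = pred - label  # gradient of the log loss w.r.t. the score
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, features):
    p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
    return "positive" if p >= 0.5 else "negative"

# Assumed vocabulary: fantastic, brilliant, excellent, slow, boring
X = [[1, 1, 0, 0, 0],   # "fantastic ... brilliant"
     [0, 0, 1, 0, 0],   # "excellent"
     [0, 0, 0, 1, 0],   # "slow"
     [0, 0, 0, 1, 1]]   # "slow ... boring"
y = [1, 1, 0, 0]        # 1 = positive, 0 = negative

weights, bias = train(X, y)
# "An excellent film with a brilliant story!" -> brilliant=1, excellent=1
new_review = [0, 1, 1, 0, 0]
print(predict(weights, bias, new_review))
```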
TF-IDF
• TF-IDF stands for Term Frequency-Inverse Document Frequency.
• It measures how relevant a word is to a document within a collection of documents (a corpus).
• The weight increases in proportion to the number of times the word appears in the document, but is
offset by how frequently the word occurs across the corpus (data set).
TF-IDF FORMULA
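One widely used form is tfidf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the share of document d's tokens that are term t, N is the number of documents, and df(t) is the number of documents containing t. Several variants exist, so the sketch below assumes this particular form; the `tf_idf` helper and the sample `docs` are illustrative.

```python
import math

def tf_idf(docs):
    """Return one {term: score} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        row = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            idf = math.log(n_docs / df[term])
            row[term] = tf * idf
        scores.append(row)
    return scores

docs = [["movie", "fantastic", "action"],
        ["movie", "slow"],
        ["movie", "brilliant", "story"]]
scores = tf_idf(docs)
print(scores[0])
```

Because "movie" occurs in every document, its inverse document frequency is log(1) = 0, so it carries no weight, while a word unique to one review such as "fantastic" scores highest there.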
THANK YOU!!....