Text Representation Techniques Guide
Chapter (3) with Answers

1- List the four categories of text representation techniques?


Text representation: we are given a piece of text and asked to find a scheme to represent it
mathematically.

These approaches are classified into four categories:


1- Basic vectorization approaches
2- Distributed representations
3- Universal language representation
4- Handcrafted features
2- Describe the concept of vector space models?
In order for ML algorithms to work with text data, the text must first be converted into some mathematical
form. Text units (characters, phonemes, words, phrases, sentences, paragraphs, and documents) are represented
with vectors of numbers. This is known as the vector space model (VSM).

VSM → It’s a mathematical model that represents text units as vectors


3- Use “D1: Dog bites man, D2: Man bites dog, D3: Dog eats meat, and D4: Man eats food” as an
input; find their representation using one-hot encoding, bag of words, bag of N-grams, and TF-IDF?
One-Hot Encoding
First, map each of the six words to a unique ID: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6.
Consider the document D1: “dog bites man”. As per the scheme, each word is a six-dimensional
vector. Dog is represented as [1 0 0 0 0 0], as the word “dog” is mapped to ID 1. Bites
is represented as [0 1 0 0 0 0], and so on and so forth.
Thus, D1 is represented as [[1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]]. D4 (“man eats food”) is represented as
[[0 0 1 0 0 0] [0 0 0 0 0 1] [0 0 0 0 1 0]].
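A minimal Python sketch of this scheme (the word-to-ID mapping below is the one assumed in the answer):

vocab = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}

def one_hot(word):
    # 6-dimensional vector with a 1 at the position of the word's ID
    vec = [0] * len(vocab)
    vec[vocab[word] - 1] = 1
    return vec

def one_hot_document(doc):
    # a document is a list of one-hot word vectors
    return [one_hot(w) for w in doc.lower().split()]

print(one_hot_document("Dog bites man"))  # [[1,0,0,0,0,0], [0,1,0,0,0,0], [0,0,1,0,0,0]]
print(one_hot_document("Man eats food"))  # [[0,0,1,0,0,0], [0,0,0,0,0,1], [0,0,0,0,1,0]]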

Bag of Words
D1 becomes [1 1 1 0 0 0]. This is because the first three words in the vocabulary appeared exactly once
in D1, and the last three did not appear at all. D4 becomes [0 0 1 0 1 1].
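A short sketch of the same bag-of-words representation using scikit-learn's CountVectorizer (assuming scikit-learn is available; its vocabulary is ordered alphabetically, so the column order differs from the dog/bites/man ordering above):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
count_vec = CountVectorizer()
bow = count_vec.fit_transform(docs)

print(count_vec.get_feature_names_out())  # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(bow.toarray())                      # one row of word counts per document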

Bag of N-Grams
With n=2, the vocabulary consists of the eight bigrams in the corpus: {dog bites, bites man, man bites,
bites dog, dog eats, eats meat, man eats, eats food}.
D1: [1,1,0,0,0,0,0,0],
D2: [0,0,1,1,0,0,0,0]. The other two documents follow similarly. Note that the BoW
scheme is a special case of the BoN scheme, with n=1. n=2 is called a “bigram model,”
and n=3 is called a “trigram model.”
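The bigram model can be sketched with the same CountVectorizer by setting ngram_range=(2, 2):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
bigram_vec = CountVectorizer(ngram_range=(2, 2))  # bigrams only
bon = bigram_vec.fit_transform(docs)

print(bigram_vec.get_feature_names_out())  # the eight bigrams in the corpus
print(bon.toarray())                       # D1 and D2 now get different vectors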

TF-IDF
The above schemes treat all words as equally important. TF-IDF weighs each word by its term frequency in the
document (TF) and its inverse document frequency across the corpus (IDF), so common words such as “dog” and
“man”, which appear in several documents, get lower weights than words such as “meat” and “food”, which appear
in only one document.
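A sketch of the TF-IDF representation using scikit-learn's TfidfVectorizer (its default smoothed IDF means the exact numbers may differ from a hand computation, but rarer words still receive the higher weights):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

print(tfidf_vec.get_feature_names_out())
print(tfidf.toarray().round(2))  # one TF-IDF-weighted vector per document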

4- Explain the difference between (a) distributional similarity and distributional hypothesis, and (b)
distributional representation and distributed representation?
- distributional similarity → This is the idea that the meaning of a word can be understood from the context
in which the word appears. This is also known as connotation: meaning is defined by context.
- distributional hypothesis → This hypothesizes that words that occur in similar contexts have similar meanings.
- distributional representation → This refers to representation schemes that are obtained from the
distribution of words over the contexts in which they appear.
- distributed representation → The vectors in a distributional representation are very high dimensional, which
makes them computationally inefficient and hampers learning. To alleviate this, distributed representation
schemes significantly compress the dimensionality.
5- Describe the word embedding concept with an example of its use?
A word embedding is a learned representation for text where words that have the same meaning have a
similar representation.

Example → If we’re given the word “USA,” distributionally similar words could be other countries (e.g.,
Canada, Germany, India, etc.) or cities in the USA. If we’re given the word “beautiful,” words that share some
relationship with this word (e.g., synonyms, antonyms) could be considered distributionally similar words. These
are words that are likely to occur in similar contexts.
“Word2vec,” based on “distributional similarity,” can capture word analogy relationships such as King - Man + Woman ≈ Queen.
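A sketch of querying pre-trained embeddings with gensim (assumes gensim is installed; "word2vec-google-news-300" is one of the models exposed by its downloader and is roughly 1.6 GB):

import gensim.downloader as api

model = api.load("word2vec-google-news-300")

print(model.most_similar("beautiful", topn=5))       # distributionally similar words
print(model.most_similar(positive=["king", "woman"],
                         negative=["man"], topn=1))  # analogy: king - man + woman ≈ queen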
6- Explain with an example the two architectural variants of Word2vec: CBOW and SkipGram?
CBOW → In CBOW (continuous bag of words), the primary task is to build a language model that correctly
predicts the center word given the context words. For example, in “dog bites man” with a window of one,
CBOW predicts “bites” from {dog, man}.
SkipGram → SkipGram is very similar to CBOW, with some minor changes. In SkipGram, the task is reversed:
predict the context words from the center word, i.e., predict {dog, man} given “bites”.
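Both variants can be trained with gensim on a toy corpus; sg=0 selects CBOW and sg=1 selects SkipGram (the corpus and hyperparameters here are only illustrative):

from gensim.models import Word2Vec

corpus = [["dog", "bites", "man"], ["man", "bites", "dog"],
          ["dog", "eats", "meat"], ["man", "eats", "food"]]

cbow_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["dog"][:5])      # first few dimensions of the CBOW vector
print(skipgram_model.wv["dog"][:5])  # first few dimensions of the SkipGram vector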

7- How can the OOV (out-of-vocabulary) problem be solved?


A simple approach is to create vectors for OOV words that are initialized randomly, with each component
between -0.25 and +0.25. There are also other approaches that handle the OOV problem by modifying the training
process to bring in characters and other subword-level units: such models can handle OOV words by using subword
information, such as morphological properties (e.g., prefixes, suffixes, word endings), or by using character
representations.
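A sketch of the random-initialization idea: any word missing from the embedding model gets a vector with components drawn uniformly from [-0.25, +0.25] (the 300-dimensional toy model here is only illustrative):

import numpy as np

rng = np.random.default_rng(0)

def get_vector(word, model, dim=300):
    # return the stored embedding, or a small random vector for OOV words
    if word in model:
        return model[word]
    return rng.uniform(-0.25, 0.25, dim)

toy_model = {"dog": np.zeros(300)}            # stands in for a real embedding model
print(get_vector("dog", toy_model)[:3])       # known word: its stored vector
print(get_vector("platypus", toy_model)[:3])  # OOV word: random vector in [-0.25, 0.25]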
8- What is the difference between Doc2vec and Word2vec?
Word2vec → learns representations for individual words; to represent a text, we aggregate its word vectors.
fastText → learns representations for character n-grams.

Doc2vec → allows us to directly learn representations for texts of arbitrary length (phrases,
sentences, paragraphs, and documents) by taking the context of the words in the text into account.
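A minimal Doc2vec sketch with gensim: each tagged document gets its own learned vector, and infer_vector produces a representation for new text of arbitrary length (hyperparameters are illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
print(d2v.infer_vector("man eats meat".split())[:5])  # vector for an unseen document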
9- What are the important aspects to keep in mind while using word embeddings?
All text representations are inherently biased based on what they saw in training data. We still need ways to
encode specific aspects of text, such as the relationships between the sentences in it. Pre-trained embeddings
are generally large files (several gigabytes), which may pose problems in certain deployment scenarios.
10- How can high-dimensional data be represented visually?
t-SNE, or t-distributed Stochastic Neighbor Embedding, is a technique used for visualizing high-dimensional
data such as embeddings by reducing it to two- or three-dimensional data.
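A sketch of projecting embeddings to two dimensions with scikit-learn's TSNE (the random 100-dimensional vectors below just stand in for real word embeddings):

import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.rand(50, 100)  # 50 "words", 100 dimensions each
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
points_2d = tsne.fit_transform(embeddings)

print(points_2d.shape)  # (50, 2): one x/y point per word, ready for a scatter plot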
11- With an example, explain the use of handcrafted feature representations?

TextEvaluator → software developed by Educational Testing Service (ETS). The goal of this tool is to provide
support to teachers and educators in choosing grade-appropriate reading materials for students and in
identifying sources of comprehension difficulty in texts.

Measures such as “syntactic complexity” and “concreteness” cannot be calculated by simply converting text
into BoW or embedding representations. They have to be designed manually, keeping in mind both domain
knowledge and the ML algorithms used to train the NLP models. This is why we call these handcrafted feature
representations.
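A small sketch of what handcrafted features look like in practice; these particular measures are illustrative and are not the features TextEvaluator computes:

def handcrafted_features(text):
    # simple manually designed measures of a text's difficulty
    words = text.split()
    sentences = [s for s in text.split(".") if s.strip()]
    return {
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
    }

print(handcrafted_features("Dog bites man. Man eats food."))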
