Chapter Two
Text Operations
Statistical Properties of Text
How is the frequency of different words distributed?
How fast does vocabulary size grow with the size of a corpus?
◦ Such factors affect the performance of an IR system and can be used to select
suitable term weights and other aspects of the system.
A few words are very common.
◦ The two most frequent words (e.g. “the”, “of”) can account for about 10% of word
occurrences.
Statistical Properties of Text (continued)
Most words are very rare.
◦ About half the words in a corpus appear only once; such words are called
hapax legomena (Greek for “read only once”)
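Both claims above can be checked on any tokenized text. The following is a minimal sketch; the function name `frequency_profile` and the toy sentence are illustrative assumptions, and a real corpus would be needed to see figures close to the 10% and 50% quoted in the slides.

```python
from collections import Counter

def frequency_profile(tokens, top_k=2):
    """Return (share of all occurrences covered by the top_k words,
    share of distinct words that occur exactly once)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top_share = sum(c for _, c in counts.most_common(top_k)) / total
    hapax_share = sum(1 for c in counts.values() if c == 1) / len(counts)
    return top_share, hapax_share

# Toy input; far too small to be representative of a real corpus.
tokens = "the cat sat on the mat and the dog sat by the door".split()
top_share, hapax_share = frequency_profile(tokens)
print(f"top-2 share: {top_share:.2f}, words seen once: {hapax_share:.2f}")
```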
Sample Word Frequency Data
Word distribution: Zipf's Law
Zipf's Law, named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
◦ attempts to capture the distribution of the frequencies
(number of occurrences) of the words within a text.
Zipf's Law states that when the distinct words in a text
are ranked by frequency from most frequent to least
frequent, the product of rank and frequency is a constant.
Zipf's Law (continued)
Frequency * Rank = constant
That is, if the words, w, in a collection are ranked, r,
by their frequency, f, they roughly fit the relation:
r*f=c
◦ Different collections have different constants c.
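The relation r * f = c can be inspected by ranking words by frequency and multiplying each rank by its frequency. This is a sketch under assumptions: the function name `rank_frequency_table` is hypothetical, ties are broken alphabetically (the slides do not say how ties are ranked), and the toy data is constructed so that the product is exactly constant, which real text only approximates.

```python
from collections import Counter

def rank_frequency_table(tokens):
    """Return rows of (rank, word, frequency, rank * frequency),
    most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

# Artificial data chosen so that rank * frequency is the same for every word.
toy = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
for row in rank_frequency_table(toy):
    print(row)
```

On a real collection the last column drifts, especially at the very top and very bottom ranks, which is why the law is stated as a rough fit.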
Zipf’s distributions
Rank Frequency Distribution
For all the words in a collection of documents, for each word w
• f is the frequency with which w appears
• r is the rank of w in order of frequency (the most commonly occurring word has rank 1,
etc.)
[Figure: distribution of sorted word frequencies according to Zipf’s law, plotting frequency f against rank r; a word w has rank r and frequency f]
Example: Zipf's Law
The table shows the most frequently occurring words
from a collection of 336,310 documents containing
125,720,891 total words, of which 508,209 are unique.
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words
(upper cut-off). Used by almost all systems.
• Significant words: Take words in between the
most frequent (upper cut-off) and least frequent
words (lower cut-off).
• Term weighting: Give differing weights to terms
based on their frequency, with the most frequent
words weighted less. Used by almost all ranking
methods.
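As an illustration of frequency-based term weighting, the sketch below assigns each word the weight log(N / f), where N is the total number of tokens and f is the word's frequency. This particular formula is an assumption chosen for simplicity, not a scheme given in the slides; it only demonstrates the stated principle that more frequent words receive lower weights.

```python
import math
from collections import Counter

def inverse_frequency_weights(tokens):
    """Map each word to log(N / f): the more frequent the word, the lower
    its weight. Illustrative formula, not a standard named scheme."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: math.log(total / f) for w, f in counts.items()}

weights = inverse_frequency_weights("the index of the corpus and the query".split())
for word, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{word:8s} {weight:.3f}")
```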
Zipf’s Law Impact on IR
◦ Good News: Stopwords will account for a large fraction
of text so eliminating them greatly reduces inverted-
index storage costs.
◦ Bad News: For most words, gathering sufficient data for
meaningful statistical analysis (e.g. for correlation analysis
for query expansion) is difficult since they are extremely
rare.
Word significance: Luhn’s Ideas
Luhn’s idea (1958): the frequency of word occurrence in a text
furnishes a useful measurement of word significance.
Luhn suggested that both extremely common and extremely
uncommon words were not very useful for indexing.
For this, Luhn specified two cut-off points, an upper and a
lower cut-off, based on which non-significant words are
excluded
Word significance: Luhn’s Ideas
The words exceeding the upper cut-off were considered to be
common
The words below the lower cut-off were considered to be rare
Hence neither group contributes significantly to the content of the
text
The ability of words to discriminate content reaches a peak at a
rank-order position halfway between the two cut-offs
Let f be the frequency of occurrence of words in a text, and r their
rank in decreasing order of word frequency; then a plot relating f
to r gives a hyperbolic curve, as Zipf’s law (r * f = c) predicts
Luhn’s Ideas
Luhn (1958) suggested that both extremely common and
extremely uncommon words were not very useful for document
representation and indexing.
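Luhn's two cut-offs can be sketched as a simple frequency filter. The function name `significant_words` and the cut-off values below are hypothetical; the slides do not say how the cut-offs should be chosen, and in practice they are set empirically for each collection.

```python
from collections import Counter

def significant_words(tokens, lower_cutoff, upper_cutoff):
    """Keep words whose frequency f satisfies lower_cutoff <= f <= upper_cutoff.
    Words above the upper cut-off are treated as too common,
    words below the lower cut-off as too rare."""
    counts = Counter(tokens)
    return {w for w, f in counts.items() if lower_cutoff <= f <= upper_cutoff}

# Toy text: "the" is too common, "zipf" too rare, under the chosen cut-offs.
tokens = ("the " * 6 + "index " * 3 + "query " * 2 + "zipf").split()
kept = significant_words(tokens, lower_cutoff=2, upper_cutoff=4)
print(sorted(kept))
```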
Vocabulary size : Heaps’ Law
How does the size of the overall vocabulary (number of
unique words) grow with the size of the corpus?
◦ This determines how the size of the inverted index will
scale with the size of the corpus.
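Heaps’ law is commonly stated as V = K * n^β, where V is the vocabulary size, n is the number of tokens seen so far, and K and β are constants that depend on the collection. The formula is background knowledge rather than something given in the slides. The sketch below measures the actual growth curve of a token stream and evaluates the formula for assumed constants; the function names and the values of K and β are illustrative only.

```python
def vocabulary_growth(tokens, step=1):
    """Return (tokens seen, distinct words seen) pairs, sampled every `step` tokens."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            curve.append((n, len(seen)))
    return curve

def heaps_estimate(n, k, beta):
    """Vocabulary size predicted by Heaps' law, V = k * n ** beta."""
    return k * n ** beta

curve = vocabulary_growth("a b a c b d".split())
print(curve)
# k=10 and beta=0.5 are made-up constants for illustration.
print(heaps_estimate(10_000, k=10, beta=0.5))
```

Fitting K and β to the measured curve (for example by linear regression on log n versus log V) gives an estimate of how the inverted index will grow as the corpus grows.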
Text Operations
Not all words in a document are equally significant for
representing the contents/meaning of the document
◦ Some words carry more meaning than others
◦ Nouns are the most representative of a
document’s content
Therefore, the text of each document in a collection needs to be
preprocessed to select the words to be used as index terms
Text Operations (continued)
Preprocessing is the process of controlling the size of the
vocabulary or the number of distinct words used as index terms
◦ Preprocessing will lead to an improvement in the information
retrieval performance
However, some search engines on the Web omit preprocessing
◦ Every word in the document is an index term
Text operations are the process of transforming text into logical
representations
The main operations for selecting index terms are:
◦ Lexical analysis/tokenization of the text: handling digits, hyphens, punctuation marks, and the
case of letters
◦ Elimination of stop words: filter out words which are not useful in the retrieval
process
◦ Stemming words: remove affixes (prefixes and suffixes)
◦ Construction of term categorization structures such as a thesaurus/wordlist, to capture
relationships that allow the expansion of the original query with related terms
Generating Document Representatives
Text Processing System
◦ Input text – full text, abstract or title
◦ Output – a document representative adequate for use in an
automatic retrieval system
[Figure: documents → tokenization → stop words → stemming → thesaurus → index terms]
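The first three stages of this pipeline can be sketched in a few lines. Everything named here is an assumption made for illustration: the stop list is a tiny hand-picked set, and `crude_stem` strips only a few suffixes and is not a real stemming algorithm such as Porter's. The thesaurus stage is omitted.

```python
import re

STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "s")  # hypothetical suffix list, not a real stemmer

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    content = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in content]

terms = index_terms("The indexing of documents and the ranking of results")
print(terms)
```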
Lexical Analysis/Tokenization of Text
Convert the text of the documents into words to be adopted
as index terms
Objective: identify the words in the text
◦ Issues to handle: digits, hyphens, punctuation marks, case of letters
◦ Numbers are usually not good index terms (e.g. 1910, 1999),
but some, such as 510 B.C., are distinctive enough to keep
Lexical Analysis (continued)
Hyphens – break up the words (e.g. state-of-the-art = state of
the art), but some words, e.g. gilt-edged, B-49, are unique words
which require their hyphens
Punctuation marks – remove totally unless significant,
e.g. in program code, where x.exe and xexe are different
Case of letters – usually not important, so all text can be converted to
upper or lower case
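The three rules above can be sketched as a small normalizer. The exception set `KEEP_HYPHENATED` is a hypothetical stand-in for whatever list of protected terms a real system would maintain. Punctuation is stripped only from the edges of a word, so an internal dot as in x.exe survives.

```python
KEEP_HYPHENATED = {"gilt-edged", "b-49"}  # hypothetical exception list

def normalize(text):
    """Lowercase, strip edge punctuation, and split hyphenated words
    unless they appear in the exception list."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if not word:
            continue
        if word in KEEP_HYPHENATED:
            tokens.append(word)
        else:
            tokens.extend(part for part in word.split("-") if part)
    return tokens

print(normalize("State-of-the-art B-49 bombers, gilt-edged bonds."))
print(normalize("Run x.exe now."))
```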
Tokenization
Analyze text into a sequence of discrete tokens (words).
Input: “Friends, Romans and Countrymen”
Output: Tokens (a token is an instance of a sequence of characters that are
grouped together as a useful semantic unit for processing)
◦ Friends, Romans, and, Countrymen
Each such token is now a candidate for an index entry,
after further processing
But what are valid tokens to emit?
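The example above can be reproduced with a one-line regular expression. This is a minimal sketch: treating every run of word characters (`\w+`) as a token is an assumption that works for this sentence but sidesteps the harder cases on the following slides.

```python
import re

def tokenize(text):
    """Return runs of letters, digits, and underscores; punctuation is dropped."""
    return re.findall(r"\w+", text)

print(tokenize("Friends, Romans and Countrymen"))
```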
Issues in Tokenization
One word or multiple: How do you decide whether it is one token or
two or more?
◦ Hewlett-Packard: Hewlett and Packard as two tokens?
state-of-the-art: break up hyphenated sequence.
San Francisco, Los Angeles
Addis Ababa, Bahir Dar
◦ lowercase, lower-case, lower case ?
data base, database, data-base
• Numbers:
dates (3/12/91 vs. Mar. 12, 1991);
phone numbers,
IP addresses (100.2.86.144)
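One way to keep dates and IP addresses together is to try specific patterns before falling back to plain words. The patterns below are assumptions for illustration: they recognize only the numeric date form and IPv4-like strings shown in the slide, do not validate ranges, and do not cover written dates such as Mar. 12, 1991 or multi-word names such as Addis Ababa.

```python
import re

TOKEN_PATTERN = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"      # IPv4-like address, e.g. 100.2.86.144
    r"|\d{1,2}/\d{1,2}/\d{2,4}"     # numeric date, e.g. 3/12/91
    r"|\w+"                         # fallback: run of letters or digits
)

def tokenize_keeping_numbers(text):
    """Tokenize text, keeping numeric dates and IPv4-like strings whole."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_keeping_numbers("Logged 3/12/91 from 100.2.86.144 in Addis Ababa"))
```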
Issues in Tokenization
How to handle special cases involving apostrophes, hyphens
etc? C++, C#, URLs, emails, …
◦ Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a
token.
◦ However, frequently they are not.
Issues in Tokenization
Simplest approach is to ignore all numbers and punctuation and
use only case-insensitive unbroken strings of alphabetic
characters as tokens.
◦ Generally, numbers are not indexed as text, although they are often very useful. Systems will often
index “meta-data”, including creation date, format, etc., separately
Issues of tokenization are language specific
◦ Requires the language to be known
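The simplest approach described above fits in one line, and running it on a few of the earlier problem cases shows what it loses. The function name `simple_tokens` is illustrative, and the pattern `[a-z]+` assumes unaccented English text, which is one reason tokenization is language specific.

```python
import re

def simple_tokens(text):
    """Case-insensitive unbroken strings of alphabetic characters;
    numbers and punctuation are ignored."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokens("E-mail the C++ report by 1999, Republican!"))
```

Here "e-mail" splits into two tokens, "C++" collapses to "c", the year disappears, and "Republican" can no longer be told apart from "republican".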
Similarity Measure
A similarity measure is a function that computes
the degree of similarity between two vectors.
Using a similarity measure between the query
and each document:
◦ It is possible to rank the retrieved documents in the
order of presumed relevance.
◦ It is possible to enforce a certain threshold so that
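As one concrete instance of a similarity measure, the sketch below uses cosine similarity to rank documents against a query. The choice of cosine is an assumption, since the slides do not name a specific measure at this point, and the three-dimensional term vectors are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query, documents):
    """Return (name, score) pairs sorted by decreasing similarity to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical term vectors over a three-word vocabulary.
docs = {"d1": [1, 0, 1], "d2": [0, 1, 0], "d3": [1, 1, 1]}
ranking = rank_documents([1, 0, 1], docs)
for name, score in ranking:
    print(name, round(score, 3))
```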