IR Chapter 2 Text Operations

Chapter Two discusses statistical properties of text, focusing on word frequency distribution and its implications for information retrieval systems. It introduces concepts such as Zipf's Law, Luhn's Ideas on word significance, and Heaps' Law regarding vocabulary size growth. The chapter emphasizes the importance of text preprocessing and tokenization in improving retrieval performance by filtering out non-significant words.

Uploaded by

Dawit Sebhat

Chapter Two

Text Operations

Statistical Properties of Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a corpus?
◦ Such factors affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.
• A few words are very common.
◦ The two most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.

Statistical Properties of Text (continued)
• Most words are very rare.
◦ About half the distinct words in a corpus appear only once; such words are called hapax legomena (Greek for “read only once”).
Sample Word Frequency Data

Word Distribution: Zipf's Law
• Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950), attempts to capture the distribution of the frequencies (number of occurrences) of the words within a text.
• Zipf's Law states that when the distinct words in a text are ranked by frequency from most frequent to least frequent, the product of rank and frequency is approximately constant.
Zipf's Law (continued)
Frequency × Rank = constant
That is, if the words w in a collection are ranked by their frequency f, with r denoting the rank, they roughly fit the relation:
r × f = c
◦ Different collections have different constants c.
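
The relation r × f = c can be checked on any text by counting words, ranking them, and multiplying. The sketch below is only illustrative: the function name `rank_frequency` and the sample sentence are assumptions, not part of the slides, and words are taken to be lowercase runs of ASCII letters.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    # Assumption: a "word" is a lowercase run of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

sample = "the cat and the dog and the bird saw the cat"
for row in rank_frequency(sample):
    print(row)
```

A sample this small only shows the mechanics. On a large collection, the last column would be expected to stay roughly stable, with the constant differing from one collection to another.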

Zipf's Distributions
Rank-Frequency Distribution
For all the words in a collection of documents, for each word w:
• f is the frequency with which w appears
• r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Figure: distribution of sorted word frequencies according to Zipf's law, with frequency f plotted against rank r; a word w has rank r and frequency f.]
Example: Zipf's Law
• The table shows the most frequently occurring words from a collection of 336,310 documents containing 125,720,891 total words, of which 508,209 are unique.
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words: Take the words between the most frequent (upper cut-off) and the least frequent (lower cut-off).
• Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
Zipf's Law Impact on IR
◦ Good news: Stop words account for a large fraction of text, so eliminating them greatly reduces inverted-index storage costs.
◦ Bad news: Most words are extremely rare, so gathering sufficient data for meaningful statistical analysis (e.g. correlation analysis for query expansion) is difficult.
Word Significance: Luhn's Ideas
• Luhn's idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely uncommon words are not very useful for indexing.
• For this, Luhn specified two cut-off points, an upper and a lower cut-off, which are used to exclude non-significant words.
Word Significance: Luhn's Ideas (continued)
• Words exceeding the upper cut-off are considered too common.
• Words below the lower cut-off are considered too rare.
• Hence neither group contributes significantly to the content of the text.
• The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cut-offs.
• Let f be the frequency of occurrence of words in a text, and r their rank in decreasing order of word frequency; then a plot relating f and r …
[Figure: word frequency f plotted against rank r.]
Luhn's Ideas
Luhn (1958) suggested that both extremely common and extremely uncommon words are not very useful for document representation and indexing.
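
Luhn's two cut-offs can be sketched as a simple frequency filter. The slides give no numeric cut-off values, so the thresholds below are hypothetical and would need tuning per collection. The function name `significant_words` is likewise an assumption, and the cut-offs are expressed here as raw frequencies rather than rank positions for simplicity.

```python
from collections import Counter

def significant_words(tokens, lower_cutoff, upper_cutoff):
    """Keep words whose frequency lies between the two cut-offs (inclusive).

    Words above upper_cutoff are treated as too common and words below
    lower_cutoff as too rare. Both thresholds are illustrative assumptions.
    """
    counts = Counter(tokens)
    return {w for w, f in counts.items() if lower_cutoff <= f <= upper_cutoff}

tokens = "the the the the of of of retrieval retrieval index index zipf".split()
print(sorted(significant_words(tokens, lower_cutoff=2, upper_cutoff=2)))
```

With these toy thresholds, "the" and "of" fall above the upper cut-off, "zipf" falls below the lower one, and only "index" and "retrieval" are kept.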
Vocabulary Size: Heaps' Law
• How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
◦ This determines how the size of the inverted index will scale with the size of the corpus.
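
Heaps' Law is commonly written V = K · n^β, where V is the vocabulary size, n is the number of tokens read so far, and K and β are constants that depend on the collection (the slides do not state this formula). A minimal sketch for observing vocabulary growth empirically is shown below; the function name `vocabulary_growth` and the `step` parameter are assumptions for illustration.

```python
def vocabulary_growth(tokens, step):
    """Return (tokens_seen, vocabulary_size) pairs sampled every `step` tokens."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        # Record a point every `step` tokens and at the very end.
        if i % step == 0 or i == len(tokens):
            points.append((i, len(seen)))
    return points

tokens = "a b a c a b d a b c".split()
print(vocabulary_growth(tokens, step=5))
```

Plotting these points for a real corpus would be expected to show vocabulary growing more slowly than corpus size, which is what makes the inverted index scale sublinearly.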

Text Operations
• Not all words in a document are equally significant for representing the contents or meaning of the document.
◦ Some words carry more meaning than others.
◦ Nouns are the most representative of a document's content.
• Therefore, the text of each document in a collection needs to be preprocessed to select the words to be used as index terms.
Text Operations (continued)
• Preprocessing is the process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms.
◦ Preprocessing is expected to improve information retrieval performance.
• However, some search engines on the Web omit preprocessing.
◦ Every word in the document is an index term.

• Text operations are the process of transforming text into logical representations.
• The main operations for selecting index terms are:
◦ Lexical analysis/tokenization of the text: handling digits, hyphens, punctuation marks, and the case of letters
◦ Elimination of stop words: filter out words which are not useful in the retrieval process
◦ Stemming words: remove affixes (prefixes and suffixes)
◦ Construction of term categorization structures such as a thesaurus or word list, to capture relationships that allow the original query to be expanded with related terms
Generating Document Representatives
• Text Processing System
◦ Input text: full text, abstract or title
◦ Output: a document representative adequate for use in an automatic retrieval system
documents → tokenization → stop words → stemming → thesaurus → index terms
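
The processing chain above can be sketched end to end in a few lines. Everything named here is an assumption for illustration: the stop list is a tiny hand-picked sample, `crude_stem` is a toy suffix stripper (not the Porter stemmer or any standard algorithm), and the thesaurus step is left out.

```python
import re

STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}  # illustrative sample only
SUFFIXES = ("ing", "ed", "es", "s")  # toy suffix list, checked in this order

def tokenize(text):
    """Lowercase the text and keep unbroken runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Run tokenization, stop-word removal and stemming; return sorted unique terms."""
    tokens = remove_stop_words(tokenize(text))
    return sorted({crude_stem(t) for t in tokens})

print(index_terms("The indexing of documents and the retrieved documents"))
```

Each stage is a separate function so that a stage can be swapped out, for example replacing the toy stemmer with a real one, without touching the rest of the chain.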
Lexical Analysis/Tokenization of Text
• Change the text of the documents into words to be adopted as index terms.
• Objective: identify words in the text.
◦ Issues: digits, hyphens, punctuation marks, case of letters
◦ Numbers are usually not good index terms (like 1910, 1999), but some, such as 510 B.C., are unique and worth keeping.
Lexical Analysis (continued)
• Hyphens: break up hyphenated words (e.g. state-of-the-art = state of the art), but some terms, e.g. gilt-edged and B-49, are unique words which require their hyphens.
• Punctuation marks: remove entirely unless significant, e.g. in program code, x.exe and xexe are different.
• Case of letters: usually not important, so all text can be converted to upper or lower case.
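
The hyphen, punctuation and case rules above can be sketched as follows. The exception set `KEEP_HYPHENATED` is a hypothetical, hand-maintained list seeded with the two examples from the slide; a real system would need a much larger one or a different strategy. Only punctuation at the edges of a word is stripped, so internal punctuation such as the dot in x.exe survives.

```python
KEEP_HYPHENATED = {"gilt-edged", "b-49"}  # assumed exception list, lowercase

def lexical_tokens(text):
    """Lowercase, trim edge punctuation, and split hyphens unless excepted."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if not word:
            continue
        if word in KEEP_HYPHENATED:
            tokens.append(word)  # hyphen is part of the term
        else:
            tokens.extend(part for part in word.split("-") if part)
    return tokens

print(lexical_tokens("State-of-the-art B-49, gilt-edged bonds."))
print(lexical_tokens("Run x.exe now."))
```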
Tokenization
• Analyze text into a sequence of discrete tokens (words).
• Input: “Friends, Romans and Countrymen”
• Output: tokens (a token is an instance of a sequence of characters that are grouped together as a useful semantic unit for processing)
◦ Friends, Romans, and, Countrymen
• Each such token is now a candidate for an index entry, after further processing.
• But what are valid tokens to emit?

Issues in Tokenization
• One word or multiple: how do you decide whether a sequence is one token or two or more?
◦ Hewlett-Packard → Hewlett and Packard as two tokens?
  • state-of-the-art: break up the hyphenated sequence?
  • San Francisco, Los Angeles
  • Addis Ababa, Bahir Dar
◦ lowercase, lower-case, lower case?
  • data base, database, data-base
• Numbers:
  • dates (3/12/91 vs. Mar. 12, 1991)
  • phone numbers
  • IP addresses (100.2.86.144)
Issues in Tokenization (continued)
• How should special cases involving apostrophes, hyphens, etc. be handled? C++, C#, URLs, emails, …
◦ Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
◦ However, frequently they are not.
Issues in Tokenization (continued)
• The simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
◦ Generally, numbers are not indexed as text, even though they are often very useful. Systems will often index “meta-data”, including creation date, format, etc., separately.
• Issues of tokenization are language specific.
◦ Tokenization requires the language to be known.
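
The simplest approach described above fits in one regular expression. This sketch assumes English text and treats only the ASCII letters a-z as alphabetic, which is itself a language-specific choice; the function name `simple_tokenize` is an assumption.

```python
import re

def simple_tokenize(text):
    """Case-insensitive unbroken runs of ASCII letters; digits and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("Friends, Romans and Countrymen"))
print(simple_tokenize("C++ in 1999: e-mail me"))
```

The second call shows the cost of this simplicity: "C++" collapses to "c", "1999" disappears, and "e-mail" splits into "e" and "mail".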

Similarity Measure
• A similarity measure is a function that computes the degree of similarity between two vectors.
• Using a similarity measure between the query and each document:
◦ It is possible to rank the retrieved documents in the order of presumed relevance.
◦ It is possible to enforce a certain threshold so that
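
Ranking and thresholding with a similarity measure can be sketched as below. The slides do not name a specific measure, so cosine similarity is used here as one common choice, not as the method the chapter prescribes. The function names, the term-weight vectors and the threshold value are all illustrative assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query, docs, threshold=0.0):
    """Score each document against the query, drop scores below threshold, sort descending."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

docs = {"d1": [1, 1], "d2": [1, 0], "d3": [0, 0]}  # hypothetical term-weight vectors
for name, score in rank_documents([1, 1], docs, threshold=0.5):
    print(name, round(score, 3))
```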