Implementation

The document describes the basic steps to implement a vector space retrieval model: [1] preprocess documents by tokenizing, removing stop words, and stemming; [2] build an inverted index that maps terms to the documents they appear in along with term and document frequencies; [3] use the inverted index to retrieve documents for a query and incrementally calculate cosine similarity scores; [4] rank documents by their similarity scores. Key aspects are weighting terms based on tf-idf, evaluating using precision and recall, and the linear time complexity of indexing.

Uploaded by

Rihab BEN LAMINE

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Naïve Implementation

Convert all documents in collection D to tf-idf weighted vectors, dj, for keyword vocabulary V.
Convert query to a tf-idf-weighted vector q.
For each dj in D do
    Compute score sj = cosSim(dj, q)
Sort documents by decreasing score.
Present top ranked documents to the user.

Time complexity: O(|V|·|D|), which is bad for large V and D!

|V| = 10,000; |D| = 100,000; |V|·|D| = 1,000,000,000
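
The naïve procedure above could be sketched roughly as follows. This is only an illustration under stated assumptions: the idf variant log(N / df), the function names, and the three toy documents are not from the slides. Every document gets a dense vector with |V| entries, which is where the O(|V|·|D|) cost comes from.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, idf, dense tf-idf vectors) for tokenized docs."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    # Assumed idf variant: log(N / df); the slides do not fix a formula.
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocab])  # |V| entries per doc
    return vocab, idf, vectors

def cos_sim(a, b):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def naive_rank(docs, query):
    """Score every document against the query, then sort by decreasing score."""
    vocab, idf, vectors = tfidf_vectors(docs)
    q_tf = Counter(query)
    q = [q_tf[t] * idf[t] for t in vocab]
    scores = [(j, cos_sim(d, q)) for j, d in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy collection of already-tokenized documents.
docs = [
    ["computer", "science", "computer"],
    ["database", "system"],
    ["science", "database", "science"],
]
ranking = naive_rank(docs, ["computer", "science"])
```

In this toy run the second document shares no term with the query and scores zero, yet it is still scored. That is the wasted work the practical implementation avoids.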

Practical Implementation

• Based on the observation that documents containing none of the query keywords do not affect the final ranking
• Try to identify only those documents that contain at least one query keyword
• Actual implementation of an inverted index

Step 1: Preprocessing

• Implement the preprocessing functions:
  – For tokenization
  – For stop word removal
  – For stemming
• Input: Documents that are read one by one from the collection
• Output: Tokens to be added to the index
  – No punctuation, no stop-words, stemmed
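
A minimal sketch of the three preprocessing functions, under stated assumptions: the stop-word list is a tiny hypothetical one, and `stem` is a crude suffix stripper standing in for a real stemmer (for example Porter's), which the slides do not specify.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real systems use a longer one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and keep alphanumeric runs, which drops punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude suffix stripping; an assumed stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Document text in, index-ready tokens out."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]

tokens = preprocess("The systems are indexing the databases.")
# tokens == ["system", "index", "databas"]
```

The stem "databas" is not a dictionary word, which is acceptable as long as the query goes through the same function and so maps to the same index entry.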

Step 2: Indexing

• Build an inverted index, with an entry for each word in the vocabulary
• Input: Tokens obtained from the preprocessing module
• Output: An inverted index for fast access

Step 2 (cont’d)

• Many data structures are appropriate for fast access
  – B-trees, sparse lists, hashtables
• We need:
  – One entry for each word in the vocabulary
  – For each such entry:
    • Keep a list of all the documents where it appears together with the corresponding frequency → TF
  – For each such entry, keep the total number of documents where the word occurred:
    • → IDF

Step 2 (cont’d)

Index file           Lists
Index terms    df    Dj, tfj
computer       3     D7, 4
database       2     D1, 3
...
science        4     D2, 4
system         1     D5, 2
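
One way to hold such an index in memory is a hashtable keyed by term, where each entry stores the df and a postings list of (document id, tf) pairs. The sketch below uses a Python dict; the layout, the function name, and the sample documents are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """docs: mapping of document id -> list of preprocessed tokens.

    Returns {term: {"df": int, "postings": [(doc_id, tf), ...]}}.
    """
    postings = defaultdict(list)
    for doc_id, tokens in docs.items():
        # One pass per document: count each term, then append one posting per term.
        for term, tf in Counter(tokens).items():
            postings[term].append((doc_id, tf))
    # df is the number of documents containing the term, i.e. the postings length.
    return {t: {"df": len(p), "postings": p} for t, p in postings.items()}

# Hypothetical toy collection.
docs = {
    "D1": ["database", "database", "database", "system"],
    "D2": ["science", "science", "computer"],
    "D3": ["database", "science"],
}
index = build_index(docs)
# index["database"] == {"df": 2, "postings": [("D1", 3), ("D3", 1)]}
```

Both tf and df fall out of the same single pass over the collection.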

Step 2 (cont’d)

• Term frequencies and DF for each token can be computed in one pass
• Cosine similarity also requires the lengths of the document vectors.
• Might need a second pass (through document collection or the inverted index) to compute document vector lengths.

Step 2 (cont’d)

– Remember the weight of a token is: TF * IDF
– Therefore, must wait until IDFs are known (and therefore until all documents are indexed) before document lengths can be determined.
– Remember that the length of a document vector is the square root of the sum of the squares of the weights of its tokens.
• Do a second pass over all documents: keep a list or hashtable with all document ids, and for each document determine the length of its vector.
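
The second pass could look roughly like this when it runs over the inverted index rather than the raw documents. The index layout (term mapped to its df and a postings list), the idf variant log(N / df), and the sample data are assumptions for illustration.

```python
import math
from collections import defaultdict

def document_lengths(index, n_docs):
    """Second pass over the index: sum squared tf-idf weights per document."""
    squared = defaultdict(float)  # hashtable: doc id -> sum of squared weights
    for entry in index.values():
        # Assumed idf variant: log(N / df); only known once indexing is complete.
        idf = math.log(n_docs / entry["df"])
        for doc_id, tf in entry["postings"]:
            squared[doc_id] += (tf * idf) ** 2
    return {doc_id: math.sqrt(total) for doc_id, total in squared.items()}

# Hypothetical index for a three-document collection.
index = {
    "database": {"df": 2, "postings": [("D1", 3), ("D3", 1)]},
    "system": {"df": 1, "postings": [("D1", 1)]},
    "science": {"df": 2, "postings": [("D2", 2), ("D3", 1)]},
    "computer": {"df": 1, "postings": [("D2", 1)]},
}
lengths = document_lengths(index, n_docs=3)
```

Each posting is visited exactly once, so this pass costs no more than building the index did.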

Time Complexity of Indexing

• Complexity of creating vector and indexing a document of n tokens is O(n).
• So indexing m such documents is O(m n).
• Computing token IDFs can be done during the same first pass
• Computing vector lengths is also O(m n).
• Complete process is O(m n), which is also the complexity of just reading in the corpus.

Step 3: Retrieval

• Use inverted index (from step 2) to find the limited set of documents that contain at least one of the query words.
• Incrementally compute cosine similarity of each indexed document as query words are processed one by one.
• To accumulate a total score for each retrieved document, store retrieved documents in a hashtable, where the document id is the key, and the partial accumulated score is the value.

Step 3 (cont’d)

• Input: Query and Inverted Index (from Step 2)
• Output: Similarity values between query and documents
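
The retrieval step might be sketched as below: query terms are processed one at a time, and a dict accumulates the partial dot product per document before the final normalization by vector lengths. The index layout, the idf variant log(N / df), the precomputed `lengths`, and the sample data are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def retrieve(query_tokens, index, lengths, n_docs):
    """Return {doc_id: cosine similarity} for documents sharing a query term."""
    scores = defaultdict(float)  # hashtable: doc id -> partial accumulated score
    q_squared = 0.0
    for term, q_tf in Counter(query_tokens).items():
        entry = index.get(term)
        if entry is None:
            continue  # simplification: out-of-vocabulary query terms are ignored
        idf = math.log(n_docs / entry["df"])  # assumed idf variant
        q_weight = q_tf * idf
        q_squared += q_weight ** 2
        for doc_id, tf in entry["postings"]:
            scores[doc_id] += q_weight * (tf * idf)
    q_length = math.sqrt(q_squared)
    if q_length == 0:
        return {}
    return {d: s / (q_length * lengths[d])
            for d, s in scores.items() if lengths[d] > 0}

# Hypothetical index and document lengths for a three-document collection.
index = {
    "database": {"df": 2, "postings": [("D1", 3), ("D3", 1)]},
    "system": {"df": 1, "postings": [("D1", 1)]},
    "science": {"df": 2, "postings": [("D2", 2), ("D3", 1)]},
    "computer": {"df": 1, "postings": [("D2", 1)]},
}
lengths = {
    "D1": math.sqrt((3 * math.log(1.5)) ** 2 + math.log(3) ** 2),
    "D2": math.sqrt((2 * math.log(1.5)) ** 2 + math.log(3) ** 2),
    "D3": math.sqrt(2 * math.log(1.5) ** 2),
}
scores = retrieve(["database", "system"], index, lengths, n_docs=3)
```

Here D2 never enters the hashtable because it contains neither query term, so only D1 and D3 are scored.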

Step 4: Ranking

• Sort the hashtable of retrieved documents by cosine similarity value
• Return the documents in descending order of their relevance
• Input: Similarity values between query and documents
• Output: Ranked list of documents in descending order of their relevance
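
A small sketch of the ranking step, assuming the similarity values arrive as a dict from document id to score. The function name, the optional cutoff `k`, and the scores are illustrative only.

```python
def rank(scores, k=None):
    """Sort {doc_id: similarity} by decreasing similarity; keep the top k if given."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if k is None else ranked[:k]

scores = {"D1": 0.89, "D3": 0.24, "D7": 0.51}  # illustrative values
top = rank(scores, k=2)
# top == [("D1", 0.89), ("D7", 0.51)]
```

When only a few top documents are needed from a large result set, `heapq.nlargest` from the standard library is a common alternative to a full sort.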

What weighting methods?

• Weights applied to both document terms and query terms
• Direct impact on the final ranking
• → Direct impact on the results
• → Direct impact on the quality of IR system

Standard Evaluation Measures

Starts with a CONTINGENCY table for each query

                retrieved        not retrieved
relevant        TP               FN               n1 = TP + FN
not relevant    FP               TN
                n2 = TP + FP                      N

Precision and Recall

From all the documents that are relevant out there, how many did the IR system retrieve?

Recall = TP / (TP + FN)

From all the documents that are retrieved by the IR system, how many are relevant?

Precision = TP / (TP + FP)
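
As a worked example, both measures can be computed from the set of retrieved documents and the set of relevant documents for one query. The document ids are made up, and returning 0.0 for an empty denominator is an assumed convention, not something the slides define.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query from two collections of doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    # Assumed convention: report 0.0 when the denominator would be zero.
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

precision, recall = precision_recall(
    retrieved=["D1", "D3", "D7", "D9"],
    relevant=["D1", "D7", "D8"],
)
# TP = 2, FP = 2, FN = 1, so precision = 2/4 and recall = 2/3
```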
