
Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

1
Naïve Implementation
Convert all documents in collection D to tf-idf weighted vectors, dj, for keyword vocabulary V.
Convert the query to a tf-idf weighted vector q.
For each dj in D, compute the score sj = cosSim(dj, q).
Sort documents by decreasing score.
Present the top-ranked documents to the user.

Time complexity: O(|V|·|D|), which is bad for large V and D!

Example: |V| = 10,000; |D| = 100,000; |V|·|D| = 1,000,000,000
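
A minimal Python sketch of this naïve loop (illustrative only; the dictionary-based vectors and the cos_sim helper are assumptions, not part of the slides):

import math

def cos_sim(d, q):
    # d and q map term -> tf-idf weight
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

def naive_retrieval(doc_vectors, query_vector):
    # Scores every document, even those sharing no terms with the query: O(|V|·|D|) work
    scores = [(doc_id, cos_sim(vec, query_vector)) for doc_id, vec in doc_vectors.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores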

2
Practical Implementation

• Based on the observation that documents containing none of the query keywords do not affect the final ranking
• Try to identify only those documents that contain at least one query keyword
• Relies on an actual implementation of an inverted index

3
Step 1: Preprocessing

• Implement the preprocessing functions:
  – For tokenization
  – For stop word removal
  – For stemming
• Input: Documents that are read one by one from the collection
• Output: Tokens to be added to the index
  – No punctuation, no stop words, stemmed
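
A minimal preprocessing sketch, assuming a simple regex tokenizer, a small illustrative stop-word set, and a crude suffix-stripping stemmer (a Porter stemmer and a full stop list would be used in practice):

import re

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}  # illustrative subset only

def stem(token):
    # Crude suffix stripping, standing in for a real stemmer
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    # Tokenize (dropping punctuation), remove stop words, then stem
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]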
4
Step 2: Indexing

• Build an inverted index, with an entry for each word in the vocabulary
• Input: Tokens obtained from the preprocessing module
• Output: An inverted index for fast access

5
Step 2 (cont’d)

• Many data structures are appropriate for fast access
  – B-trees, sparse lists, hashtables
• We need:
  – One entry for each word in the vocabulary
  – For each such entry:
    • Keep a list of all the documents where it appears, together with the corresponding frequency → TF
  – For each such entry, keep the total number of documents where the word occurred → DF (used to compute IDF)

6
Step 2 (cont’d)

Index terms   df    Postings (Dj, tfj)
computer       3    D7, 4
database       2    D1, 3
science        4    D2, 4
system         1    D5, 2

Index file → postings lists
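
A hashtable-based sketch of this structure (names and layout are assumptions): each vocabulary entry keeps a postings list of (document id, TF) pairs, and df is simply the length of that list:

from collections import defaultdict

def build_index(docs):
    # docs maps doc_id -> list of preprocessed tokens
    postings = defaultdict(dict)          # term -> {doc_id: tf}
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings[token][doc_id] = postings[token].get(doc_id, 0) + 1
    df = {term: len(doc_tfs) for term, doc_tfs in postings.items()}
    return postings, df

# e.g. postings["computer"] might be {"D7": 4, ...} and df["computer"] == 3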

7
Step 2 (cont’d)

• Term frequencies and DF for each token can be computed in one pass
• Cosine similarity also requires the lengths of the document vectors
• Might need a second pass (through the document collection or the inverted index) to compute document vector lengths

8
Step 2 (cont’d)
– Remember that the weight of a token is TF * IDF
– Therefore, we must wait until IDFs are known (and therefore until all documents are indexed) before document lengths can be determined.
– Remember that the length of a document vector is the square root of the sum of the squares of the weights of its tokens.
• Do a second pass over all documents: keep a list or hashtable with all document IDs, and for each document determine the length of its vector.
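
A sketch of that second pass, run over the inverted index rather than the raw collection; the IDF definition log(N/df) is a common choice assumed here, not fixed by the slides:

import math

def compute_doc_lengths(postings, df, num_docs):
    # |dj| = sqrt(sum over terms of (tf * idf)^2)
    sq_lengths = {}
    for term, doc_tfs in postings.items():
        idf = math.log(num_docs / df[term])
        for doc_id, tf in doc_tfs.items():
            w = tf * idf
            sq_lengths[doc_id] = sq_lengths.get(doc_id, 0.0) + w * w
    return {doc_id: math.sqrt(s) for doc_id, s in sq_lengths.items()}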
9
Time Complexity of Indexing

• Complexity of creating the vector and indexing a document of n tokens is O(n).
• So indexing m such documents is O(m n).
• Computing token IDFs can be done during the same first pass.
• Computing vector lengths is also O(m n).
• The complete process is O(m n), which is also the complexity of just reading in the corpus.

10
Step 3: Retrieval

• Use the inverted index (from Step 2) to find the limited set of documents that contain at least one of the query words.
• Incrementally compute the cosine similarity of each indexed document as query words are processed one by one.
• To accumulate a total score for each retrieved document, store retrieved documents in a hashtable, where the document id is the key and the partial accumulated score is the value.
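
A sketch of this incremental scoring, reusing the index, df, and document lengths from the earlier sketches (all helper names are assumptions):

import math

def retrieve(query_tokens, postings, df, doc_lengths, num_docs):
    # Build the tf-idf weighted query vector, ignoring terms absent from the collection
    q_counts = {}
    for t in query_tokens:
        q_counts[t] = q_counts.get(t, 0) + 1
    q_weights = {t: tf * math.log(num_docs / df[t]) for t, tf in q_counts.items() if t in df}
    q_length = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_length == 0:
        return {}
    # Accumulate partial dot products only for documents containing a query term
    scores = {}
    for term, q_w in q_weights.items():
        idf = math.log(num_docs / df[term])
        for doc_id, tf in postings[term].items():
            scores[doc_id] = scores.get(doc_id, 0.0) + q_w * tf * idf
    # Normalize the partial sums into cosine similarities
    return {doc_id: s / (doc_lengths[doc_id] * q_length) for doc_id, s in scores.items()}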
11
Step 3 (cont’d)

• Input: Query and inverted index (from Step 2)
• Output: Similarity values between the query and the documents

12
Step 4: Ranking

• Sort the hashtable of retrieved documents by the value of cosine similarity
• Return the documents in descending order of their relevance
• Input: Similarity values between the query and the documents
• Output: Ranked list of documents in descending order of their relevance
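
Continuing the same sketch, ranking is a single sort of the accumulated scores:

def rank(scores):
    # Highest cosine similarity first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Usage: for doc_id, score in rank(retrieve(query_tokens, postings, df, doc_lengths, num_docs)): ...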
13
What weighting methods?

• Weights applied to both document terms and query terms
• Direct impact on the final ranking
  → Direct impact on the results
  → Direct impact on the quality of the IR system
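
For concreteness, one commonly used tf-idf weight (one of several possible schemes; the slides do not fix the exact formula):

import math

def tf_idf(tf, df, num_docs):
    # Weight grows with in-document frequency and with rarity across the collection
    return tf * math.log(num_docs / df) if df else 0.0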

14
Standard Evaluation Measures

Starts with a CONTINGENCY table for each query

               retrieved       not retrieved   total
relevant       TP              FN              n1 = TP + FN
not relevant   FP              TN
total          n2 = TP + FP                    N
15
Precision and Recall

From all the documents that are relevant out there, how many did the IR system retrieve?

  Recall = TP / (TP + FN)

From all the documents that are retrieved by the IR system, how many are relevant?

  Precision = TP / (TP + FP)
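
A small worked example of the two measures (the counts are made up for illustration):

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 8 relevant retrieved, 2 irrelevant retrieved, 4 relevant missed:
# precision_recall(8, 2, 4) -> (0.8, 0.666...)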

16
