Search Engine Architecture 1

The document describes the key components of a search engine architecture. It discusses the indexing process which involves acquiring documents, transforming text, analyzing links and metadata, collecting statistics, and building inverted indexes. It also covers the query process which uses the indexes to return relevant results. The goal is to provide effective yet efficient search by representing documents and queries in a way that supports fast retrieval of relevant information.


Search Engine Architecture I

Software Architecture
- The high-level structure of a software system
  - Software components
  - The interfaces provided by those components
  - The relationships between those components

J. Pei: Information Retrieval and Web Search -- Search Engine Architecture

UIMA
- An architecture to provide a standard for integrating search and related language technology components
- Unstructured Information Management Architecture (www.research.ibm.com/UIMA)
- Defining interfaces for components to simplify the addition of new technologies into systems that handle text and other unstructured data

UIMA

[Figure: UIMA block diagram. Source: http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.architectureHighlights.html/$FILE/blockDiagram.gif]

A Good Reference


Primary Goals of Search Engines

- Effectiveness (quality): retrieve the most relevant set of documents for a query
  - Process text and store text statistics to improve relevance
- Efficiency (speed): process queries from users as fast as possible
  - Use specialized data structures
- Specific goals usually fall under these two primary goals
  - Example: handling changing document collections is both an effectiveness issue and an efficiency issue

Two Major Functions

- Search engine components support two major functions
  - The indexing process: building data structures that enable searching
  - The query process: using those data structures to produce a ranked list of documents for a user's query

The Indexing Process


Text Acquisition
- Identifying and making available the documents that will be searched
- How?
  - Crawling or scanning the web, a corporate intranet, or other sources of information
  - Building a document data store containing the text and metadata for all the documents
    - Metadata: document type, document structure, document length, ...

Crawlers
- Identifying and acquiring documents for the search engine
- Web crawler: following links on web pages to discover new pages
  - Efficiency: how to handle the huge volume of new pages and updated pages
- A web crawler restricted to a single site supports site search
- Topic-based (focused) crawlers: using classification techniques to restrict crawling to pages that are likely relevant to a specific topic
  - Used in vertical or topical search
- Enterprise document crawler: following links to discover both internal and external pages
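The link-following idea behind a web crawler can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the in-memory `TOY_WEB` graph, the `toy_fetch_links` helper, and the `same_site_only` flag are all hypothetical names introduced here, and a real crawler would fetch pages over HTTP and respect politeness rules.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> list of outgoing links.
TOY_WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://a.example/about", "http://b.example/news"],
    "http://b.example/news": [],
}

def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])

def crawl(seed, fetch_links, same_site_only=False):
    """Breadth-first crawl from seed; returns URLs in discovery order."""
    site = urlparse(seed).netloc
    seen = {seed}
    frontier = deque([seed])
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            # Site search: skip links that leave the seed's host.
            if same_site_only and urlparse(link).netloc != site:
                continue
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

all_pages = crawl("http://a.example/", toy_fetch_links)
site_pages = crawl("http://a.example/", toy_fetch_links, same_site_only=True)
```

With `same_site_only=True` the same loop behaves like a site-search crawler, since links to other hosts are never added to the frontier.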

Document Feeds
- A mechanism for accessing a real-time stream of documents
- RSS: a common standard used for web feeds for content such as news, blogs, or video
  - An RSS reader subscribes to RSS feeds and provides new content when it arrives
  - RSS feeds are formatted in XML
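Because RSS feeds are XML, a reader can be sketched with a standard XML parser. The example below is a rough sketch assuming a simple RSS 2.0 layout; `SAMPLE_FEED` and `new_items` are hypothetical names, and a real reader would download the feed periodically rather than parse a fixed string.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real reader would fetch this over HTTP.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title><link>http://news.example/1</link></item>
    <item><title>Second story</title><link>http://news.example/2</link></item>
  </channel>
</rss>"""

def new_items(feed_xml, seen_links):
    """Return (title, link) pairs not seen before and remember them."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link not in seen_links:
            seen_links.add(link)
            fresh.append((item.findtext("title"), link))
    return fresh

seen = set()
first_poll = new_items(SAMPLE_FEED, seen)   # both items are new
second_poll = new_items(SAMPLE_FEED, seen)  # nothing new on re-poll
```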

Conversion
- Converting a variety of formats (e.g., HTML, XML, PDF, ...) into a consistent text and metadata format
- Resolving encoding problems
  - Using ASCII (7 bits) or extended ASCII (8 bits) for English
  - Using Unicode for international languages (e.g., UTF-8, or UTF-16 with 16-bit code units)
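The encoding issue can be illustrated by looking at how many bytes the same text takes under different encodings, and by normalizing incoming bytes to one internal text form. The `to_text` helper and its Latin-1 fallback are assumptions made for this sketch, not a prescribed conversion policy.

```python
def to_text(raw, declared_encoding="utf-8"):
    """Decode bytes to str, falling back to Latin-1 if decoding fails."""
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # Latin-1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")

word = "caf\u00e9"  # "cafe" with an accented final letter

ascii_size = len("cafe".encode("ascii"))    # one byte per character
utf8_size = len(word.encode("utf-8"))       # accented letter takes 2 bytes
utf16_size = len(word.encode("utf-16-le"))  # one 16-bit unit per character here

# Bytes written as Latin-1 but declared as UTF-8 still come out readable.
recovered = to_text(word.encode("latin-1"), "utf-8")
```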

Document Data Stores

- A simple database to manage large numbers of documents and structured data
- Document components are typically stored in a compressed form for efficiency
- Structured data consists of document metadata and other information extracted from the documents, such as links and anchor text
- [Discussion] Why do we need document data stores local to search engines?
  - The original documents are available on the web
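A document data store of this kind can be sketched as compressed text plus a metadata record per document. The `DocumentStore` class below is a hypothetical in-memory illustration using `zlib`; real systems typically use purpose-built storage, and the metadata fields shown are examples only.

```python
import zlib

class DocumentStore:
    """Toy in-memory store: compressed text plus structured metadata."""

    def __init__(self):
        self._text = {}
        self._meta = {}

    def put(self, doc_id, text, **metadata):
        # Store the text compressed; keep metadata as a plain dict.
        self._text[doc_id] = zlib.compress(text.encode("utf-8"))
        self._meta[doc_id] = {**metadata, "length": len(text)}

    def get_text(self, doc_id):
        return zlib.decompress(self._text[doc_id]).decode("utf-8")

    def get_meta(self, doc_id):
        return self._meta[doc_id]

    def stored_bytes(self, doc_id):
        return len(self._text[doc_id])

store = DocumentStore()
page = "tropical fish " * 100
store.put("d1", page, doc_type="html", links=["http://a.example/about"])
```

Keeping a local copy means snippets and re-indexing do not depend on re-fetching the original page, which may have changed or disappeared.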

The Indexing Process


Text Transformation / Index Creation

- Text transformation: transforming documents into index terms or features
  - Index terms: the parts of a document that are stored in the index and used in searching
  - Features: parts of a text document that are used to represent its content
    - Examples: phrases, names, dates, links, ...
  - Index vocabulary: the set of all the terms that are indexed for a document collection
- Index creation: creating the indexes or data structures
  - Example: building inverted list indexes

Parsers, Stopping, and Stemming

- Parsing: processing the sequence of text tokens in the document to recognize structural elements such as titles, figures, links, and headings
  - Tokenization: identifying units to be indexed
  - Using syntax of markup languages to identify structures
- Stopping: removing common words from the stream of tokens, e.g., the, of, to, ...
  - Reducing index size considerably
- Stemming: grouping words that are derived from a common stem
  - Example: fish, fishes, fishing
  - Increases the likelihood that words used in queries and documents will match
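Tokenization, stopping, and stemming can be chained into a small pipeline. This is a simplified sketch: the `STOPWORDS` list is a tiny illustrative sample, and `stem` is a naive suffix stripper written for this example, not the Porter stemmer or any other published algorithm.

```python
import re

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "to", "a", "and", "in", "is"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Naive suffix stripping (an assumption of this sketch)."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokenize, drop stopwords, then stem what is left."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

stems = [stem(w) for w in ("fish", "fishes", "fishing")]
terms = index_terms("The fishes of the sea")
```

All three example words map to the same stem, so a query containing one form can match documents containing another.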

Other Text Transformation Tasks

- Link extraction and analysis
  - Links can be indexed separately from the general text content
  - Using link analysis algorithms, e.g., PageRank, to quantify page popularity and find authority pages
  - Using anchor text to enhance the text content of the page that the link points to
- Information extraction: identifying index terms that are more complex than single words
  - Entity identification, e.g., finding names
- Classifiers: identifying class-related metadata for documents or parts of documents
  - Example: finding spam documents and non-content parts of documents (e.g., ads)
  - Alternatively, clustering related documents
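As a rough illustration of link analysis, PageRank can be approximated by repeated redistribution of rank along links. The sketch below assumes a damping factor of 0.85 and a fixed number of iterations, both common choices rather than values given in these slides, and it assumes every link target also appears as a key of the hypothetical `links` dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> rank."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical three-page graph: A and B both link to C; C links to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

In this toy graph, C collects links from two pages and ends up with the highest rank, while B, which nothing links to, ends up with the lowest.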

The Indexing Process


Collecting Document Statistics

- Gathering and recording statistical information about words, features, and documents
  - Statistics will be used to compute scores of documents
  - Stored in lookup tables
- Examples
  - Counts of index term occurrences
  - Positions in the documents where the index terms occurred
  - Counts of occurrences over groups of documents
  - Lengths of documents in terms of the number of tokens
- The actual data collected depends on the retrieval model and the associated ranking method
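The example statistics listed above can be gathered in a single pass over tokenized documents. This is a minimal sketch; the function name `collect_statistics` and the layout of the lookup tables are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def collect_statistics(docs):
    """docs: doc_id -> list of tokens. Returns four lookup tables."""
    term_counts = {}               # doc_id -> Counter of term occurrences
    positions = defaultdict(dict)  # term -> {doc_id: [positions]}
    doc_freq = Counter()           # term -> number of documents containing it
    doc_length = {}                # doc_id -> number of tokens
    for doc_id, tokens in docs.items():
        term_counts[doc_id] = Counter(tokens)
        doc_length[doc_id] = len(tokens)
        for pos, term in enumerate(tokens):
            positions[term].setdefault(doc_id, []).append(pos)
        # Count each term once per document.
        doc_freq.update(set(tokens))
    return term_counts, positions, doc_freq, doc_length

docs = {
    "d1": ["tropical", "fish", "tank", "fish"],
    "d2": ["fish", "food"],
}
term_counts, positions, doc_freq, doc_length = collect_statistics(docs)
```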

Weighting
- Calculating index term weights using the document statistics and storing them in lookup tables
  - Pre-computation can improve query answering efficiency
- TF/IDF weighting
  - TF (term frequency): the frequency of index term occurrences in a document
  - IDF (inverse document frequency): the inverse of the frequency of index term occurrences in all documents, N/n (N: number of documents indexed; n: number of documents containing a particular term)
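A pre-computed weight table can be sketched as follows. The slide defines IDF as N/n; this sketch uses the logarithm of that ratio, log(N/n), which is a common variant and an assumption here, with raw term counts as TF. The function name `tf_idf_table` is hypothetical.

```python
import math
from collections import Counter

def tf_idf_table(docs):
    """docs: doc_id -> list of tokens. Returns (doc_id, term) -> weight."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # n: documents containing each term
    table = {}
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            # weight = tf * log(N / n); the log is an assumed variant.
            table[(doc_id, term)] = tf * math.log(n_docs / doc_freq[term])
    return table

weights = tf_idf_table({
    "d1": ["tropical", "fish", "tank", "fish"],
    "d2": ["fish", "food"],
})
```

A term that occurs in every document, such as "fish" in this toy collection, gets weight zero under this variant because it does not help distinguish documents.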

Inversion and Index Distribution

- Inversion: changing the stream of document-term information coming from the text transformation component into term-document information for the creation of inverted indexes
  - Core of the indexing process
  - The number of documents is large
  - The indexes are updated with new documents from feeds and crawls, and are often compressed for high efficiency
- Index distribution: distributing indexes across multiple computers/sites on a network
  - Document distribution, term distribution, and replication
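Inversion and document distribution can be sketched together on a toy stream of (document, term) pairs. The names `invert` and `partition_by_document`, the integer document ids, and the modulo-based shard assignment are assumptions for this illustration; real systems also handle updates, compression, and replication.

```python
from collections import defaultdict

def invert(doc_term_stream):
    """Turn (doc_id, term) pairs into term -> sorted list of doc ids."""
    postings = defaultdict(set)
    for doc_id, term in doc_term_stream:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def partition_by_document(doc_term_stream, n_shards):
    """Document distribution: each shard indexes a subset of documents."""
    shards = [[] for _ in range(n_shards)]
    for doc_id, term in doc_term_stream:
        shards[doc_id % n_shards].append((doc_id, term))
    return [invert(shard) for shard in shards]

stream = [(1, "fish"), (1, "tank"), (2, "fish"), (3, "food")]
index = invert(stream)
shard_indexes = partition_by_document(stream, 2)
```

With document distribution, a query is sent to every shard and the partial results are merged, whereas term distribution would instead place each term's full posting list on one machine.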

Summary
- A high-level description of search engine software architecture
- The indexing process
  - Building blocks and their functionalities

To-Do List
- Expand the figures of the indexing process to include the detailed functionalities
- Read Chapter 2.1-2.3.3
