Search Engine Architecture 1

The document describes the key components of a search engine architecture. It discusses the indexing process which involves acquiring documents, transforming text, analyzing links and metadata, collecting statistics, and building inverted indexes. It also covers the query process which uses the indexes to return relevant results. The goal is to provide effective yet efficient search by representing documents and queries in a way that supports fast retrieval of relevant information.

Search Engine Architecture I

Software Architecture
• The high-level structure of a software system
  • Software components
  • The interfaces provided by those components
  • The relationships between those components

J. Pei: Information Retrieval and Web Search -- Search Engine Architecture

UIMA
• An architecture to provide a standard for integrating search and related language technology components
• Unstructured Information Management Architecture (www.research.ibm.com/UIMA)
• Defining interfaces for components to simplify the addition of new technologies into systems that handle text and other unstructured data

UIMA

[Figure: UIMA block diagram. Source: http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.architectureHighlights.html/$FILE/blockDiagram.gif]


A Good Reference


Primary Goals of Search Engines


• Effectiveness (quality): to retrieve the most relevant set of documents for a query
  • Process text and store text statistics to improve relevance
• Efficiency (speed): to process queries from users as fast as possible
  • Use specialized data structures
• Specific goals usually fall under these two primary goals
  • Example: handling changing document collections is both an effectiveness issue and an efficiency issue

Two Major Functions


• Search engine components support two major functions
  • The indexing process: building data structures that enable searching
  • The query process: using those data structures to produce a ranked list of documents for a user's query


The Indexing Process


Text Acquisition
• Identifying and making available the documents that will be searched
• How?
  • Crawling or scanning the web, a corporate intranet, or other sources of information
  • Building a document data store containing the text and metadata for all the documents
    • Metadata: document type, document structure, document length, ...


Crawlers
• Identifying and acquiring documents for the search engine
• Web crawler: following links on web pages to discover new pages
  • Efficiency: how to handle the huge volume of new pages and updated pages
• A web crawler restricted to a single site supports site search
• Topic-based/focused crawlers: using classification techniques to restrict the crawl to pages that are likely relevant to a specific topic
  • Used in vertical or topical search
• Enterprise document crawler: following links to discover both internal and external pages
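
The link-following idea behind a web crawler can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: the pages live in a toy in-memory dictionary (`TOY_WEB`), `fetch_links` is a hypothetical stand-in for downloading a page and extracting its links, and the `same_site_only` flag mimics a crawler restricted to a single site.

```python
from collections import deque
from urllib.parse import urlparse

# Toy in-memory "web": URL -> outgoing links (hypothetical data, no network access).
TOY_WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://b.example/news", "http://a.example/about"],
    "http://b.example/news": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting the links it contains."""
    return TOY_WEB.get(url, [])


def crawl(seed, same_site_only=False):
    """Breadth-first crawl from `seed`; returns URLs in the order they are visited."""
    seed_host = urlparse(seed).netloc
    frontier = deque([seed])   # pages discovered but not yet visited
    seen = {seed}              # every URL ever added to the frontier
    visited = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # A site-search crawler ignores links that leave the seed's host.
            if same_site_only and urlparse(link).netloc != seed_host:
                continue
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


full_crawl = crawl("http://a.example/")
site_crawl = crawl("http://a.example/", same_site_only=True)
```

A production crawler would also need politeness delays, revisit scheduling for updated pages, and a distributed frontier to cope with the volume mentioned above; none of that is modeled here.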

Document Feeds
• A mechanism for accessing a real-time stream of documents
• RSS: a common standard used for web feeds for content such as news, blogs, or video
  • An RSS reader subscribes to RSS feeds, and provides new content when it arrives
  • RSS feeds are formatted in XML
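
Because RSS feeds are XML, a reader can be sketched with a standard XML parser. The example below uses Python's `xml.etree.ElementTree`; the feed content in `SAMPLE_FEED` and the helper names `parse_items` and `new_items` are made up for illustration, and a real reader would fetch the feed over HTTP on a schedule.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 feed (hypothetical content).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://news.example/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://news.example/2</link>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return (title, link) pairs for every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


def new_items(feed_xml, seen_links):
    """Return only the items whose link has not been seen before."""
    return [(title, link) for title, link in parse_items(feed_xml)
            if link not in seen_links]


all_items = parse_items(SAMPLE_FEED)
unseen = new_items(SAMPLE_FEED, {"http://news.example/1"})
```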


Conversion
• Converting a variety of formats (e.g., HTML, XML, PDF, ...) into a consistent text and metadata format
• Resolving encoding problems
  • Using ASCII (7 bits) or extended ASCII (8 bits) for English
  • Using Unicode (e.g., encoded as UTF-8 or UTF-16) for international languages
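
As a rough sketch of the conversion step, the code below decodes raw bytes with a given encoding and reduces an HTML page to plain text plus a title field. It relies only on Python's standard `html.parser`; the class name `TextExtractor`, the function `convert`, and the output dictionary layout are assumptions chosen for this example, not a format prescribed by the slides.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> text as metadata and all other text as content."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


def convert(raw_bytes, encoding="utf-8"):
    """Decode bytes with the stated encoding and return text plus metadata."""
    html = raw_bytes.decode(encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "text": " ".join(parser.chunks)}


page = ("<html><head><title>Café menu</title></head>"
        "<body><h1>Drinks</h1><p>Espresso, latte</p></body></html>")
from_utf8 = convert(page.encode("utf-8"), "utf-8")
from_latin1 = convert(page.encode("latin-1"), "latin-1")
```

The same page stored in two different byte encodings converts to the same text once the correct encoding is named, which is the point of resolving encodings before indexing.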


Document Data Stores


• A simple database to manage large numbers of documents and structured data
• Document components are typically stored in a compressed form for efficiency
• Structured data consists of document metadata and other information extracted from the documents, such as links and anchor text
• [Discussion] Why do we need document data stores local to search engines?
  • The original documents are available on the web
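
A document data store that keeps text in compressed form alongside structured metadata might look like the following minimal sketch. The `DocumentStore` class and its method names are hypothetical, the store is an in-memory dictionary rather than a real database, and `zlib` is used simply because it ships with Python.

```python
import zlib


class DocumentStore:
    """In-memory store: compressed document text plus uncompressed metadata."""

    def __init__(self):
        self._text = {}   # doc_id -> zlib-compressed UTF-8 bytes
        self._meta = {}   # doc_id -> dict of metadata (links, anchor text, ...)

    def put(self, doc_id, text, **metadata):
        self._text[doc_id] = zlib.compress(text.encode("utf-8"))
        self._meta[doc_id] = metadata

    def get_text(self, doc_id):
        return zlib.decompress(self._text[doc_id]).decode("utf-8")

    def get_metadata(self, doc_id):
        return dict(self._meta[doc_id])

    def stored_bytes(self, doc_id):
        """Size of the compressed text actually held for this document."""
        return len(self._text[doc_id])


store = DocumentStore()
original = "fishing on the lake " * 100
store.put("d1", original, doc_type="html", links=["http://a.example/about"])
```

Keeping a local copy like this lets later components (snippet generation, re-indexing, link analysis) read the text without fetching it from the web again.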

The Indexing Process


Text Transformation / Index Creation


• Text transformation: transforming documents into index terms or features
  • Index terms: the parts of a document that are stored in the index and used in searching
  • Features: parts of a text document that are used to represent its content
  • Examples: phrases, names, dates, links, ...
  • Index vocabulary: the set of all the terms that are indexed for a document collection
• Index creation: creating the indexes or data structures
  • Example: building inverted list indexes

Parsers, Stopping, and Stemming


• Parsing: processing the sequence of text tokens in the document to recognize structural elements such as titles, figures, links, and headings
  • Tokenization: identifying units to be indexed
  • Using the syntax of markup languages to identify structures
• Stopping: removing common words from the stream of tokens, e.g., the, of, to, ...
  • Reducing index size considerably
• Stemming: grouping words that are derived from a common stem
  • Example: fish, fishes, fishing
  • Increasing the likelihood that words used in queries and documents will match
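
The three steps above can be sketched as a small pipeline. Everything here is deliberately simplified: the stopword list is a tiny illustrative set, and `toy_stem` is a naive suffix stripper written for this example, not the Porter stemmer or any other published algorithm.

```python
import re

# Tiny illustrative stopword list; real lists contain hundreds of words.
STOPWORDS = {"the", "of", "to", "a", "and", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop common words that carry little content."""
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize, stop, and stem a piece of text into index terms."""
    return [toy_stem(t) for t in remove_stopwords(tokenize(text))]


terms = index_terms("The fishes of the lake: fishing to fish")
```

With this input, "fishes", "fishing", and "fish" all map to the same index term, so a query containing any one of them can match documents containing the others.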

Other Text Transformation Tasks


• Link extraction and analysis
  • Links can be indexed separately from the general text content
  • Using link analysis algorithms, e.g., PageRank, to quantify page popularity and find authority pages
  • Using anchor text to enhance the text content of the page that the link points to
• Information extraction: identifying index terms that are more complex than single words
  • Entity identification, e.g., finding names
• Classifiers: identifying class-related metadata for documents or parts of documents
  • Example: finding spam documents and non-content parts of documents (e.g., ads)
  • Alternatively, clustering related documents
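
To make the link analysis bullet concrete, here is a minimal power-iteration sketch of PageRank on a toy link graph. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the four-page graph are all assumptions made for this illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a dict mapping page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked to by A, B, and D.
TOY_LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(TOY_LINKS)
top_page = max(ranks, key=ranks.get)
```

In this graph the page with the most incoming links ends up with the highest score, and a page nobody links to keeps only the random-jump share.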

The Indexing Process


Collecting Document Statistics


• Gathering and recording statistical information about words, features, and documents
  • Statistics will be used to compute scores of documents
  • Stored in lookup tables
• Examples
  • Counts of index term occurrences
  • Positions in the documents where the index terms occurred
  • Counts of occurrences over groups of documents
  • Lengths of documents in terms of the number of tokens
• The actual data collected depends on the retrieval model and the associated ranking method
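
The example statistics listed above can be gathered in a single pass over tokenized documents. This is a sketch using plain dictionaries as the lookup tables; the function name `collect_statistics`, the table layout, and the two toy documents are assumptions for illustration.

```python
from collections import Counter, defaultdict


def collect_statistics(docs):
    """docs maps doc_id -> list of tokens; returns four lookup tables."""
    term_counts = {}                                     # doc_id -> Counter of terms
    positions = defaultdict(lambda: defaultdict(list))   # term -> doc_id -> positions
    doc_freq = Counter()                                 # term -> number of docs containing it
    doc_len = {}                                         # doc_id -> number of tokens
    for doc_id, tokens in docs.items():
        term_counts[doc_id] = Counter(tokens)
        doc_len[doc_id] = len(tokens)
        for pos, term in enumerate(tokens):
            positions[term][doc_id].append(pos)
        doc_freq.update(set(tokens))                     # count each term once per document
    return term_counts, positions, doc_freq, doc_len


TOY_DOCS = {1: ["fish", "lake", "fish"], 2: ["lake", "boat"]}
term_counts, positions, doc_freq, doc_len = collect_statistics(TOY_DOCS)
```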

Weighting
• Calculating index term weights using the document statistics and storing them in lookup tables
  • Pre-computation can improve query answering efficiency
• TF/IDF weighting
  • TF (term frequency): the frequency of index term occurrences in a document
  • IDF (inverse document frequency): the inverse of the frequency of index term occurrences across all documents, based on N/n and commonly computed as log(N/n) (N: # documents indexed, n: # documents containing a particular term)
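
One way to turn these definitions into numbers is sketched below. It assumes TF is the raw count of a term in a document and IDF is the natural logarithm of N/n; taking the logarithm is a widespread convention and an assumption of this example, as are the function name `tf_idf` and the toy documents.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps doc_id -> list of tokens; returns doc_id -> {term: weight}."""
    n_docs = len(docs)                       # N: number of documents indexed
    doc_freq = Counter()                     # n: number of documents containing each term
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)                 # raw term frequency within this document
        weights[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
    return weights


TOY_DOCS = {1: ["fish", "lake", "fish"], 2: ["lake", "boat"]}
weights = tf_idf(TOY_DOCS)
```

A term that occurs in every document, such as "lake" here, gets weight 0, while a term that is frequent in one document and rare elsewhere, such as "fish", gets the largest weight. Computing the table once at indexing time is the pre-computation mentioned above.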

Inversion and Index Distribution


• Inversion: changing the stream of document-term information coming from the text transformation component into term-document information for the creation of inverted indexes
  • Core of the indexing process
  • The number of documents is large
  • The indexes are updated with new documents from feeds and crawls, and are often compressed for high efficiency
• Index distribution: distributing indexes across multiple computers/sites on a network
  • Document distribution, term distribution, and replication
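
Inversion and document distribution can be sketched in a few lines. The `invert` function turns a stream of (document, terms) pairs into term-to-postings lists, and `partition_by_document` builds one small index per shard by assigning documents with `doc_id % num_shards`. Both function names, the posting format of (doc_id, count), and the modulo assignment rule are assumptions made for this example; term distribution and replication are not shown.

```python
from collections import defaultdict


def invert(doc_terms):
    """Turn (doc_id, [terms]) pairs into term -> sorted list of (doc_id, count)."""
    postings = defaultdict(dict)
    for doc_id, terms in doc_terms:
        for term in terms:
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    return {term: sorted(by_doc.items()) for term, by_doc in sorted(postings.items())}


def partition_by_document(doc_terms, num_shards):
    """Document distribution: each shard indexes its own subset of documents."""
    shards = [[] for _ in range(num_shards)]
    for doc_id, terms in doc_terms:
        shards[doc_id % num_shards].append((doc_id, terms))
    return [invert(shard) for shard in shards]


# Document-term stream as it might arrive from text transformation (toy data).
STREAM = [(1, ["fish", "lake", "fish"]), (2, ["lake", "boat"])]
index = invert(STREAM)
shard_indexes = partition_by_document(STREAM, num_shards=2)
```

With document distribution, a query is sent to every shard and the partial results are merged, since any shard may hold matching documents.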

Summary
• A high-level description of search engine software architecture
• The indexing process
• Building blocks and their functionalities


To-Do List
• Expand the figures of the indexing process to include the detailed functionalities
• Read Chapter 2.1-2.3.3
