Search Engine Architecture I
Software Architecture
- The high-level structure of a software system
- Software components
- The interfaces provided by those components
- The relationships between those components
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture
UIMA
- An architecture that provides a standard for integrating search and related language technology components
- Unstructured Information Management Architecture (www.research.ibm.com/UIMA)
- Defines interfaces for components to simplify the addition of new technologies into systems that handle text and other unstructured data
UIMA
http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.architectureHighlights.html/$FILE/blockDiagram.gif
A Good Reference
Primary Goals of Search Engines
- Effectiveness (quality): retrieve the most relevant set of documents for a query
  - Process text and store text statistics to improve relevance
- Efficiency (speed): process queries from users as fast as possible
  - Use specialized data structures
- Specific goals usually fall under these two primary goals
  - Example: handling changing document collections is both an effectiveness issue and an efficiency issue
Two Major Functions
- Search engine components support two major functions
  - The indexing process: building data structures that enable searching
  - The query process: using those data structures to produce a ranked list of documents for a user's query
The Indexing Process
Text Acquisition
- Identifying and making available the documents that will be searched
- How?
  - Crawling or scanning the web, a corporate intranet, or other sources of information
  - Building a document data store containing the text and metadata for all the documents
    - Metadata: document type, document structure, document length, etc.
Crawlers
- Identifying and acquiring documents for the search engine
- Web crawler: following links on web pages to discover new pages
  - Efficiency: how to handle the huge volume of new pages and updated pages
- A web crawler restricted to a single site supports site search
- Topic-based/focused crawlers: using classification techniques to restrict the crawl to pages that are likely relevant to a specific topic
  - Used in vertical or topical search
- Enterprise document crawler: following links to discover both internal and external pages
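The link-following behaviour of a web crawler can be sketched as a breadth-first traversal. The snippet below is only an illustration under stated assumptions: `PAGES` is a hypothetical in-memory stand-in for pages that a real crawler would fetch over HTTP, and `LinkParser`, `crawl`, and the `same_site_only` flag are names introduced here, not part of the lecture.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "http://a.example/": '<a href="/about">About</a> <a href="http://b.example/">B</a>',
    "http://a.example/about": '<a href="/">Home</a>',
    "http://b.example/": '<a href="http://a.example/about">A about</a>',
}


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, same_site_only=False):
    """Breadth-first crawl from seed; returns URLs in the order they were visited."""
    site = urlparse(seed).netloc
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:  # page not available in the toy web
            continue
        order.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if same_site_only and urlparse(target).netloc != site:
                continue  # site search: stay on the seed's host
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order
```

Calling `crawl("http://a.example/", same_site_only=True)` mimics a crawler restricted to a single site; a focused crawler would replace the host check with a topic classifier.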
Document Feeds
- A mechanism for accessing a real-time stream of documents
- RSS: a common standard used for web feeds for content such as news, blogs, or video
  - An RSS reader subscribes to RSS feeds and provides new content when it arrives
  - RSS feeds are formatted in XML
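Because RSS feeds are XML, a feed reader can be sketched with a standard XML parser. The example below is a minimal sketch: `FEED_XML` is a made-up feed, and `new_items` is a hypothetical helper that reports only items whose link has not been seen before, which approximates "providing new content when it arrives".

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; a real reader would download it periodically.
FEED_XML = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title><link>http://news.example/1</link></item>
    <item><title>Second story</title><link>http://news.example/2</link></item>
  </channel>
</rss>"""


def new_items(feed_xml, seen_links):
    """Return (title, link) pairs for items whose link is not in seen_links.

    seen_links is updated in place so repeated polls report each item once.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if link and link not in seen_links:
            seen_links.add(link)
            fresh.append((title, link))
    return fresh
```

Polling the same feed twice with the same `seen_links` set returns the items on the first call and an empty list on the second.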
Conversion
- Converting a variety of formats (e.g., HTML, XML, PDF) into a consistent text and metadata format
- Resolving encoding problems
  - Using ASCII (7 bits) or extended ASCII (8 bits) for English
  - Using Unicode (encoded as UTF-8, UTF-16, or UTF-32) for international languages
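The conversion step can be illustrated for one input format. The sketch below assumes the document arrives as HTML bytes in a known encoding and produces a uniform record with `title` and `text` fields; `TextExtractor`, `convert`, and the record layout are illustrative choices, not a prescribed format.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the title and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title_parts, self.body_parts = [], []
        self._in_title = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style content
        (self.title_parts if self._in_title else self.body_parts).append(data)


def convert(raw_bytes, encoding="utf-8"):
    """Decode bytes with the given encoding and return a {'title', 'text'} record."""
    html = raw_bytes.decode(encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join("".join(parser.title_parts).split()),
        "text": " ".join(" ".join(parser.body_parts).split()),
    }
```

Decoding to Python's Unicode strings before any further processing means later components do not need to know which encoding the source used.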
Document Data Stores
- A simple database to manage large numbers of documents and structured data
- Document components are typically stored in compressed form for efficiency
- Structured data consists of document metadata and other information extracted from the documents, such as links and anchor text
- [Discussion] Why do we need document data stores local to search engines?
  - The original documents are available on the web
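A document data store with compressed text and separate structured data might look like the following toy sketch. It keeps everything in memory and uses `zlib` for compression; the `DocumentStore` class and its method names are hypothetical, chosen only to make the idea concrete.

```python
import zlib


class DocumentStore:
    """Toy in-memory store: compressed text plus metadata, keyed by document id."""

    def __init__(self):
        self._text = {}  # doc_id -> zlib-compressed UTF-8 bytes
        self._meta = {}  # doc_id -> dict of structured data

    def put(self, doc_id, text, **metadata):
        self._text[doc_id] = zlib.compress(text.encode("utf-8"))
        # Record the length in whitespace-separated tokens alongside caller metadata.
        self._meta[doc_id] = dict(metadata, length=len(text.split()))

    def get_text(self, doc_id):
        return zlib.decompress(self._text[doc_id]).decode("utf-8")

    def get_meta(self, doc_id):
        return self._meta[doc_id]

    def stored_bytes(self, doc_id):
        """Size of the compressed text, useful for checking the space saving."""
        return len(self._text[doc_id])
```

A local store like this lets the engine return document text (for example, for result snippets) without fetching the original page again.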
The Indexing Process
Text Transformation / Index Creation
- Text transformation: transforming documents into index terms or features
  - Index terms: the parts of a document that are stored in the index and used in searching
  - Features: parts of a text document that are used to represent its content
    - Examples: phrases, names, dates, links, etc.
  - Index vocabulary: the set of all the terms that are indexed for a document collection
- Index creation: creating the indexes or data structures
  - Example: building inverted list indexes
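As a concrete picture of an inverted list index, the sketch below maps each index term to the list of documents that contain it. It assumes whitespace tokenization and lowercasing as the only text transformation; `build_inverted_index` and `search_all` are hypothetical helper names.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: doc_id -> text. Returns term -> sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, query):
    """Return the ids of documents that contain every query term (Boolean AND)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


docs = {1: "tropical fish tank", 2: "fish market", 3: "tank battle"}
index = build_inverted_index(docs)
vocabulary = sorted(index)  # the index vocabulary of this tiny collection
```

Here `index["fish"]` lists documents 1 and 2, so a lookup touches only the lists of the query terms rather than scanning every document.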
Parsers, Stopping, and Stemming
- Parser: processing the sequence of text tokens in the document to recognize structural elements such as titles, figures, links, and headings
  - Tokenization: identifying units to be indexed
  - Using the syntax of markup languages to identify structures
- Stopping: removing common words from the stream of tokens, e.g., "the", "of", "to"
  - Reduces index size considerably
- Stemming: grouping words that are derived from a common stem
  - Example: fish, fishes, fishing
  - Increases the likelihood that words used in queries and documents will match
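The three steps can be chained into a small pipeline. This is a deliberately crude sketch: the stopword list is a tiny illustrative sample, and the suffix-stripping rules in `stem` are an assumption made here for brevity, not an established stemmer such as Porter's.

```python
import re

STOPWORDS = {"the", "of", "to", "a", "and", "in"}  # tiny illustrative list
SUFFIXES = ("ing", "es", "s")  # crude rules, tried in this order


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize, drop stopwords, and stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]
```

With these rules "fish", "fishes", and "fishing" all map to "fish", so a query using one form matches documents using the others.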
Other Text Transformation Tasks
- Link extraction and analysis
  - Links can be indexed separately from the general text content
  - Using link analysis algorithms, e.g., PageRank, to quantify page popularity and find authority pages
  - Using anchor text to enhance the text content of the page that the link points to
- Information extraction: identifying index terms that are more complex than single words
  - Entity identification, e.g., finding names
- Classifiers: identifying class-related metadata for documents or parts of documents
  - Example: finding spam documents and non-content parts of documents (e.g., ads)
  - Alternatively, clustering related documents
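To make the link analysis item concrete, here is a minimal power-iteration sketch of PageRank over a small link graph. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without outgoing links are common choices assumed here, not details given in the lecture.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. links: page -> list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same base share from random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank
```

For the graph `{"a": ["c"], "b": ["c"], "c": ["a"]}`, page "c" has two incoming links and ends up with the highest score, while "b", which nothing links to, gets the lowest.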
The Indexing Process
Collecting Document Statistics
- Gathering and recording statistical information about words, features, and documents
  - Statistics will be used to compute scores of documents
  - Stored in lookup tables
- Examples
  - Counts of index term occurrences
  - Positions in the documents where the index terms occurred
  - Counts of occurrences over groups of documents
  - Lengths of documents in terms of the number of tokens
- The actual data depends on the retrieval model and the associated ranking method
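The example statistics listed above can be gathered in a single pass over tokenized documents. The sketch below stores them in plain dictionaries acting as lookup tables; the function name `collect_statistics` and the table layout are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def collect_statistics(docs):
    """docs: doc_id -> list of tokens. Returns four lookup tables of statistics."""
    term_counts = {}               # doc_id -> Counter(term -> occurrences in that doc)
    positions = defaultdict(dict)  # term -> {doc_id: [token positions]}
    doc_lengths = {}               # doc_id -> number of tokens
    doc_freq = Counter()           # term -> number of documents containing it
    for doc_id, tokens in docs.items():
        term_counts[doc_id] = Counter(tokens)
        doc_lengths[doc_id] = len(tokens)
        for pos, term in enumerate(tokens):
            positions[term].setdefault(doc_id, []).append(pos)
        doc_freq.update(set(tokens))  # count each document once per term
    return term_counts, positions, doc_lengths, doc_freq
```

Which of these tables a real system keeps depends on its ranking method: positions matter for phrase and proximity matching, while document lengths matter for length normalization.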
Weighting
- Calculating index term weights using the document statistics and storing them in lookup tables
  - Pre-computation can improve query answering efficiency
- TF/IDF weighting
  - TF (term frequency): the frequency of index term occurrences in a document
  - IDF (inverse document frequency): the inverse of the fraction of indexed documents that contain the term, N/n (N: number of documents indexed, n: number of documents containing the term); commonly used in log-scaled form, log(N/n)
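A small worked example of TF/IDF weighting follows. It assumes the weight is the raw term count multiplied by the log-scaled IDF, log(N/n); this is one common variant among several, and the function name `tf_idf_weights` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """docs: doc_id -> list of tokens. Returns doc_id -> {term: tf * log(N / n)}."""
    N = len(docs)   # number of documents indexed
    n = Counter()   # term -> number of documents containing the term
    for tokens in docs.values():
        n.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(N / n[t]) for t in tf}
    return weights
```

With three documents, a term that occurs twice in one document and appears in two documents overall gets weight 2 * log(3/2), while a term that appears in every document gets weight 0 because it does not help distinguish documents.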
Inversion and Index Distribution
- Inversion: changing the stream of document-term information coming from the text transformation component into term-document information for the creation of inverted indexes
  - Core of the indexing process
  - The number of documents is large
  - The indexes are updated with new documents from feeds and crawls, and are often compressed for high efficiency
- Index distribution: distributing indexes across multiple computers/sites on a network
  - Document distribution, term distribution, and replication
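Inversion and the two distribution schemes can be sketched on a stream of (document, term) pairs. The code below assumes integer document ids and uses simple round-robin assignment to shards; `invert`, `partition_by_document`, and `partition_by_term` are illustrative names, and real systems use far more elaborate partitioning and also replicate shards.

```python
from itertools import groupby


def invert(doc_term_pairs):
    """Turn (doc_id, term) pairs into {term: sorted list of doc_ids}."""
    # Swap to (term, doc_id), drop duplicates, and sort so equal terms are adjacent.
    term_doc = sorted({(term, doc_id) for doc_id, term in doc_term_pairs})
    return {
        term: [doc_id for _, doc_id in group]
        for term, group in groupby(term_doc, key=lambda pair: pair[0])
    }


def partition_by_document(doc_term_pairs, num_shards):
    """Document distribution: each shard indexes its own subset of documents."""
    shards = [[] for _ in range(num_shards)]
    for doc_id, term in doc_term_pairs:
        shards[doc_id % num_shards].append((doc_id, term))
    return [invert(pairs) for pairs in shards]


def partition_by_term(index, num_shards):
    """Term distribution: each shard holds the complete lists for a subset of terms."""
    shards = [{} for _ in range(num_shards)]
    for i, term in enumerate(sorted(index)):
        shards[i % num_shards][term] = index[term]
    return shards
```

Under document distribution every shard must be consulted for each query, whereas under term distribution only the shards that hold the query's terms are involved.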
Summary
- A high-level description of search engine software architecture
- The indexing process
- Building blocks and their functionalities
To-Do List
- Expand the figures of the indexing process to include the detailed functionalities
- Read Chapter 2.1-2.3.3