Web Search Engine

The document describes the key components of a web search engine including crawling websites to index their content, ranking pages based on relevance to a user's query, and suggesting similar search terms. It also outlines several algorithms and data structures implemented in the search engine like edit distance for spelling suggestions, ternary search trees for autocomplete, and inverted indexes for fast information retrieval. The search engine crawls websites like Wikipedia, Python.org, and Oracle.org using Httrack to build a local dataset for development and testing.

Uploaded by

Vivek Chopra

Web Search Engine

Advanced Computing Concepts


University of Windsor

Guided By:
SCOUT Luis Rueda
What is a Search Engine?
• SEARCH:
(Computing) To examine a computer file, disk,
database, or network for particular information.

• ENGINE:
Something that supplies the driving force or
energy to a movement, system, or trend.

• SEARCH ENGINE:
A computer program that searches for
particular keywords and returns a list of
documents in which they were found, especially a
commercial service that scans documents on the
internet.
How a Search Engine Works
• Crawling:
  • Follow links to find information
• Indexing:
  • Record what words appear where
• Ranking:
  • What information is a good match to the user query?
  • What information is inherently good?
• Suggestions:
  • Suggest similar words
• Serving:
  • Handle queries, find pages, display results
Features Implemented
• Spell checker – Checks the spelling of the string entered and finds the relevant string.
• Page Ranking – Calculates the score of each page by counting the occurrences of the word, then ranks the pages by that score.
• Word Suggestion – Uses the spell checker together with TST/edit distance to suggest similar and related words.
• Pattern Matching – Used to create a dictionary of words from the crawled web pages.
• Inverted Index – An index data structure that stores a mapping from content, such as words or numbers, to its locations in a document or a set of documents.
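The inverted index and the occurrence-based page score can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the page names, and the page contents are all hypothetical, and splitting text on whitespace is an assumed simplification.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Assumption: words are separated by whitespace and compared in lower case.
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def occurrence_scores(index, word):
    """Score each page by the number of times the word occurs in it."""
    postings = index.get(word.lower(), {})
    return {doc_id: len(positions) for doc_id, positions in postings.items()}


# Hypothetical page names and contents, for illustration only.
pages = {
    "page_a.html": "python is a programming language and python is popular",
    "page_b.html": "oracle provides a database and java",
    "page_c.html": "python can connect to a database",
}
index = build_inverted_index(pages)
scores = occurrence_scores(index, "python")
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because the index is keyed by word, answering a query only touches the pages that contain the word instead of scanning every page.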
EDIT DISTANCE
• Edit distance is a way of quantifying how dissimilar two strings are by counting the number of steps it takes to turn one into the other, where a step is defined as a single-character change.
• Example: The words 'computer' and 'commuter' are very similar, and a change of just one letter, p -> m, will change the first word into the second.
• In our search engine, we have used edit distance in word suggestion, i.e. suggesting nearby related words.
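A minimal sketch of edit distance used for word suggestion. It assumes the Levenshtein variant, where a single-character change is an insertion, a deletion, or a substitution; the slides do not say which variant the project uses. The `suggest` helper, its `max_distance` threshold, and the sample dictionary are hypothetical.

```python
def edit_distance(source, target):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(target) + 1))  # distance from "" to target[:j]
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to ""
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, candidate), candidate)
                    for candidate in dictionary)
    return [candidate for distance, candidate in scored
            if distance <= max_distance]


# Hypothetical dictionary, for illustration only.
words = ["computer", "commuter", "search", "engine"]
print(edit_distance("computer", "commuter"))  # 1, the p -> m substitution
print(suggest("comuter", words))
```

Keeping only two rows of the table at a time reduces memory from O(m·n) to O(n) while giving the same result.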
TST
• A ternary search tree (TST) is a special trie data structure in which the child nodes of a standard trie are ordered as a binary search tree.
• Ternary search trees are efficient for applications like spell checking and autocompletion.
• Autocomplete, or word completion, is a feature in which an application predicts the rest of a word a user is typing.
• Spell checking is a feature that checks for misspellings in a text.
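The following sketch shows one way a ternary search tree can support both spell checking (exact lookup) and autocomplete (prefix lookup). The class and method names are hypothetical and the word list is made up; this is not the project's implementation.

```python
class Node:
    """One character, with smaller (left), next (middle), larger (right) links."""

    def __init__(self, char):
        self.char = char
        self.left = self.middle = self.right = None
        self.is_word = False  # True if a stored word ends at this node


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word):
        if word:
            self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        char = word[i]
        if node is None:
            node = Node(char)
        if char < node.char:
            node.left = self._insert(node.left, word, i)
        elif char > node.char:
            node.right = self._insert(node.right, word, i)
        elif i + 1 < len(word):
            node.middle = self._insert(node.middle, word, i + 1)
        else:
            node.is_word = True
        return node

    def _find(self, prefix):
        """Return the node matching the last character of prefix, or None."""
        node, i = self.root, 0
        while node is not None:
            if prefix[i] < node.char:
                node = node.left
            elif prefix[i] > node.char:
                node = node.right
            else:
                i += 1
                if i == len(prefix):
                    return node
                node = node.middle
        return None

    def contains(self, word):
        """Spell-check style lookup: is this exact word stored?"""
        node = self._find(word) if word else None
        return node is not None and node.is_word

    def autocomplete(self, prefix):
        """Return all stored words starting with prefix, in sorted order."""
        results = []
        node = self._find(prefix) if prefix else None
        if node is None:
            return results
        if node.is_word:
            results.append(prefix)
        self._collect(node.middle, prefix, results)
        return results

    def _collect(self, node, prefix, results):
        if node is None:
            return
        self._collect(node.left, prefix, results)
        if node.is_word:
            results.append(prefix + node.char)
        self._collect(node.middle, prefix + node.char, results)
        self._collect(node.right, prefix, results)


# Hypothetical word list, for illustration only.
tree = TernarySearchTree()
for word in ["search", "seat", "sea", "engine", "select", "python"]:
    tree.insert(word)
completions = tree.autocomplete("se")
```

Each node stores a single character and three links, so a TST typically uses less memory than a trie that keeps one child slot per alphabet letter.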
QUICK SELECT
• Quickselect is a selection algorithm to find the k-th smallest element in an unordered list.
• It is used in page ranking: the score of each page is calculated from the occurrences of the word, and the pages are then ranked by that score.
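A minimal Quickselect sketch, assuming a random pivot and Lomuto partitioning. The slides do not show how the project applies it to page scores, so the score list and the "k-th highest score" usage below are hypothetical.

```python
import random


def partition(items, low, high, pivot_index):
    """Move the pivot to its sorted position within items[low..high]."""
    pivot = items[pivot_index]
    items[pivot_index], items[high] = items[high], items[pivot_index]
    store = low
    for i in range(low, high):
        if items[i] < pivot:
            items[i], items[store] = items[store], items[i]
            store += 1
    items[store], items[high] = items[high], items[store]
    return store


def quickselect(values, k):
    """Return the k-th smallest element (k is 1-based) of an unordered list."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    items = list(values)  # work on a copy so the caller's list is unchanged
    low, high, target = 0, len(items) - 1, k - 1
    while low < high:
        pivot_index = partition(items, low, high, random.randint(low, high))
        if target == pivot_index:
            break
        if target < pivot_index:
            high = pivot_index - 1
        else:
            low = pivot_index + 1
    return items[target]


# Hypothetical occurrence scores of five pages, for illustration only.
page_scores = [7, 2, 9, 4, 1]
# The 2nd highest score is the (n - 2 + 1)-th smallest.
second_best = quickselect(page_scores, len(page_scores) - 2 + 1)
```

Quickselect only recurses into the side of the partition that contains the target, which gives linear time on average instead of the O(n log n) cost of a full sort.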
REGEX
• A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.
• In our search engine, regex is used to retrieve words from HTML files to create the word dictionary.
• Regex is also used in spell checking, where it checks whether the entered word contains any special characters.
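Both regex uses can be sketched with Python's `re` module. The patterns below are assumptions chosen for illustration (the slides do not give the actual expressions), and stripping tags with a regex is a simplification that does not handle scripts, styles, or comments.

```python
import re

# Assumed patterns, not taken from the original project.
TAG_PATTERN = re.compile(r"<[^>]+>")                # an HTML tag such as <p> or </h1>
WORD_PATTERN = re.compile(r"[A-Za-z]+")             # a run of letters
SPECIAL_CHAR_PATTERN = re.compile(r"[^A-Za-z0-9 ]")  # anything but letters, digits, spaces


def extract_words(html):
    """Strip tags, then return the sorted set of lower-cased words."""
    text = TAG_PATTERN.sub(" ", html)
    return sorted({word.lower() for word in WORD_PATTERN.findall(text)})


def has_special_character(query):
    """Return True if the query contains a character outside the allowed set."""
    return SPECIAL_CHAR_PATTERN.search(query) is not None


# Hypothetical HTML snippet, for illustration only.
html = "<html><body><h1>Search Engine</h1><p>A search engine indexes pages.</p></body></html>"
dictionary = extract_words(html)
```

Replacing each tag with a space rather than an empty string keeps words in adjacent elements from being joined together.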
HTTRACK
• HTTrack allows you to download a World Wide Web site from the Internet to a local directory, recursively building all directories and getting HTML, images, and other files from the server to your computer.
• The websites that we downloaded are:
  • Wikipedia
  • Python.org
  • Oracle.org