Why Engine?
Finding key information
from gigantic World Wide
Web is similar to find a
needle lost in haystack.
For this purpose we
would use a special
magnet that would
automatically, quickly
and effortlessly attract
that needle for us.
In this scenario magnet
is “Search Engine”
“Even a blind squirrel finds a nut ,
occasionally.” But few of us are
determined enough to search through
millions, or billions, of pages of
information to find our “nut.” So, to
reduce the problem to a, more or less,
manageable solution, web “search
engines” were introduced a few years
ago.
Search Engine
A software program that searches a
database and gathers and reports
information that contains or is related to
specified terms.
OR
A website whose primary function is
providing a search for gathering and
reporting information available on the
Internet or a portion of the Internet.
Eight reasonably well-known
Web search engines are : -
Top 10 Search Providers by Searches, August 2007
Provider Searches Share of
(000) Total
4,199,495 Searches
53.6
1,561,903 (%)
19.9
1,011,398 12.9
435,088 5.6
136,853 1.7
71,724 0.9
37,762 0.5
34,699 0.4
32,483 0.4
31,912 0.4
Other 275,812 3.5
All Search 7,829,129 100.0
Source: Nielsen//NetRatings, 2007
Search Engine History
1990 - The first search engine Archie was released .
There was no World Wide Web at the time.
Data resided on defense contractor , university,
and government computers, and techies were the only
people accessing the data.
The computers were interconnected by Telenet .
File Transfer Protocol (FTP) used for
transferring files from computer to computer.
There was no such thing as a browser.
Files were transferred in their native format and
viewed using the associated file type software.
Archie searched FTP servers and indexed their
files into a searchable directory.
1991 - Gopherspace came into existence with the
advent of Gopher.
Gopher cataloged FTP sites, and the
resulting catalog became known as Gopherspace .
1994 - WebCrawler, a new type of search engine
that indexed the entire content of a web page ,
was introduced.
Telenet / FTP passed information among
the new web browsers accessing not FTP sites but
WWW sites.
Webmasters and web site owners begin
submitting sites for inclusion in the growing
number of web directories.
1995 -Meta tags in the web page were first utilized
by some search engines to determine relevancy.
1997 - Search engine rank-checking software was
introduced. It provides an automated tool to
determine web site position and ranking within the
major search engines.
1998 - Search engine algorithms begin incorporating
esoteric information in their ranking algorithms.
E.g. Inclusion of the number of links to a web
site to determine its “link popularity.” Another
ranking approach was to determine the number of
clicks (visitors) to a web site based upon keyword
and phrase relevancy.
2000 - Marketers determined that pay-per click
campaigns were an easy yet expensive approach to
gaining top search rankings. To elevate sites in the
search engine rankings web sites started adding useful
and relevant content while optimizing their web pages
for each specific search engine.
Stages in information retrieval
Finding documents: It is potentially needed to find
interesting documents on the Web consists of millions of
documents, distributed over tens of thousands of
servers.
Formulating queries: It needed to express exactly
what kind of information is to retrieve.
Determining relevance: The system must determine
whether a document contains the required information
or not.
Types of Search Engine
On the basis of working, Search engine is
categories in following group :-
Crawler-Based Search Engines
Directories
Hybrid Search Engines
Meta Search Engines
Crawler-Based Search Engines
•It uses automated software programs to survey and categories
web pages , which is known as ‘spiders’, ‘crawlers’, ‘robots’ or
‘bots’.
•A spider will find a web page, download it and analyses the
information presented on the web page. The web page will
then be added to the search engine’s database.
•When a user performs a search, the search engine will check
its database of web pages for the key words the user searched.
•The results (list of suggested links to go to), are listed on pages
by order of which is ‘closest’ (as defined by the ‘bots).
Examples of crawler-based search engines are:
•Google (www.google.com)
•Ask Jeeves (www.ask.com)
Robot Algorithm
All robots use the following algorithm for retrieving
documents from the Web:
1.The algorithm uses a list of known URLs. This list
contains at least one URL to start with.
2.A URL is taken from the list, and the corresponding
document is retrieved from the Web.
3.The document is parsed to retrieve information for the
index database and to extract the embedded links to
other documents.
4.The URLs of the links found in the document are added
to the list of known URLs.
5.If the list is empty or some limit is exceeded (number of
documents retrieved, size of the index database, time
elapsed since startup, etc.) the algorithm stops.
otherwise the algorithm continues at step 2.
Crawler program treated World Wide Web as
big graph having pages as nodes And the
hyperlinks as arcs.
Crawler works with a simple goal: indexing all
the keywords in web pages’ titles.
Three data structure is needed for crawler or
robot algorithm
•A large linear array , url_table
•Heap
•Hash table
Url_table :
•It is a large linear array that contains
millions of entries
•Each entry contains two pointers –
•Pointer to URL
•Pointer to title
These are variable length strings and
kept on heap
Heap
It is a large unstructured chunk of virtual
memory to which strings can be
appended.
Hash table :
•It is third data structure of size ‘n’ entries.
•Any URL can be run through a hash
function to produce a nonnegative
integer less than ‘n’.
•All URL that hash to the value ‘k’ are
hooked together on a linked list starting
at the entry ‘k’ of the hash table.
•Every entry into url_table is also entered
into hash table
•The main use of hash table is to start with a
URL and be able to quickly determine
whether it is already present in url_table.
Data structure for crawler
Pointers Pointers
to URL to title Hash Overflow
chains
Code
2
String storage 0
URL 4
1 1
Title 5 6
9
2 2 4
URL 1 4
U 3
Title
Heap
n
Url_table Hash table
Building the index requires two phases :
Searching (URL proceesing )
Indexing.
The heart of the search engine is a recursive
procedure procees_url, which takes a URL
string as input.
Searching is done by procedure,
procees_url as follows :-
•It hashes the URL to see if it is already present in
url_table. If so, it is done and returns immediately.
•If the URL is not already known, its page is
fetched.
•The URL and title are then copied to the heap and
pointers to these two strings are entered in
url_table.
•The URL is also entered into the hash table.
•Finally, process_url extracts all the hyperlinks
from the page and calls process_url once per
hyperlink, passing the hyperlink’s URL as the
input parameter
This design is simple and theoretically correct,
but it has a serious problem
•Depth-first search is used which may cause
recursion .
•Path-length is not pridictable it may be thousands
of hyperlinks long which cause memory problem
such as ,stack overflow .
Solution
•Processed URL s are removed from the list
and Breadth-first search is used to limit path-
length
•To avoid memory problem pointed pages are
not traced in same order as they obtained.
keyword Indexing
The stop list prevents indexing
of prepositions, conjunctions,
articles, and other words with
For each entry in url_table, indexing procedure will
many hits and little value.
examine the title and selects out all words not on
the stop list.
Each selected word is written on to a file with a line
consisting of the word followed by the current
url_table entry number.
when the whole table has been scanned , the file is
shorted by word.
Formulating Queries
Keyword submission cause a POST request to be
done to a CGI script on the machine where the index
is located.
The CGI script then looks up the keyword in the
index to find the set of URl_table indices for each
keyword .
if the user wants the Boolean and of the keywords
the set intersection is computed.
If the Boolean or is desired the set union is
computed.
The script now indexes into url_table to find all the
titles and urls. These are then combined to form a web
page and sent back to user as the response of the
POST .
Determining Relevance
Classic algorithm "TF / IDF“ is used for
determining relevance.
It is a weight often used in information retrieval
and text mining.This weight is a statistical measure
used to evaluate how important a word is to a
document in a collection
A high weight in tf–idf is reached by a high term
frequency (in the given document) and a low
document frequency of the term in the whole
collection of documents
Term Frequency
•The “term frequency” in the given document is
simply the number of times a given term appears in
that document.
•It gives a measure of the importance of the term ti
within the particular document.
Term Frequency,
where ni is the number of occurrences of the
considered term, and the denominator is the
number of occurrences of all terms.
Term Frequency
The term frequency (TF) is the number of times the
word appears in a document divided by the number
of total words in the document.
For Example ,
If a document contains 100 total words
and the word computer appears 3 times, then the
term frequency of the word computer in the
document is 0.03 (3/100)
Inverse Document Frequency
The “inverse document frequency ”is a measure of
the general importance of the term (obtained by
dividing the number of all documents by the
number of documents containing the term, and then
taking the logarithm of that quotient).
Where,
•| D | : total number of documents in the corpus
• : number of documents where the
term ti appears (that is ).
Inverse Document Frequency
There are many different formulas used to calculate tf–
idf.
One way of calculating “document frequency”
(DF) is to determine how many documents contain the
word and divide it by the total number of documents in
the collection.
For Example ,If the word computer appears in 1,000
documents out of a total of 10,000,000 then the
document frequency is 0.0001 (1000/10,000,000).
Alternatives to this formula are to take
the log of the document frequency. The natural
logarithm is commonly used. In this example we
would have
idf = ln(1,000 / 10,000,000) =1/ 9.21
Inverse Document Frequency
The final tf-idf score is then calculated by
dividing the “term frequency” by the
“document frequency”.
For our example, the tf-idf score for
computer in the collection would be :
• tf-idf = 0.03/0.0001= 300 , by using first
formula of idf.
• If alternate formula used we would have
tf-idf = 0.03 * 9.21 = 0.27.
Directories
•A ‘directory’ uses human editors who decide
what category the site belongs to.
•They place websites within specific
categories or subcategories in the ‘directories’
database.
•By focusing on particular categories and
subcategories, user can narrow the search to
those records that are most likely to be
relevant to his/her interests.
Directories
•The human editors comprehensively check the
website and rank it, based on the information
they find, using a pre-defined set of rules.
There are two major directories :
Yahoo Directory (www.yahoo.com)
Open Directory (www.dmoz.org)
Hybrid Search Engines
Hybrid search engines use a combination of
both crawler-based results and directory
results.
Examples of hybrid search engines are:
Yahoo (www.yahoo.com)
Google (www.google.com)
Meta Search Engines
•Also known as Multiple Search Engines or
Metacrawlers.
•Meta search engines query several other Web
search engine databases in parallel and then
combine the results in one list.
Examples of Meta search engines include:
Metacrawler (www.metacrawler.com)
Dogpile (www.dogpile.com)
Pros and Cons of Meta Search Engines
Pros :-
Easy to use
Able to search more web pages in less
time.
High probability of finding the desired
page(s)
It will get at least some results when
no result had been obtained with
traditional search engines.
Pros and Cons of Meta Search Engines
Cons :-
Metasearch engine results are less relevant, since it
doesn’t know the internal “alchemy” of search engine
used.
Since, only top 10-50 hits are retrieved from each
search engine, the total number of hits retrieved may
be considerably less than found by doing a direct
search.
Advanced search features (like, searches with
boolean operators and field limiting ; use of " ", +/-.
default AND between words e.t.c.) are not usually
available.
Meta Search Engines Cont….
Meta- Ad Special
Primary Web
Search Databas Feature
Databases
Engine es s
Vivisimo Ask, MSN, Gigablast, Looksmart, Google Clusters
Open Directory, Wisenut results
Ask, MSN, Gigablast, Looksmart, Clusters
Clusty Google
Open Directory, Wisenut results
Ixquick AltaVista, EntireWeb, Gigablast, Yahoo
Go, Looksmart,Netscape, Open
Directory,Wisenut, Yahoo
Dogpile Ask, Google, MSN, Yahoo!, Teoma, Google, All top 4
Open Directory, more Yahoo engines
Mamma About, Ask, Business.com, Miva, Refine
EntireWeb, Gigablast, Open Ask options
AlltheWeb, AltaVista, EntireWeb,
Directory,Wisenut
Exalead, Hotbot, Looksmart, Visual
Kartoo Lycos, MSN, Open Directory, ?? results
Teoma, ToileQuebec, Voila, display
Wisenut, Yahoo
Meta Search Engines (MSEs)
Come In Four Flavors
• "Real" MSEs which aggregate/rank the
results in one page
• "Pseudo" MSEs type I which exclusively
group the results by search engine
• "Pseudo" MSEs type II which open a
separate browser window for each search
engine used and
• Search Utilities, software search tools.
THANK
YOU