CS407: Introduction to IT
Internet, World Wide Web and Search Engine
Network Components
• Sender
• Receiver
• Message
• Transmission Medium and
• Protocol
Types of Data
• Text
ASCII representation
• Numbers
• Images – pixels, RGB/YCM
• Audio
Continuous data – different from above three
• Video – concept of jitter
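As a rough illustration of the first data types above, the sketch below shows how text reduces to ASCII codes and how one RGB pixel reduces to three numbers. The sample string and the pixel value are assumptions chosen only for illustration.

```python
# Text: each character maps to a numeric ASCII code.
text = "IT"
ascii_codes = [ord(ch) for ch in text]
ascii_bits = [format(code, "08b") for code in ascii_codes]  # 8-bit patterns

# Image: one pixel as a (red, green, blue) triple, 8 bits per channel.
orange_pixel = (255, 165, 0)  # illustrative value, not from the slides
bits_per_pixel = 8 * len(orange_pixel)

print(ascii_codes)      # numeric code of each character
print(ascii_bits)       # the same codes as bit patterns
print(bits_per_pixel)   # bits needed for one RGB pixel
```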
Network Criteria
• Performance – transit time or response time
• Reliability – accuracy of delivery
• Security –
• protection from unauthorized access
• protection from damage and development, and
• Implementation of policies and procedures for recovery from breaches and
data losses
Physical Structure of Network – Topology
Categories of Networks
• Local Area Network
• Metropolitan Area Network
• Wide Area Network
Local Area Network
• Privately owned network
• Limited to small area
• Size limited to a few kilometers
• Common topology for all devices
• Single transmission medium
• Speed – normally 100 to 1000 Mbps
Wide Area Network
• Long distance transmission of data
• Large geographical areas – country, continent or even whole world
• Can be as complex as the backbones that connect the Internet or
• As simple as a dial-up line that connects a home computer to the
Internet.
• We normally refer to the first as a switched WAN and to the second as
a point-to-point WAN
Metropolitan Area Network
• Size between LAN and WAN
• Size is limited to within a city
Internet
• Began as an academic research project in 1969 and became a
global commercial internetwork by the 1990s
• Completely decentralized structure:
• No one owns the Internet
• No one decides who can connect to it
Brief History of the Internet
• It started as an academic research network that was funded by the US
military's Advanced Research Projects Agency (ARPA)
• It derives its original name ARPANET from there
• The network began its operation in 1969
• The funding was later shifted to the National Science Foundation
(NSF), which helped create and operate the long-distance networks that
now form the backbone of the Internet
• Later, control of the Internet was given to private agencies, and it
has been operated privately since then
Present Day Internet
• It can be viewed as an organized decentralized network of networks
• Voluntary connection agreements govern the connection which is
managed by various public, private organizations, universities and
governments etc.
Components of the Internet
• Last mile (users)
• Data Centers (content)
• Backbone (network)
Last Mile Connections
• These are home users or small businesses that connect to the
Internet for communication and information exchange, data storage
and processing etc.
• Connection is provided through:
• Cable TV networks
• Optic Fiber Cables
• DSL connection (rarely used now)
• Wireless services
Data Centers
• These are servers that store user data and also host a number of
websites and applications
• Either owned by large companies like Google, Facebook, Amazon etc.
or by commercial facilities that host smaller websites
• Have very fast Internet connection and can serve thousands of users
simultaneously
• Often located in remote areas where there are low electricity and
land charges
Image showing Google’s EU Data Center
Image showing inside view of one of Google’s Data Centers
Backbone
• Consists of long-distance networks
• These are mostly optical fiber cables (OFCs) that carry data between
data centers and consumers
• Backbone providers connect their networks at Internet Exchange
Points (IXPs)
• This helps them in improving their connections with each other
The Internet Today
• The Internet today is not a simple hierarchical structure.
• Made up of many wide- and local-area networks joined by connecting
devices and switching stations.
• Today most end users who want an Internet connection use the services
of Internet service providers (ISPs).
• There are international service providers, national service providers,
regional service providers, and local service providers.
• The Internet today is run by private companies and not the
government
International and National ISPs
• International Internet Service Providers:
• At the top of the hierarchy are the international service providers that
connect nations together.
• National Internet Service Providers:
• The national Internet service providers are backbone networks created and
maintained by specialized companies.
• To provide connectivity between the end users, these backbone networks are
connected by complex switching stations (normally run by a third party) called
network access points (NAPs).
• Some national ISP networks are also connected to one another by private
switching stations called peering points.
• These normally operate at a high data rate (up to 600 Mbps).
Regional and Local ISPs
• Regional Internet Service Providers:
• Regional internet service providers or regional ISPs are smaller ISPs that are
connected to one or more national ISPs.
• They are at the third level of the hierarchy with a smaller data rate
• Local Internet Service Providers:
• Local Internet service providers provide direct service to the end users.
• The local ISPs can be connected to regional ISPs or directly to national ISPs. Most end
users are connected to the local ISPs.
• Note that in this sense, a local ISP can be a company that just provides Internet
services, a corporation with a network that supplies services to its own employees,
or a nonprofit organization, such as a college or a university, that runs its own
network.
• Each of these local ISPs can be connected to a regional or national service provider.
Protocols and Standards
• Protocol - synonymous with rule.
• A protocol is a set of rules that govern data communications.
• Defines - what is communicated, how it is communicated, and when
it is communicated.
• The key elements of a protocol are syntax, semantics, and timing.
Standards
• Standards provide guidelines to manufacturers, vendors, government
agencies, and other service providers to ensure the kind of
interconnectivity necessary in today's marketplace and in
international communications
• Standards are essential in creating and maintaining an open and
competitive market for equipment manufacturers and in
guaranteeing national and international interoperability of data and
telecommunications technology and processes
Types of Standards
• Data communication standards fall into two categories:
• de facto (meaning "by fact" or "by convention") and
• de jure (meaning "by law" or "by regulation").
• De facto - Standards that have not been approved by an organized
body but have been adopted as standards through widespread use
are de facto standards.
• De facto standards are often established originally by manufacturers
who seek to define the functionality of a new product or technology
• De jure - Those standards that have been legislated by an officially
recognized body are de jure standards
Standards Organizations
• International Organization for Standardization (ISO).
• International Telecommunication Union-Telecommunication
Standards Sector (ITU-T).
• American National Standards Institute (ANSI)
• Institute of Electrical and Electronics Engineers (IEEE)
The Internet Technical Standards
• The technical standards are managed by an organization called the
Internet Engineering Task Force (IETF)
• This is an open organization: anyone can attend its meetings,
propose new standards or recommend changes to existing ones
• No one is required to adopt the standards endorsed by the IETF;
however, the Internet community generally adopts them because of
the IETF's open, consensus-based process
Addressing on the Internet
• Each device on the Internet, be it a user computer or a server, needs
to be identified by a unique address in order for communication to
take place
• The IP Address or the Internet Protocol address is a unique number
that is used to identify a computer on the Internet
• The Internet Corporation for Assigned Names and Numbers (ICANN)
is responsible for issuing domain names as well as IP addresses
• It ensures that no two organizations use the same address
IPv4 v/s IPv6
• The original IP addressing mechanism used a 32-bit address scheme
for addressing computers on the Internet
• IPv4 addresses are written as 4 sets of 8 bits each, for
example 66.94.29.13
• Each octet has a range of 0-255, so a total of 2^32 =
4,294,967,296, i.e. around 4 billion, unique addresses can be
generated
• As the size of the Internet grew manifold, the number of possible IP
addresses was unable to support it
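The arithmetic behind the 32-bit scheme can be checked with a short sketch using Python's standard `ipaddress` module. It reuses the example address from the slide; the variable names are only illustrative.

```python
import ipaddress

# Parse the dotted-decimal example address from the slide.
address = ipaddress.IPv4Address("66.94.29.13")

# The four octets, each in the range 0-255 (8 bits).
octets = [int(part) for part in str(address).split(".")]

# The same address viewed as a single 32-bit number.
as_integer = int(address)

# Number of distinct 32-bit addresses.
total_addresses = 2 ** 32

print(octets)
print(as_integer)
print(total_addresses)
```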
IPv6
• The world's population of more than 7 billion people, many with
several connected devices, far exceeds the roughly 4 billion
addresses supported by IPv4
• As the size of the Internet grew enormously, the Internet engineers
developed a new standard for addressing, i.e. IPv6
• IPv6 uses 128-bit addresses written in hexadecimal, i.e. in the
form of 8 sets of 16 bits each
• It allows a very large number of unique addresses (2^128, a 39-digit
number), so the address space is not expected to run out in the
foreseeable future
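A small sketch, again using the standard `ipaddress` module, shows the usual written form of an IPv6 address (8 hexadecimal groups of 16 bits) and the size of the 128-bit address space. The address `2001:db8::1` is an assumption: it comes from the range reserved for documentation examples, not from the slides.

```python
import ipaddress

# Example address from the documentation-only range (an assumption).
address = ipaddress.IPv6Address("2001:db8::1")

# The fully written-out form has 8 colon-separated groups,
# each of 4 hexadecimal digits (16 bits).
groups = address.exploded.split(":")

# Size of the 128-bit address space and its length in decimal digits.
total = 2 ** 128
digits = len(str(total))

print(address.exploded)
print(len(groups), "groups of", len(groups[0]) * 4, "bits")
print(digits, "decimal digits")
```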
Advantages of IPv6
• Larger Address Space
• Support for more security
• Support for resource allocation: allows real-time audio and video
support
• IPv4: 4 parts of 8 bits each
• IPv6: 8 parts of 16 bits each
Services using the Internet: E-mail, File Transfer, WWW
World Wide Web
• Popular way of publishing information on the Internet
• Accessed through a client known as web browser
• It was developed by Tim Berners-Lee at CERN, the European
Organization for Nuclear Research; he proposed it in 1989 and it
became publicly available in 1991
• Provides a powerful and user-friendly interface for accessing content
on the Internet
Indicative Image of the World Wide Web
Structure of the World Wide Web
• The world wide web is a system of interconnected documents or other
resources on the Internet where each document/resource is identified by a
Uniform Resource Locator (URL)
• URLs make hyperlinks possible, that is, interconnections between
documents which allow one document to be accessed from within another
• Resources on the web are transferred via the Hyper Text Transfer
Protocol (HTTP) and can be accessed by web browsers
• The World Wide Web Consortium (W3C) is responsible for standardization
of the WWW; however, its recommendations are not binding on
content creators
• In practice, the major players in the web browser market, i.e. Google,
Apple, Mozilla and Microsoft, influence the standards, and whatever
technology they adopt becomes the de facto web standard
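To make the idea of a URL concrete, the sketch below splits one into its parts with the standard `urllib.parse` module. The URL itself is hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only to show the parts of a locator.
url = "http://www.example.com/courses/cs407.html#internet"
parts = urlsplit(url)

print(parts.scheme)    # protocol used to fetch the resource
print(parts.netloc)    # host that serves the resource
print(parts.path)      # location of the document on that host
print(parts.fragment)  # position inside the document
```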
Content Creation for World Wide Web
• Web pages are hypertext documents that are created using Hyper
Text Markup Language (HTML)
• Navigation from one page to another, as well as between different
locations on the same page, is supported by special HTML syntax that
embeds the URLs
• Initially, WWW contained only textual documents but nowadays web
pages support specialized multimedia contents such as audio, video,
images and software components
• This has led to immense enhancement in the application of World
Wide Web and nowadays various sophisticated client applications are
supported on the WWW including content generation websites
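The sketch below illustrates how URLs are embedded in HTML: it reads a tiny hand-written page and collects the targets of its links using the standard `html.parser` module. The page content and the `LinkCollector` class name are assumptions made for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page: one link to another page, one to a location
# on the same page.
page = """<html><body>
<h1 id="top">Notes</h1>
<a href="http://www.example.com/next.html">Next page</a>
<a href="#top">Back to top</a>
</body></html>"""

collector = LinkCollector()
collector.feed(page)
links = collector.links
print(links)
```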
Search Engine
• Search Engines are sophisticated programs that provide the facility to
search the world wide web and retrieve information specified in the
form of a web search query
• Designing a search engine is complex because the program has to
search over the enormous unstructured data available on the web
that are only connected through hyperlinks and present the most
relevant information to the user in a short time
• Examples of commonly used search engines are Google, Bing,
Yahoo, DuckDuckGo etc.
Working of a Search Engine
• In its most basic form the search engine performs the following steps
for finding relevant information on the web:
• Crawling
• Indexing
• Searching
• Ranking
• Crawlers: Computer programs, also known as spiders, that move from
one webpage to another using hyperlinks on the world wide web and
record the contents of the webpages. E.g. the crawler used by Google
is called Googlebot
Indicative image of a web crawler
Web Crawler working
• The crawler reaches pages on the WWW by following links between
websites
• Usually there is a limit to how much content from a single page will
be indexed
• If there are specific pages that the website owner thinks are most
important, they can be listed in a special file (such as a sitemap)
from where the crawler can access and index those pages
• If a page is hidden behind a form or an image, or is protected
content, then it will not be indexed
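The link-following behaviour described above can be sketched as a breadth-first traversal. This is a minimal illustration, not how any real crawler is implemented: the "web" here is a hypothetical in-memory dictionary instead of real pages fetched over HTTP, and the `crawl` function name is an assumption.

```python
from collections import deque

# Hypothetical miniature web: page -> pages it links to.
WEB = {
    "home": ["about", "courses"],
    "about": ["home"],
    "courses": ["cs407", "home"],
    "cs407": [],
    "orphan": ["home"],  # no page links here, so it is never reached
}


def crawl(start, web):
    """Visit every page reachable from start by following links."""
    visited = []          # pages in the order they were crawled
    seen = {start}        # pages already queued, to avoid repeats
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


crawl_order = crawl("home", WEB)
print(crawl_order)
```

The page named "orphan" is never visited because nothing links to it, which mirrors why pages unreachable by links need some other route to the crawler.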
Indexing
• Huge databases are maintained that store the information collected
by the web crawler
• The crawler identifies important information on each web page and
sends it to these databases
• The databases are maintained as an inverted index, much like the
index that you see at the end of a book
• For each word, you get an entry giving information about the page
numbers (web pages in case of search engines) where that word is
present
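A minimal sketch of an inverted index, assuming three tiny hypothetical pages and simple whitespace splitting (real search engines do far more text processing):

```python
from collections import defaultdict

# Hypothetical pages: identifier -> text recorded by the crawler.
PAGES = {
    "page1": "the internet is a network of networks",
    "page2": "a search engine indexes the web",
    "page3": "the web runs on the internet",
}


def build_inverted_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


index = build_inverted_index(PAGES)
print(sorted(index["internet"]))
print(sorted(index["web"]))
```

Looking up a word is then a single dictionary access, just as a book index gives page numbers without rereading the book.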
Ranking
• While searching the index, the search engine tries to find the matches
to the query given by the user
• Usually there are thousands or even hundreds of thousands of
matches for a particular query
• The web pages are therefore ranked according to their relevance to
the user query and they are displayed in a ranked order to the viewer
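As a deliberately simple illustration of ranked output, the sketch below scores each page by how often the query words occur in it and lists the matches from highest to lowest score. Counting word occurrences is an assumption made for this example; it is not the relevance measure of any particular search engine, and the pages are hypothetical.

```python
# Hypothetical pages: identifier -> text.
PAGES = {
    "a": "internet history and internet protocols",
    "b": "network protocols",
    "c": "search engines",
}


def rank(query, pages):
    """Return matching page ids, best match first."""
    terms = query.lower().split()
    scores = {}
    for page_id, text in pages.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:          # pages with no match are left out
            scores[page_id] = score
    # Highest score first; ties broken by page id for stable output.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


results = rank("internet protocols", PAGES)
print(results)
```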
Search Algorithms
• These are basically matching algorithms that try to find the best
match for the query given by the user
• It is important to understand that the match for a query only comes
from the index that the search engine has saved. Therefore, if a page
is not indexed by a particular search engine, the content of that
page will not appear in its search results even if it is related to
the query
• The two most important factors for determining the relevance of a
page are its content and its links; links form the basis of the
PageRank algorithm. Other factors include location, device type,
language, previous search history etc.
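The slides only name PageRank, so the following is a simplified textbook-style sketch of the idea that links confer importance, not the algorithm as used by any search engine. The link graph, the damping factor of 0.85 and the fixed number of iterations are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of scores."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}   # start with equal scores
    for _ in range(iterations):
        # Every page receives a small base score.
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page shares its score equally among pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links shares its score with every page.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph page C is linked to by three other pages and ends up with the highest score, while page D, which nothing links to, ends up with the lowest.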
Commercial Search Engines: Making Money
• Organic Search – This search is done automatically by the search
engine. The results that appear for a search query depend purely on
the method of indexing and the algorithms used for searching and
ranking
• Paid Search – These are the search results for which advertisers pay.
The search engine additionally shows relevant advertisements to the
user so that the advertisers get new customers
Personalization of Search Results
• Personalization implies finding results that are relevant to a particular
user
• This is based on specific information about the user that is
saved with the search engine
• This could be the age, gender, previous search history, time and
location etc.
• This requires profiling of the users of the system