Introduction to Web Mining
Web mining is the process of applying data mining techniques to
discover useful information from the vast data available on the
web.
It focuses on extracting patterns, knowledge, and insights from
different types of web data such as web content, user behavior,
and link structures.
Web mining integrates methods from computer science, data
mining, machine learning, and statistics to analyze the web,
which includes web pages, web logs, social media data, and
online transactions.
Web mining enables businesses, organizations, and individuals
to make informed decisions, optimize content, improve user
experience, and uncover trends and hidden knowledge from the
web.
It is especially valuable for analyzing large volumes of
unstructured data, which is prevalent on the internet.
Features of Web Mining
1. Multidimensional Data:
o Web mining deals with multidimensional data that
can come from various sources, such as text
(articles, blogs, news), multimedia (images, videos),
and user-generated data (reviews, ratings, social
media posts).
2. Data Variety:
o Web data is diverse, including structured, semi-
structured, and unstructured formats. This variety
makes it challenging yet rewarding to mine data, as it
includes not just traditional databases but also web
pages, social media, emails, and multimedia content.
3. Scalability:
o The web generates vast amounts of data. Web mining
techniques must be able to scale effectively to handle
and process large datasets, sometimes involving
billions of records.
4. Dynamic and Evolving Data:
o Web data is constantly changing as websites are
updated, new content is added, and user behaviors
evolve. Web mining must be adaptable to capture this
dynamism.
5. Noise and Inconsistencies:
o The data on the web can be noisy, meaning it may
contain irrelevant, inaccurate, or incomplete
information. Web mining techniques often require
cleaning and pre-processing to extract meaningful
insights.
6. Interactivity:
o Web mining often involves real-time or near-real-time
data analysis, for example analyzing user behavior or
social media trends as they occur.
Applications of Web Mining
Web mining has a wide range of applications across various
domains. Some of the prominent applications include:
1. Search Engine Optimization (SEO):
o Web structure mining is used to optimize search
engine rankings. Algorithms like PageRank analyze
the link structure of the web to determine which
pages are more authoritative and relevant,
improving search engine results.
2. Personalized Recommendations:
o Web usage mining helps to create personalized content
and product recommendations. For example, e-
commerce websites suggest products based on past
browsing history, user preferences, and behavior.
3. E-commerce and Online Retail:
o Web mining helps retailers analyze customer behavior,
predict trends, optimize product placements, and
improve sales strategies. By analyzing user clicks,
search queries, and purchases, online stores can better
understand consumer preferences.
4. Social Media Analysis:
o Web content mining and web usage mining are used
to analyze social media data for sentiment analysis,
trend analysis, and understanding user opinions. This
can help businesses with brand management, customer
service, and market analysis.
5. Web Analytics:
o Web mining is crucial for analyzing website
traffic and performance. It helps to understand
user navigation patterns, bounce rates, and which
pages are most frequently visited. This information
is used to optimize websites for better user
experience and conversion rates.
6. Fraud Detection:
o Web usage mining can help in detecting fraudulent
activities like credit card fraud or fake reviews by
identifying unusual patterns of behavior on websites.
For example, detecting patterns of fraudulent activity
in e-commerce transactions or identifying bots that
skew social media engagement.
7. Content Curation and Marketing:
o Web content mining helps marketers identify relevant
content, monitor competitors, and track industry trends.
Content mining is also used to generate automated
summaries, categorize content, and optimize marketing
strategies.
8. Healthcare and Bioinformatics:
o Web mining is used to analyze online health data, patient
reviews, and medical research papers. By analyzing
this data, healthcare organizations can improve patient
care, identify medical trends, and personalize health
interventions.
9. Education:
o In online education, web mining helps in tracking
students' learning behaviors, analyzing online
discussion forums, and recommending learning
resources. It can also be used to assess the effectiveness
of educational content.
10. Political and Opinion Mining:
o Web mining is used to track public opinion, analyze
political campaigns, and monitor social media for
sentiment about political figures, events, or issues.
This can assist in understanding public perception
and sentiment trends.
Web Content Mining, Web Structure Mining, and Web Usage
Mining
Web mining is a broad field that focuses on extracting valuable
insights from the vast amount of data available on the web. It
can be categorized into three main types based on the type of
web data each one analyzes:
1. Web Content Mining
2. Web Structure Mining
3. Web Usage Mining
Each of these types addresses a different aspect of the web and
serves distinct purposes. Below, we explore each type in more
detail:
1. Web Content Mining
Definition:
Web content mining refers to the process of extracting useful
information from the content of web pages. This includes both
structured content (like tables and forms) and unstructured content
(like text, images, and videos). The goal is to analyze and extract
meaningful data, patterns, and insights from this content.
Key Tasks in Web Content Mining:
∙ Text Mining: Analyzing textual content from web pages to
identify patterns, topics, and relationships. Techniques
like Natural Language Processing (NLP) are used for
extracting information and categorizing text.
o Example: Analyzing online reviews to determine
common customer sentiments about a product.
∙ Information Retrieval: Searching and retrieving
relevant data from web pages. This often involves
creating indexes for efficient querying and ranking
relevant content.
o Example: Search engines like Google retrieve web
content based on a user’s search query.
∙ Sentiment Analysis: Determining the sentiment (positive,
negative, or neutral) behind user-generated content such
as product reviews, comments, and social media posts.
o Example: Identifying public sentiment toward a brand
based on tweets or online discussions.
∙ Multimedia Mining: Extracting and analyzing data
from multimedia content like images, videos, and
audio.
o Example: Recognizing and classifying images in online
stores or analyzing video content for particular themes.
∙ Document Classification: Categorizing documents into
different classes or topics based on their content.
o Example: Automatically classifying news articles into
categories like politics, sports, or entertainment.
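The sentiment analysis task above can be illustrated with a minimal lexicon-based sketch. The word lists and the sample reviews are assumptions made up for this example; a practical system would typically rely on a trained NLP model rather than a handful of keywords.

```python
import re

# Toy sentiment lexicon (an assumption for illustration, not a real resource).
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "slow", "broken"}

def sentiment(text):
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical product reviews.
reviews = [
    "Great phone, fast delivery and excellent battery.",
    "Terrible screen and the charger arrived broken.",
    "It is a phone.",
]
labels = [sentiment(r) for r in reviews]
print(labels)
```

Counting keyword hits ignores negation and context ("not good" would score as positive), which is one reason NLP techniques are used in practice.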
Applications of Web Content Mining:
∙ Search Engines: Retrieve relevant content from the web
based on user queries.
∙ E-commerce: Analyzing product descriptions and user
reviews to recommend products.
∙ Social Media Analysis: Monitoring sentiment and
extracting key themes from social media platforms.
2. Web Structure Mining
Definition:
Web structure mining focuses on analyzing the link
structures of web pages. The primary objective is to
understand how web pages are connected through hyperlinks
and how the structure of these connections can reveal
patterns or relationships between websites.
Key Tasks in Web Structure Mining:
∙ Link Analysis: Examining the hyperlink structure of the web
to understand relationships between different web pages
and websites.
o Example: Analyzing how web pages are connected
through backlinks and how these connections affect
their importance.
∙ PageRank Algorithm: A famous algorithm used by search
engines like Google to rank web pages based on their
importance. It measures the quantity and quality of links
pointing to a page, under the assumption that important
pages receive more links, especially from other important
pages.
o Example: A page with a high number of backlinks
from authoritative sites is ranked higher.
∙ Web Graph Mining: Creating and analyzing the "web
graph," where websites are represented as nodes and
hyperlinks between them as edges. This analysis helps
in understanding the overall structure of the web.
o Example: Identifying clusters of pages that are
interconnected, such as communities of interest or
similar topics.
∙ Link Prediction: Predicting future links between web pages
based on their existing structure and relationships. This can
help in identifying emerging trends or relationships.
o Example: Predicting which new websites will link to a
particular topic or brand.
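As a rough illustration of link analysis on a web graph, the sketch below builds the graph from hyperlink pairs and counts backlinks per page. The page names and links are hypothetical, and a plain backlink count is only the simplest possible importance signal.

```python
from collections import defaultdict

# Hypothetical (source_page, target_page) hyperlink pairs.
links = [
    ("home", "products"),
    ("home", "blog"),
    ("blog", "products"),
    ("products", "cart"),
    ("blog", "home"),
]

# Pages are nodes; each hyperlink is a directed edge.
out_links = defaultdict(set)
in_links = defaultdict(set)
for src, dst in links:
    out_links[src].add(dst)
    in_links[dst].add(src)

pages = sorted(set(out_links) | set(in_links))

# Number of distinct pages linking to each page (its backlinks).
backlink_count = {p: len(in_links[p]) for p in pages}
most_linked = max(pages, key=lambda p: backlink_count[p])
print(backlink_count, most_linked)
```

Unlike PageRank, this count treats every backlink as equally valuable, regardless of how important the linking page is.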
Applications of Web Structure Mining:
∙ Search Engine Optimization (SEO): Understanding link
structures helps in improving search engine rankings by
analyzing link patterns.
∙ Website Navigation: Optimizing website layout and
navigation by understanding which pages are most
commonly linked to.
∙ Social Network Analysis: Identifying key influencers or hubs
in a network based on the structure of connections.
3. Web Usage Mining
Definition:
Web usage mining is concerned with analyzing the behavior of
users on the web, including their interactions with websites.
This type of mining involves tracking user activities, such as page
visits, clicks, and browsing patterns, to extract insights that can
improve website performance, user experience, and business
strategies.
Key Tasks in Web Usage Mining:
∙ Log File Analysis: Analyzing web server logs to track users’
interactions with a website. This data includes page views,
clickstreams, and other user actions.
o Example: Analyzing web server logs to see which
pages are most frequently visited and how users
navigate through a website.
∙ Clickstream Analysis: Tracking and analyzing the
sequence of clicks made by users during their visit to a
website. This helps to identify user behavior patterns,
navigation flows, and potential bottlenecks.
o Example: Understanding how users move from one
page to another before making a purchase on an e-
commerce site.
∙ Sessionization: Grouping interactions into sessions to analyze
users' actions within a particular visit. A session is typically
a series of user interactions during a single visit to a website.
o Example: Understanding a user’s journey from landing
on a homepage to checking out an item in a
shopping cart.
∙ User Profiling: Building profiles based on users' behavior
and interactions on a website. This helps to understand
individual preferences, browsing habits, and interests.
o Example: Creating personalized recommendations for users
based on their browsing history, such as recommending
products on an e-commerce site.
∙ Behavioral Pattern Discovery: Identifying patterns and
trends in users' behaviors. This involves clustering users or
discovering frequent sequences of actions.
o Example: Identifying users who frequently browse a
particular category of products on an online store,
allowing the business to tailor offers and
advertisements.
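Sessionization, described above, can be sketched as follows. The clickstream records are hypothetical, and the 30-minute inactivity timeout is a common convention assumed here rather than something the notes prescribe.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session (assumed)

# Hypothetical clickstream records: (user_id, unix_timestamp, page).
clicks = [
    ("u1", 1000, "/home"),
    ("u1", 1120, "/products"),
    ("u2", 1130, "/home"),
    ("u1", 1300, "/cart"),
    ("u1", 9000, "/home"),
    ("u2", 1500, "/blog"),
]

def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group each user's clicks into sessions split by long gaps."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))

    sessions = []
    for user, events in by_user.items():
        current = [events[0][1]]
        for (prev_ts, _), (ts, page) in zip(events, events[1:]):
            if ts - prev_ts > timeout:
                # Gap too long: close the current session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

sessions = sessionize(clicks)
for user, pages in sessions:
    print(user, " -> ".join(pages))
```

Each resulting session is an ordered page sequence, which is the input that clickstream analysis and behavioral pattern discovery typically work on.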
Applications of Web Usage Mining:
∙ Personalized Recommendations: E-commerce sites, like
Amazon, use web usage mining to recommend products
based on user behavior.
∙ Website Optimization: Improving website design and
user experience by analyzing user navigation patterns
and behaviors.
∙ Targeted Advertising: Tailoring ads based on users'
browsing habits or search history.
∙ Fraud Detection: Identifying unusual or suspicious
behavior patterns, such as bot activity or fraudulent
transactions.
Summary Comparison of Web Mining Types
Focus:
∙ Web Content Mining: Content of web pages (text, images,
videos, etc.)
∙ Web Structure Mining: The link structure between web pages
∙ Web Usage Mining: User behavior and interactions on websites
Main Tasks:
∙ Web Content Mining: Text mining, sentiment analysis,
multimedia mining
∙ Web Structure Mining: Link analysis, PageRank, web graph
mining
∙ Web Usage Mining: Log file analysis, clickstream analysis,
user profiling
Objective:
∙ Web Content Mining: Extract information from web page
content
∙ Web Structure Mining: Analyze relationships between web
pages via hyperlinks
∙ Web Usage Mining: Understand user behavior and optimize
web experiences
Applications:
∙ Web Content Mining: Search engines, e-commerce, social
media analysis
∙ Web Structure Mining: SEO, website navigation, social
network analysis
∙ Web Usage Mining: Personalized recommendations, website
optimization, fraud detection
Page Rank Algorithm
The PageRank algorithm applies to web pages. It is used by
Google Search to rank web pages in its search engine results.
The algorithm is named after Larry Page, one of the founders of
Google. PageRank is a way of measuring the importance of web
pages. The web can be modelled as a directed graph with two
components, nodes and connections: the pages are the nodes and
the hyperlinks are the connections.
Let us work through an example of the PageRank algorithm.
Compute the page rank of every node at the end of the second
iteration, using teleportation factor β = 0.8.
The formula is:
PR(A) = (1-β) + β * [PR(B) / Cout(B) + PR(C) / Cout(C) + ... +
PR(N) / Cout(N)]
Here, B, C, ..., N are the pages that link to A, Cout(X) is the
number of outgoing links of page X, and β is the teleportation
factor (more commonly called the damping factor), i.e. 0.8.
NOTE: We only need to compute the first two iterations.
Let us create a table of the 0th Iteration, 1st Iteration, and 2nd
Iteration.
NODES  ITERATION 0     ITERATION 1  ITERATION 2
A      1/6 ≈ 0.1667    0.3          0.392
B      1/6 ≈ 0.1667    0.32         0.3568
C      1/6 ≈ 0.1667    0.32         0.3568
D      1/6 ≈ 0.1667    0.264        0.2714
E      1/6 ≈ 0.1667    0.264        0.2714
F      1/6 ≈ 0.1667    0.392        0.4141
Iteration 0:
For iteration 0, assume that each page has page rank =
1 / (total number of nodes).
Therefore, PR(A) = PR(B) = PR(C) = PR(D) = PR(E) = PR(F) =
1/6 ≈ 0.1667
Iteration 1:
Using the formula above:
PR(A) = (1-0.8) + 0.8 * [PR(B)/4 + PR(C)/2]
= (1-0.8) + 0.8 * [(1/6)/4 + (1/6)/2]
= 0.3
Here, for node A, we first identify its incoming links, which
come from B and C. For each page that links to A, we divide
that page's rank by its number of outgoing links: B has 4
outgoing links and C has 2 outgoing links. The same procedure
applies to the remaining nodes and iterations.
NOTE: USE THE UPDATED PAGE RANK FOR
FURTHER CALCULATIONS.
PR(B) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.3/2
= 0.32
PR(C) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.3/2
= 0.32
PR(D) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.32/4
= 0.264
PR(E) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.32/4
= 0.264
PR(F) = (1-0.8) + 0.8 * [PR(B)/4 + PR(C)/2]
= (1-0.8) + 0.8 * [(0.32/4) + (0.32/2)]
= 0.392
This was for iteration 1, now let us calculate iteration 2.
Iteration 2:
By using the above-mentioned formula
PR(A) = (1-0.8) + 0.8 * [PR(B)/4 + PR(C)/2]
= (1-0.8) + 0.8 * [(0.32/4) + (0.32/2)]
= 0.392
NOTE: USE THE UPDATED PAGE RANK FOR
FURTHER CALCULATIONS.
PR(B) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.392/2
= 0.3568
PR(C) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.392/2
= 0.3568
PR(D) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.3568/4
= 0.2714
PR(E) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.3568/4
= 0.2714
PR(F) = (1-0.8) + 0.8 * [PR(B)/4 + PR(C)/2]
= (1-0.8) + 0.8 * [(0.3568/4) + (0.3568/2)]
= 0.4141
So, the final page ranks for the question above are:
NODES  ITERATION 0     ITERATION 1  ITERATION 2
A      1/6 ≈ 0.1667    0.3          0.392
B      1/6 ≈ 0.1667    0.32         0.3568
C      1/6 ≈ 0.1667    0.32         0.3568
D      1/6 ≈ 0.1667    0.264        0.2714
E      1/6 ≈ 0.1667    0.264        0.2714
F      1/6 ≈ 0.1667    0.392        0.4141
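The hand calculation above can be checked with a short sketch. The link structure is an assumption inferred from the worked formulas (A links to B and C; B links to A, D, E, and F; C links to A and F; D, E, and F have no outgoing links in the calculations), and the function name `pagerank_in_place` is hypothetical. The update is done in place, so each node uses the already-updated ranks of earlier nodes, as the NOTE in the worked example requires.

```python
def pagerank_in_place(out_links, beta=0.8, iterations=2):
    """Return the page ranks after each iteration, starting with iteration 0."""
    nodes = list(out_links)

    # Invert the graph: for every node, which pages link to it?
    in_links = {n: [] for n in nodes}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)

    # Iteration 0: every page starts with rank 1 / (number of nodes).
    pr = {n: 1.0 / len(nodes) for n in nodes}
    history = [dict(pr)]

    for _ in range(iterations):
        for n in nodes:
            # PR(n) = (1 - beta) + beta * sum(PR(s) / Cout(s)) over pages s linking to n.
            # pr is updated in place, so later nodes see the new values.
            incoming = sum(pr[s] / len(out_links[s]) for s in in_links[n])
            pr[n] = (1 - beta) + beta * incoming
        history.append(dict(pr))
    return history

# Link structure inferred from the worked example (an assumption).
graph = {
    "A": ["B", "C"],
    "B": ["A", "D", "E", "F"],
    "C": ["A", "F"],
    "D": [],
    "E": [],
    "F": [],
}

history = pagerank_in_place(graph)
for i, ranks in enumerate(history):
    print(i, {n: round(r, 4) for n, r in ranks.items()})
```

Nodes are processed in the order A to F, matching the worked example. Updating all nodes from the previous iteration's values instead would give different intermediate numbers.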
MINING MULTIMEDIA DATA ON THE WEB
Websites are flooded with multimedia data such as video,
audio, images, and graphs. Each of these media types has
different characteristics and different methods of archiving
and retrieving information.
Multimedia data on the web also has different properties from
multimedia data elsewhere, which is why typical multimedia
data mining techniques cannot be applied directly.
Web-based multimedia is embedded in pages that contain text
and links. This text and these links are important features
for organizing the web pages that carry multimedia data.
Better organization of web pages enables more effective
search.
Web page layout mining can be applied to divide web pages
into a set of semantic blocks, separating the multimedia
blocks from non-multimedia content.
There are a few web-based mining terms and algorithms to
understand.
PageRank: This measure scores a web page by the number and
importance of the pages that link to it, and so indicates
the importance of the page. The Google search engine uses
the PageRank algorithm and ranks a web page as highly
significant if it is frequently linked to by other web
pages.
It works on the concept of a probability distribution
representing the likelihood that a person clicking links at
random would arrive at any particular page.
An equal distribution across all pages is assumed at the
beginning of the computation.
The measure is computed iteratively.
Repeating the ranking process brings each page's rank
closer to its true value.
HITS (Hyperlink-Induced Topic Search):
This measure is used to rate web pages.
It was developed by Jon Kleinberg.
It determines two scores for each web page: a hub score and
an authority score.
Hubs and authorities are defined by a recursive relationship
between web pages.
The algorithm analyzes the web link structure and speeds
up the search for relevant web pages. Given a query to a
search engine, the set of highly relevant web pages is
called the Root set.
These pages are potential Authorities. Pages that are not very
relevant themselves but point to pages in the Root set are
called Hubs.
Thus, an Authority is a page that many hubs link to, whereas
a Hub is a page that links to many authorities.
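The recursive hub and authority relationship can be sketched as an iterative computation. This is a minimal sketch under stated assumptions: the five-page graph is hypothetical, the iteration count is arbitrary, and scores are normalized by their Euclidean length after each round so they stay bounded.

```python
import math

def hits(out_links, iterations=20):
    """Return (hub, authority) score dictionaries for a small link graph."""
    nodes = list(out_links)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auth = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                auth[dst] += hub[src]

        # Hub score: sum of the authority scores of the pages it links to.
        hub = {n: sum(auth[d] for d in out_links[n]) for n in nodes}

        # Normalize both score vectors so the values do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth

# Hypothetical graph: three hub-like pages pointing at two authority-like pages.
graph = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
    "a1": [],
    "a2": [],
}

hub, auth = hits(graph)
print("best authority:", max(auth, key=auth.get))
print("best hub:", max(hub, key=hub.get))
```

In this graph, a1 receives links from all three hubs and so gets the highest authority score, while h3 links to only one authority and gets a lower hub score than h1 and h2.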
How do Search Engines Work?
Search engine crawling is the technique search engines use
to visit websites and follow links to other websites. Crawlers,
commonly referred to as spiders or bots, are the programs search
engines use to browse the web.
Crawlers begin by obtaining a list of URLs from several sources,
including sitemaps and links from other websites. The robots.txt
file on each website tells crawlers which parts of the site they
may visit and can point them to the site's sitemaps.
They then visit each of these URLs and follow the links on
the corresponding pages to find new URLs.
This procedure repeats continuously, with each newly discovered
URL added to the list of pages to visit.
A crawler visits a web page, extracts its information, and then
stores the content in a database. The search engine uses this
database, called the index, to obtain results when users
perform a search.
Step 1 - Crawling
Web Crawlers scan the internet for web pages. They follow the
URL links from one page to another and store URLs in the URL
store. The crawlers discover new content, including web pages,
images, videos, and files.
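The crawling loop can be sketched as a breadth-first traversal over a URL frontier. To keep the example runnable without network access, the "web" here is a hypothetical in-memory dictionary, and `fetch_links` stands in for downloading a page and extracting its links. A real crawler would also respect robots.txt and rate limits.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> links found on that page.
FAKE_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}

def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first, following links to discover new URLs."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # the URL store: everything discovered so far
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # skip URLs that are already queued or visited
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["https://example.com/"], lambda url: FAKE_WEB.get(url, []))
print(order)
```

The `seen` set prevents the crawler from looping forever on pages that link back to each other, and `max_pages` bounds the crawl.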
Step 2 - Indexing
Once a web page is crawled, the search engine parses the page and
indexes the content found on the page in a database. The content is
analyzed and categorized. For example, keywords, site quality,
content freshness, and many other factors are assessed to
understand what the page is about.
Step 3 - Ranking
Search engines use complex algorithms to determine the order of
search results. These algorithms consider various factors, including
keywords, pages' relevance, content quality, user engagement, page
load speed, and many others. Some search engines also personalize
results based on the user's past search history, location, device, and
other personal factors.
Step 4 - Querying
When a user performs a search, the search engine sifts through its
index to provide the most relevant results.
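The indexing, ranking, and querying steps can be sketched with a small inverted index. The page texts are hypothetical, and ranking by raw term counts is a deliberate simplification: real search engines combine many more signals, as described in Step 3.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: page id -> extracted text.
pages = {
    "page1": "Web mining extracts patterns from web data",
    "page2": "Search engines crawl and index web pages",
    "page3": "Data mining finds patterns in large data sets",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Inverted index: term -> {page id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Rank pages by the total count of query terms they contain."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken alphabetically by page id.
    return sorted(scores, key=lambda d: (-scores[d], d))

index = build_index(pages)
results = search(index, "web mining")
print(results)
```

Because the index maps terms to pages, a query only touches the entries for its own terms instead of scanning every stored page.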