WEB MINING
Presentation 1
CSE 590 DATA MINING
Prof. Anita Wasilewska
SUNY Stony Brook
Presented By:
Alka Simha 106677801
Avanthi Gupta 106616697
Megha Krishnamurthy 106616749
REFERENCES
• Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber
• Presentation Slides of Prof. Anita Wasilewska
• http://en.wikipedia.org/wiki/Web_mining
• http://www.ieee.org.ar/downloads/Srivastava-tut-pres.pdf
• http://searchcrm.techtarget.com/sDefinition/0,,sid11_gci789009,00.html
• http://www.cs.rpi.edu/~youssefi/research/VWM/
• http://www.galeas.de/webimining.html
• R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations,
2(1):1-15, 2000.
• R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide
web browsing patterns. Journal of Knowledge and Information Systems, 1(1):5-32, 1999.
• S. Chakrabarti. Data mining for hypertext: A tutorial survey. ACM SIGKDD
Explorations, 1(2):1-11, 2000.
• S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.
• Y. S. Maarek and I. Z. Ben Shaul. Automatically organizing bookmarks per contents.
Proc. Fifth International World Wide Web Conference, May 6-10, 1996.
OVERVIEW
• What is Web Mining
• Challenges in Web Mining
• Data Mining vs. Web Mining
• Classification or Taxonomy
• Applications of Web Mining
• Conclusion
What is Web Mining
• The web as we all know is the SINGLE largest
source of data available.
• Web mining aims to extract and mine useful
knowledge from the web.
• It is used to understand the customer behavior,
evaluate the effectiveness of a website and also
to help quantify the success of a marketing
campaign.
• Due to the large amount of data available on the World Wide
Web, it has become very important for users to use
automated tools to find the desired information
resources.
• For example, a user relies on Google or Yahoo search to
find information.
• These factors thus give rise to the necessity of
creating server and client side intelligent systems
which can effectively mine for knowledge.
• The information gathered through the Web is further
evaluated by using traditional data mining techniques
such as clustering, classification and association.
SEARCHING THE WEB
http://infolab.stanford.edu/~ullman/mining/2008/slides/web_mining_overview.pdf
HOW BIG IS THE WEB
224,749,695 sites across all domains (Netcraft Web Server Survey, March 2009)
http://news.netcraft.com/archives/web_server_survey.html
CHALLENGES IN WEB MINING
• Finding useful and relevant information.
• Creating knowledge from available information.
• As the coverage of information is very wide and diverse, personalization
of the information is a tedious process.
• Learning customer and individual user patterns.
• Much of the web information is redundant, as the same piece of
information or its variant appears in many pages.
• The web is noisy, i.e. a page typically contains a mixture of many kinds
of information, such as main content, advertisements, copyright notices
and navigation panels.
• The web is dynamic: information keeps changing constantly. Keeping
up with the changes and monitoring them is very important.
• The Web is also about services. Many Web sites and pages enable
people to perform operations with input parameters, i.e., they provide
services.
• The most important challenge faced is invasion of privacy. Privacy is
considered lost when information concerning an individual is obtained,
used, or disseminated without their knowledge or consent.
http://en.wikipedia.org/wiki/Web_mining
USES OF WEB MINING
• This technology has enabled ecommerce to do personalized marketing,
which eventually results in higher trade volumes.
• The predictive capability of mining applications can benefit society by
identifying criminal activities.
• Companies can establish better customer relationships by giving customers
exactly what they need.
• Companies can understand the needs of the customer better and
react to those needs faster.
• Companies can find, attract and retain customers, and they can save on
production costs by using the acquired insight into customer requirements.
• They can increase profitability through target pricing based on the profiles
created.
• They can even identify a customer who might defect to a competitor. The
company can then try to retain that customer with promotional offers,
thus reducing the risk of losing the customer.
http://en.wikipedia.org/wiki/Web_mining
WEB MINING vs DATA MINING
• STRUCTURE
– Data Mining: data is structured and has well-defined tables,
columns, rows, keys and constraints.
– Web Mining: data is dynamic and rich in features and patterns.
• Web mining involves analysis of the web server logs of a website,
whereas data mining involves using techniques to find
relationships in large amounts of data.
• SPEED
– Web mining often needs to react to evolving usage patterns in real time,
e.g. merchandizing.
http://www.information-management.com/news/5458-1.html
WEB CRAWLERS
• A Web crawler is a computer program that browses the World Wide Web in a
methodical, automated manner. Other terms for Web crawlers are ants, automatic
indexers, bots, worms, Web spiders and Web robots.
• Search engines use spidering as a means of providing up-to-date data.
• Crawlers can also be used for automating maintenance tasks on a Web site, such as
checking links or validating HTML code.
• Crawlers can be used to gather specific types of information from Web pages, such
as harvesting e-mail addresses (usually for spam). Addresses are therefore often
written in obfuscated form, e.g. anita at cs dot sunysb dot edu
or mueller{remove this}@cs.sunysb.edu
• A Web crawler is one type of bot, or software agent. In general, it starts with a list
of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all
the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl
frontier
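The seed list and crawl frontier described above can be sketched as a breadth-first traversal. This is only a minimal illustration: the toy link table TOY_WEB, the example.org URLs and the max_pages limit are assumptions made for the example, and a real crawler would fetch and parse pages over HTTP instead of reading a dictionary.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> hyperlinks found on that page.
# A real crawler would download and parse each page instead.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}


def crawl(seeds, get_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # URLs already queued, so none is visited twice
    visited = []              # pages in the order they were visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://example.org/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

Using a queue gives breadth-first order; replacing it with a priority queue is one way a crawler could prefer some URLs over others.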
April 21, 2009 Web Mining 13
WEB MINING TAXONOMY
Web Mining
• Web Content Mining
– Identify information within given web pages
– Distinguish personal home pages from other web pages
• Web Structure Mining
– Infer knowledge from the World-Wide Web organization and the links between
references and referents in the Web
• Web Usage Mining
– Also known as Web Log Mining
– Extract interesting patterns and trends in web access logs
WEB CONTENT MINING
• Discovery of useful information from web contents / data / documents
– Web data contents: text, image, audio, video, metadata and hyperlinks
• Pre-processing data before web content mining: feature selection
• Post-processing data can reduce ambiguous search results
• Web Page Content Mining:
– Mines the contents of documents directly
• Search Engine Mining:
– Improves on the content search of other tools like search engines
• Web Content Mining is related to data mining and text mining
– It is related to data mining because many data mining techniques can be
applied in Web content mining
– It is related to text mining because much of the web content is text
Issues in Web Content Mining
• Developing intelligent tools for information retrieval (IR)
– Finding keywords and key phrases
– Discovering grammatical rules and collocations
– Hypertext classification/categorization
– Extracting key phrases from text documents
– Learning extraction models/rules
– Hierarchical clustering
– Predicting word relationships
• Developing Web query systems
– WebOQL, XML-QL
• Mining multimedia data
– Mining satellite images (Fayyad, et al. 1996)
– Mining images to identify small volcanoes on Venus (Smyth, et al. 1996)
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
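As one possible illustration of keyword finding in web content mining, the sketch below scores terms with TF-IDF. The slides do not name a specific method, so TF-IDF is an assumption here, as are the function name tfidf_keywords, the whitespace tokenizer and the three sample page texts. Documents are assumed to be non-empty.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=2):
    """Return the top_k highest-scoring terms for each document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by score (descending), then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


pages = [
    "web mining extracts knowledge from web pages",
    "web usage mining studies server logs",
    "web structure mining studies hyperlinks",
]
keywords = tfidf_keywords(pages, top_k=1)
print(keywords)
```

Terms such as "web" and "mining" occur in every sample page, so their score is zero and they are never chosen as keywords.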
WEB STRUCTURE MINING
• The structure of a typical Web graph consists of Web pages as nodes, and
hyperlinks as edges connecting two related pages
• Web Structure Mining is the process of discovering structure information from the Web
• Finding information about web pages and drawing inferences from hyperlinks
• Retrieving information about the relevance and the quality of a web page
• This type of mining can be performed either at the (intra-page) document
level or at the (inter-page) hyperlink level
• Finding authoritative Web pages
– Retrieving pages that are not only relevant but are also of high quality, or
authoritative on the topic
WEB STRUCTURE MINING
• Hyperlinks can be used to infer the notion of authority
– The Web consists not only of pages, but also of hyperlinks pointing from
one page to another
– These hyperlinks contain an enormous amount of latent human annotation
– A hyperlink pointing to another Web page can be considered the
author's endorsement of that page
• Discovering the link structure of the hyperlinks at the inter-document level
generates a structural summary of the Website and its Web pages:
– Categorizing the Web pages and the generated information based on
the hyperlinks
– Discovering the structure of the Web document itself
– Discovering the nature of the hierarchy or network of hyperlinks in the
Website of a particular domain
• The research at the hyperlink level is also called Hyperlink Analysis
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
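One well-known way to turn "hyperlinks as endorsements" into authority scores is the HITS algorithm (hubs and authorities). The slides do not say which algorithm is meant, so HITS is used here only as an example; the function name hits, the iteration count and the toy link graph are assumptions.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a graph given as page -> linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: a page is a good authority if good hubs point to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: a page is a good hub if it points to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: three portal pages endorsing two content pages.
toy_links = {
    "portal1": ["pageA", "pageB"],
    "portal2": ["pageA", "pageB"],
    "portal3": ["pageA"],
}
hub, auth = hits(toy_links)
best_authority = max(auth, key=auth.get)
print(best_authority)
```

In this toy graph the page with the most incoming endorsements receives the highest authority score, and the portals that link to both content pages receive higher hub scores than the one that links to only one.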
WEB USAGE MINING
• Web usage mining is also known as Web log mining
• What is Usage mining?
– Discovering user ‘navigation patterns’ from web data
– Predicting user behavior while the user interacts with the web
– Helps to improve large collections of resources
• Typical sources of data:
– Automatically generated data stored in server access logs, referrer
logs, agent logs and client-side cookies
– User profiles
– Meta data: Page attributes, content attributes, usage data
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
WEB USAGE MINING
• Applications:
– Target potential customers for electronic commerce
– Enhance the quality and delivery of Internet information
services to the end user
– Improve Web server system performance
– Identify potential prime advertisement locations
– Facilitates personalization of sites
– Improve site design
– Fraud/intrusion detection
– Predict user’s actions (allows pre-fetching)
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
Problems with Web Logs
• Typically a 30-minute inactivity timeout is used to decide when a session ends
• Web content may be dynamic
– May not be able to reconstruct what the user saw
• Spiders and automated agents request web pages automatically, so not every
log entry reflects a human visit
• Like most data mining tasks, web log mining requires preprocessing
– To identify users
– To match sessions to other data
– To fill in missing data
– Essentially, to reconstruct the click stream
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
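A minimal sketch of the session reconstruction step, assuming the 30-minute inactivity timeout mentioned above. The record layout (user, timestamp, url), the use of an IP address as the user key and the sample requests are assumptions made for the example. As the slides note, identifying users reliably is itself a hard problem.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp, url) requests into per-user sessions."""
    sessions = []
    prev_user, prev_time = None, None
    # Sort by user, then by time, so each user's requests are contiguous.
    for user, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        # A new session starts for a new user or after a long enough gap.
        new_session = (user != prev_user) or (ts - prev_time > timeout)
        if new_session:
            sessions.append((user, [url]))
        else:
            sessions[-1][1].append(url)
        prev_user, prev_time = user, ts
    return sessions


t0 = datetime(2009, 4, 21, 10, 0)
log = [
    ("10.0.0.1", t0, "/index.html"),
    ("10.0.0.1", t0 + timedelta(minutes=5), "/courses.html"),
    ("10.0.0.2", t0 + timedelta(minutes=6), "/index.html"),
    ("10.0.0.1", t0 + timedelta(minutes=50), "/faculty.html"),
]
sessions = sessionize(log)
for user, urls in sessions:
    print(user, urls)
```

The 45-minute gap before the last request of 10.0.0.1 exceeds the timeout, so that request opens a new session.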
Problems with Web Logs
• Identifying users
– Clients may have multiple streams
– Clients may access web from multiple hosts
– Proxy servers: many clients/one address
– Proxy servers: one client/many addresses
• Data not in log
– POST data (i.e., CGI request) not recorded
– Cookie data stored elsewhere
• Other issues
– When does a session end
– Pages may be cached
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
Web Log – Data Mining Applications
• Association rules
– Find pages that are often viewed together
• Clustering
– Cluster users based on browsing patterns
– Cluster pages based on content
• Classification
– Relate user attributes to patterns
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
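The "pages often viewed together" idea can be sketched by counting page pairs across sessions and computing support and confidence. The function names, the min_support value and the sample sessions below are assumptions; a full association rule miner (e.g. Apriori) would also handle itemsets larger than pairs.

```python
from collections import Counter
from itertools import combinations


def frequent_page_pairs(sessions, min_support=0.5):
    """Return {pair: support} for page pairs seen in at least min_support of sessions."""
    pair_counts = Counter()
    for pages in sessions:
        # Each pair is counted at most once per session.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    n = len(sessions)
    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }


def confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing the antecedent that also contain the consequent."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    return sum(consequent in s for s in with_antecedent) / len(with_antecedent)


sample_sessions = [
    ["/index", "/courses", "/cse590"],
    ["/index", "/courses"],
    ["/index", "/faculty"],
    ["/courses", "/cse590"],
]
pairs = frequent_page_pairs(sample_sessions, min_support=0.5)
rule_conf = confidence(sample_sessions, "/cse590", "/courses")
print(pairs, rule_conf)
```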
Web Logs
• Web servers have the ability to log all requests
• Web server log formats:
– Most use the Common Log Format (CLF)
– The newer Extended Log Format allows the contents of the log file to be configured
• Design of a Web Log Miner:
– Web log is filtered to generate a relational database
– A data cube is generated from the database
– OLAP is used to drill-down and roll-up in the cube
– OLAM is used for mining interesting knowledge
[Figure: Web log -> (1) Data Cleaning -> Database -> (2) Data Cube Creation -> Data Cube -> (3) OLAP -> Sliced and diced cube -> (4) Data Mining -> Knowledge]
$R(p) = \epsilon/n + (1-\epsilon)\cdot\sum_{(q,p)\in G} R(q)/\mathrm{outdegree}(q)$
Web Logs
http://mate.dm.uba.ar/~pfmislej/web%20mining/web%20mining.pdf
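The first step of the Web Log Miner design, filtering raw log lines into database-ready records, can be sketched for the Common Log Format as follows. The regular expression, the field names and the sample line are assumptions made for the example. Real logs often use extended variants with extra fields, and month-name parsing here assumes an English locale.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Parse one CLF line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" size means no body was returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample_line = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
record = parse_clf(sample_line)
print(record)
```

Records of this shape can be loaded into a relational table, which is the starting point for the data cube and OLAP steps listed above.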
WEB MINING APPLICATIONS
• Personalization, Recommendation engines
• Web-commerce applications
• Intelligent web search
• Hypertext classification and Categorization
• Information/trend monitoring
• Analysis of online communities
• Improving the relationship between the website and the user
– Recommendations to modify the web site structure and content
– Web personalization
– Intelligent web sites: systems that, “based on the user
behavior, allow implementation of changes to the current web site
structure and content”
paginas.fe.up.pt/~ec/files_0506/slides/06_WebMining.pdf
Personalization of Webpages
http://www.ieee.org.ar/downloads/Srivastava-tut-pres.pdf
CONCLUSION
• Web has been adopted as a critical communication and
information medium by a majority of the population.
• Web data is growing at a significant rate.
• A number of new Computer Science concepts and techniques
have been developed.
• Many successful applications exist.
• Fertile area of research.
• Privacy – real debate needed.
VISUAL WEB MINING
Presentation 2
CSE 590 DATA MINING
Prof. Anita Wasilewska
SUNY Stony Brook
Presented By:
Alka Simha 106677801
Avanthi Gupta 106616697
Megha Krishnamurthy 106616749
Visual Web Mining
WWW2004 (International World Wide Web Conference), May 17–22, 2004, New York, New York, USA.
ACM 1-58113-912-8/04/0005
Amir H. Youssefi, Rensselaer Polytechnic Institute, youssefi@cs.rpi.edu
David J. Duke, University of Bath, d.duke@bath.ac.uk
Mohammed J. Zaki, Rensselaer Polytechnic Institute, zaki@cs.rpi.edu
References
• http://www.cs.rpi.edu/~zaki/PS/WWW04.pdf
• http://www.cs.rpi.edu/~youssefi/research/VWM/
• http://www.vtk.org/
• http://www.w3.org/Robot/
• http://www.cs.rpi.edu
Overview
• What is Visual Web Mining
• Abstract
• Introduction
• Visual Web Mining Architecture
• Visual Representation
• Design and Implementation of diagrams
• Conclusion
What is Visual Web Mining
The application of information visualization techniques to the results of Web
Mining, in order to further amplify the perception of extracted patterns
and to visually explore new ones in the web domain.
http://www.cs.rpi.edu/~youssefi/research/VWM/
Abstract
Analysis of web site usage data involves two significant challenges:
• Volume of data arising from the growth of the web.
• Structural complexity of web sites.
In this paper, the authors:
• Apply Data Mining and Information Visualization techniques to the
web domain, in order to benefit from the power of both human visual
perception and computing.
• Apply Data Mining techniques to large web data sets and use
Information Visualization methods on the results.
GOAL:
- To correlate the outcomes of mining Web Usage Logs and the
extracted Web Structure, by visually superimposing the results.
Introduction
• Information Visualization
Visual representations of abstract data, using computer-supported,
interactive visual interfaces to reinforce human cognition; thus
enabling the viewer to gain knowledge about the internal structure of
the data and relationships in it.
• Visual Web Mining Framework
Provides a prototype implementation for applying information
visualization techniques to the results of Data Mining.
• User Session
Compact sequence of web accesses by a user.
• Visualization is used in order to understand:
- The structure of a particular website.
- Web surfers’ behavior when visiting that website.
• Due to the large dataset and the structural complexity of the sites,
3D visual representations are used.
• Implemented using an open source toolkit called the Visualization
Toolkit (VTK).
- VTK consists of a C++ class library and several interpreted
interface layers including Tcl/Tk, Java, and Python.
http://www.vtk.org/
Visual Web Mining Architecture
• Input:
- Web pages and Web server log files.
- A web robot (webbot) is used to retrieve the pages of the website.
- The webbot is a very fast Web walker with support for regular
expressions, SQL logging facilities, and many other features. It can be
used to check links, find bad HTML, map out a web site, download
images, etc.
• In parallel, Web Server Log files are downloaded and processed
through a sessionizer and a LOGML file is generated.
• The Integration Engine is a suite of programs for data preparation, i.e.,
extracting, cleaning, transforming and integrating data and finally
loading into database and later generating graphs in XGML.
http://www.w3.org/Robot/
Visual Web Mining Architecture
• User sessions are extracted from the web logs, which yields results that can be roughly
attributed to a specific user.
• User sessions are then converted into a special format for Sequence Mining using
cSPADE (constrained SPADE; SPADE stands for Sequential PAttern Discovery using Equivalence classes).
• Outputs:
- Frequent contiguous sequences with a given minimum support.
- These are imported into a database, and non-maximal frequent sequences are
removed.
- Different queries are executed against this data according to some criterion, e.g.
support of each pattern, length of patterns, etc.
- Different URLs which correspond to the same webpage are unified in the final
results.
• The Visualization Stage: Maps the extracted data and attributes into visual images,
realized through VTK extended with support for graphs.
• Result: Interactive 3D/2D visualizations which could be used by analysts to compare
actual web surfing patterns with expected patterns.
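The outputs listed above (frequent contiguous sequences with a minimum support, with non-maximal ones removed) can be illustrated with a brute-force sketch. This is not cSPADE, which is far more efficient. The function names, the absolute support count and the sample sessions are assumptions made for the example.

```python
from collections import defaultdict


def frequent_contiguous(sessions, min_support):
    """Return {sequence: support} for contiguous page sequences in >= min_support sessions."""
    support = defaultdict(int)
    for session in sessions:
        seen = set()
        for i in range(len(session)):
            for j in range(i + 1, len(session) + 1):
                seen.add(tuple(session[i:j]))
        # Each sequence counts once per session, however often it occurs.
        for seq in seen:
            support[seq] += 1
    return {seq: n for seq, n in support.items() if n >= min_support}


def is_contiguous_subsequence(short, long):
    """True if `short` appears as an unbroken run inside `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def keep_maximal(frequent):
    """Drop every sequence that is contained in a longer frequent sequence."""
    return {
        seq: n
        for seq, n in frequent.items()
        if not any(
            len(other) > len(seq) and is_contiguous_subsequence(seq, other)
            for other in frequent
        )
    }


sample_sessions = [
    ["home", "courses", "cse590"],
    ["home", "courses", "cse634"],
    ["faculty", "home", "courses"],
]
frequent = frequent_contiguous(sample_sessions, min_support=2)
maximal = keep_maximal(frequent)
print(maximal)
```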
Visual Representation
Structures :
- Graphs
Extract a spanning tree from the site structure, and use this as the
framework for presenting access-related results through glyphs (small
graphical symbols) and color mapping.
- Stream Tubes
Variable-width tubes showing access paths with different traffic are
introduced on top of the web graph structure.
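Extracting a spanning tree from the site structure can be sketched with a breadth-first search from the first page. The slides do not say how the tree is built, so the breadth-first strategy, the function name and the toy site graph are assumptions.

```python
from collections import deque


def bfs_spanning_tree(links, root):
    """Return {page: parent} for a breadth-first spanning tree rooted at `root`."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # Keep only the first link that reaches a page; other links
            # (back links, cross links) are not part of the tree.
            if target not in parent:
                parent[target] = page
                queue.append(target)
    return parent


# Hypothetical site graph: page -> pages it links to.
site = {
    "/": ["/faculty", "/courses"],
    "/faculty": ["/", "/courses"],
    "/courses": ["/cse590", "/"],
    "/cse590": ["/courses"],
}
tree = bfs_spanning_tree(site, "/")
print(tree)
```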
Design and Implementation of Diagrams
This is a visualization of the web graph
of the Computer Science department of
Rensselaer Polytechnic Institute.
Strahler numbers are used for assigning
colors to edges.
One can see user access paths
scattering from first page of website
(the node in center) to cluster of web
pages corresponding to faculty pages,
course home pages, etc.
2D visualization layout with Strahler
Coloring applied on web usage logs
The Strahler number is a numerical measure of branching complexity; here it is used to
assign colors to the edges.
http://www.cs.rpi.edu
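A minimal sketch of how a Strahler number can be computed on a tree, using the common river-network definition: a leaf gets 1, and a node gets the largest value among its children, plus one if that largest value occurs at least twice. The tree layout and node names are assumptions, and the paper may use a generalized variant for non-tree graphs.

```python
def strahler(tree, node):
    """Strahler number of `node`; `tree` maps a node to its list of children."""
    children = tree.get(node, [])
    if not children:
        return 1  # a leaf has the lowest branching complexity
    values = sorted((strahler(tree, child) for child in children), reverse=True)
    # Two or more children sharing the maximum value raise the order by one.
    if len(values) >= 2 and values[0] == values[1]:
        return values[0] + 1
    return values[0]


# Hypothetical site tree: root has two subtrees of different complexity.
toy_tree = {
    "root": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
}
numbers = {node: strahler(toy_tree, node) for node in ["root", "a", "b", "c"]}
print(numbers)
```

A long chain of single links keeps the value 1, so heavily branching parts of the site stand out when the numbers are mapped to colors.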
Adding a third dimension enables
visualization of more information and
clarifies user behavior in and between
clusters. The center node of the circular
basement is the first page of the web site,
from which users scatter to different
clusters of web pages. The color spectrum
from Red (entry points into clusters) to
Blue (exit points) clarifies the behavior of
users.
The cylinder-like part of this figure is
a visualization of the web usage of surfers
as they browse a long HTML
document.
3D visualization layout with Strahler
Coloring applied on web usage logs
Left: One can observe long user sessions as strings falling off. These are a special type of long
session in which the user navigates a sequence of web pages that come one after the other, e.g.,
sections of a long document. In many cases, web pages with many nodes connected
by Next/Up/Previous hyperlinks were found.
Right: An enlarged view of the same visualization.
Frequent access patterns
extracted by the web mining
process are visualized as a
white graph on top of an
embedded and colorful graph
of web usage.
Superimposition of Frequent Patterns
extracted from Web Mining on top of Web
Usage
Similar to last picture with
addition of another attribute,
i.e., frequency of pattern which
is rendered as thickness of white
tubes.
This helps in the analysis of
results.
Thickness of the tubes represents
frequency of found patterns
Superimposition of Web Usage on top of Web
Structure with higher order layout.
Top node is the first page of the website.
The hierarchical output of the layouts makes analysis
easier.
Higher Order layout for clear
visualization and easier analysis
Left: Superimposition of website dynamics (colored) on top of its static structure (gray)
Right: Zoom view of the colored region with the layout of Web Usage taken from the Web Graph
basement. The basement itself is removed for clarity
Conclusion
- Using the visualizations, a web analyst can easily identify which parts of the
website are cold (few hits) and which parts are hot (many hits),
and classify them accordingly.
- This also paves the way for making exploratory changes to the website,
e.g., adding links from hot parts of the web site to cold parts and then
extracting, visualizing and interpreting the changes in access patterns.
SPADE OVERVIEW
• An algorithm based on Apriori for fast discovery of frequent sequences
• Needs three database scans in order to extract sequential patterns
• Given: a database of customer transactions, each with the
following attributes: sequence-id (or customer-id), transaction-time and
the item involved in the transaction.
• The aim is to obtain typical behaviors according to the user's viewpoint.
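SPADE is usually described as working on a vertical layout in which each item keeps a list of (sequence-id, time) pairs, and longer sequences are found by joining these lists. The sketch below shows that idea for two-item sequences only. The function names and the sample transactions are assumptions, and it omits the equivalence-class decomposition that the full algorithm uses.

```python
from collections import defaultdict


def build_idlists(transactions):
    """Map each item to the list of (sequence_id, time) pairs where it occurs."""
    idlists = defaultdict(list)
    for sid, time, item in transactions:
        idlists[item].append((sid, time))
    return idlists


def support(idlist):
    """Number of distinct sequences (customers) that appear in an id-list."""
    return len({sid for sid, _ in idlist})


def temporal_join(idlist_a, idlist_b):
    """Id-list for 'a followed later by b' within the same sequence."""
    first_a = {}
    for sid, time in idlist_a:
        first_a[sid] = min(time, first_a.get(sid, time))
    # Keep occurrences of b that come after the earliest a in that sequence.
    return [
        (sid, time)
        for sid, time in idlist_b
        if sid in first_a and time > first_a[sid]
    ]


# Hypothetical transactions: (sequence-id, transaction-time, item).
transactions = [
    (1, 10, "A"), (1, 20, "B"),
    (2, 15, "A"), (2, 25, "B"),
    (3, 10, "B"), (3, 30, "A"),
]
idlists = build_idlists(transactions)
a_then_b = temporal_join(idlists["A"], idlists["B"])
print(support(idlists["A"]), support(a_then_b))
```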
User’s browsing access pattern is
amplified by a different coloring.
Depending on the link structure of the
underlying pages, we can see vertical
access patterns of a user drilling down
the cluster, making a cylinder shape.
Users following links down a
hierarchy of webpages make a cone
shape, and users going up hierarchies,
e.g., back to the main page of the website,
make a funnel shape.
Amplification of a user session: Clickstream (Bottom Left) in
drill-down cylinder, Cone Scatter (Top Right) and Funnel
Backoff to main page of website (Top Right)