Web Topics
World Wide Web
  Presented by Bharath Praveen Swathi
World Wide Web
  The World Wide Web was created in 1989 by Tim Berners-Lee, working at the European Organization for Nuclear Research (CERN) near Geneva, Switzerland, and made publicly available in 1991
  The Web is a way of accessing information over the Internet; it is not the Internet itself
  The Internet is a network of networks that also carries other services, such as email (SMTP) and file sharing (FTP)
  The Web is a system of interlinked documents viewed through a Web browser
  The first Web browser, written by Tim Berners-Lee and introduced in early 1991, ran on NeXT
Architecture
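URI, URN & URL
  <URI> := <scheme> : <scheme-specific-part>
  Difference between URL, URN, and URI:
    URL: http://www.tmrf.org/kpr/issue1.htm (locates a resource through its access mechanism)
    URN: www.tmrf.org/kpr/issue1.htm#one (shown here as a name without an access mechanism; a formal URN uses the urn: scheme, e.g. urn:ISBN:0-395-36341-1)
    URI: http://www.tmrf.org/kpr/issue1.htm#one (the general identifier; URLs and URNs are both URIs)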
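The <scheme> : <scheme-specific-part> split can be sketched with Python's standard urllib.parse module. This is a minimal illustration, not part of the original slides; the helper name describe is an assumption made for this example, and the urn:isbn value is the ISBN example used later in the deck.

```python
from urllib.parse import urlsplit

def describe(uri):
    """Split a URI into the generic parts named on the slide."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,      # e.g. "http" or "urn"
        "host": parts.netloc,        # empty for a URN, which has no location
        "path": parts.path,
        "fragment": parts.fragment,  # the part after "#"
    }

url_parts = describe("http://www.tmrf.org/kpr/issue1.htm#one")
urn_parts = describe("urn:isbn:0-395-36341-1")
```

The URL carries a host and path that say where to fetch the document, while the URN has only a scheme and a name.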
Web Protocols
  ARP: Address Resolution Protocol
  DHCP: Dynamic Host Configuration Protocol
  DNS: Domain Name System
  FTP: File Transfer Protocol
  HTTP: Hypertext Transfer Protocol
  IMAP: Internet Message Access Protocol
  ICMP: Internet Control Message Protocol
  IRDP: ICMP Router Discovery Protocol
  IP: Internet Protocol
  IRC: Internet Relay Chat Protocol
  POP3: Post Office Protocol version 3
  PAR: Positive Acknowledgment and Retransmission
  RLOGIN: Remote Login
  SMTP: Simple Mail Transfer Protocol
  SSL: Secure Sockets Layer
  SSH: Secure Shell
  TCP: Transmission Control Protocol
  TELNET: TCP/IP Terminal Emulation Protocol
  UDP: User Datagram Protocol
  Related terms (not network protocols)
    DSN: Data Source Name
    UPS: Uninterruptible Power Supply
HTTP
  Hypertext Transfer Protocol
  HTTP 1.1
    Persistent connections
    Pipelining
    Cache validation commands
  Request types: GET, POST, PUT, HEAD, DELETE, TRACE, OPTIONS, CONNECT
Request & Response
  Request: GET, POST
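As a rough sketch of what a GET or POST request looks like on the wire, the function below serializes a minimal HTTP/1.1 request message. The function name build_request, the /submit path and the www.example.org host are hypothetical names chosen for illustration; a real client would normally use a library such as http.client instead of building messages by hand.

```python
def build_request(method, path, host, body=b""):
    """Serialize a minimal HTTP/1.1 request message to bytes."""
    # Request line, then headers. HTTP/1.1 requires the Host header and keeps
    # the connection open (persistent) by default.
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    if body:
        # A request with a body must tell the server how long the body is.
        lines.append(f"Content-Length: {len(body)}")
    head = "\r\n".join(lines) + "\r\n\r\n"  # a blank line ends the headers
    return head.encode("ascii") + body

get_request = build_request("GET", "/kpr/issue1.htm", "www.tmrf.org")
post_request = build_request("POST", "/submit", "www.example.org", b"q=1&r=2")
```

A GET request carries its parameters in the path, while a POST request carries them in the message body after the blank line.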
Languages used
  Client side: HTML, CSS, JavaScript, AJAX, Flex 3
  Server side
    .NET (ASP.NET, VB.NET, C#)
    Java (JSP, Servlets, plain Java classes)
    CGI
    Perl / PHP
  Other languages: Ada 95, AppleScript, BEF & Dylan (similar to Pascal), CCI (Common Client Interface), CMM, Guile, HyperTalk, Icon, KQML (Knowledge Query and Manipulation Language), Linda, Lingo, Lisp, ML, Modula-3, Obliq, Phantom, Python, REXX, ScriptX, SDI (Software Development Interface), VRML
Web 2.0
  AJAX
  Reverse AJAX
  Democracy (wikis, reddit, Digg, YouTube)
  RIA
  SOA
  Mashups
  Widgets
  Feeds, RSS, Web services
  Blogging
  Tagging
Ajax Architecture
Ajax
  Technologies associated
    XHTML & CSS for presentation
    DOM to interact with data
    XML & XSLT for interchange and manipulation of data
    XMLHttpRequest object for asynchronous communication
    JavaScript to integrate all of the above
  Advantages
    Fast: no page reload, only the affected section of a page is updated
  Disadvantages
    Actions are not registered in the browser's history
    Content needs an alternate way to be indexed by search engines
    JavaScript must be enabled in the browser
    Increased server load
Reverse AJAX
  The server pushes data to all connected clients
  DWR: Direct Web Remoting
Mashups
  Mixing multiple services together to produce a new one
  Types: data and enterprise mashups
  Tools: Microsoft Popfly, Yahoo Pipes, Google Mashup Editor
Widgets
  UWA: Universal Widget API from Netvibes
Feeds – RSS, JSON, Atom
Web 3.0
  The Data Web
    Making data as openly accessible and linkable as Web pages
    Querying for data across distributed RDF databases
  Semantic Web
OpenSocial
  A common API for social applications across multiple websites
  Interoperates with any social network that supports the API
  Core services: People & Friends, Activities, Persistence
  Platforms: Google, hi5, MySpace, imeem
  Technologies: HTML, JavaScript, REST, OAuth
Summary
  Making the Web more social
  Current version: 0.7
  Supported by Orkut, MySpace, hi5, Netlog, imeem, LinkedIn
  Easy to get data
  Apache Shindig: open source container for hosting OpenSocial applications
Semantic Web
  Introduction
  History
  Architecture
  Challenges
  Future
  Conclusion
  (Figure: logo of the Semantic Web)
What is the Semantic Web?
  Meaningful representation of data on the World Wide Web
  Can be processed by humans as well as machines on a global scale
Why do we need the Semantic Web?
  Enhanced search and discovery
  Enhanced system and data interoperability
  Knowledge management
  Semantic Web services
  Electronic commerce
History
  1989: Vision of Tim Berners-Lee
  1994: Presented at the first WWW conference
  2002: Architecture
Architecture
  Source: Berners-Lee, T. Semantic Web - XML2000 - Architecture. Retrieved July 11, 2008 from http://www.w3.org/2000/Talks/1206-xml2k-tbl/slide10-0.html
Unicode and URI
  Unicode: international standard for encoding text
    Ex: UTF-8, UTF-16
  URI: Uniform Resource Identifier
    Uniform Resource Locator (URL)
      Identifies a resource via a representation of its primary access mechanism
      Ex: http://seal.ifi.unizh.ch
    Uniform Resource Name (URN)
      Globally unique and persistent, even when the resource ceases to exist or becomes unavailable
      Ex: urn:ISBN:0-395-36341-1
XML and Namespace
  eXtensible Markup Language
    Stores data in related entities
    Provides a standard for storage layout and logical structure
    Supports syntactic interoperability
  Namespace
    Elements and attributes have expanded names
    Expanded name = namespace name + local name
    Namespace name: a name holding a URI
  XML Schema
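The "expanded name = namespace name + local name" rule can be seen with Python's standard xml.etree.ElementTree parser, which reports each tag as {namespace}local. This is only an illustrative sketch; the prefixes t and x are arbitrary choices, and the http://www.example.org/terms/ namespace is the one used in the RDF example later in the deck.

```python
import xml.etree.ElementTree as ET

# Two documents that use different prefixes bound to the same namespace URI.
first = ET.fromstring(
    '<t:creator xmlns:t="http://www.example.org/terms/">John Smith</t:creator>'
)
second = ET.fromstring(
    '<x:creator xmlns:x="http://www.example.org/terms/">John Smith</x:creator>'
)

# ElementTree exposes the expanded name as "{namespace name}local name".
namespace_name, local_name = first.tag[1:].split("}")

# The prefix is only shorthand: both elements have the same expanded name.
same_expanded_name = first.tag == second.tag
```

Because only the expanded name matters, two applications can choose different prefixes and still agree on which element they mean.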
RDF: Resource Description Framework
  Language for representing metadata about Web resources
  Framework for exchanging information between applications without loss of meaning
RDF Model
  Resource: the thing being described by an RDF expression
  Property: a specific aspect, characteristic, attribute, or relation used to describe a resource
  Statement: a specific resource + a named property + the value of that property for that resource
  Represented as a 3-tuple: subject, predicate and object
  Ex: http://www.example.org/index.html has a creator called John Smith
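A minimal sketch of the 3-tuple model in Python, assuming statements are held in a plain in-memory list. The Triple type, the match helper and the http://www.example.org/terms/creator property URI are illustrative assumptions, not an RDF library API.

```python
from collections import namedtuple

# One RDF statement: subject (resource), predicate (property), object (value).
Triple = namedtuple("Triple", "subject predicate object")

statements = [
    Triple("http://www.example.org/index.html",
           "http://www.example.org/terms/creator",
           "John Smith"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]

# "Which resources have John Smith as a value?"
by_john = match(statements, obj="John Smith")
```

Every query, whatever it asks, is a pattern over the same three positions, which is what makes the model uniform.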
RDF Model - Example
  Source: Manola, Miller, McBride (2004, February). RDF Primer. W3C Recommendation.
RDF Model - Example (Contd…)
  Source: Manola, Miller, McBride (2004, February). RDF Primer. W3C Recommendation.
Why RDF and not just XML?
  Many different XML trees can encode a single 3-tuple
  An XML parser cannot distinguish subject, object and property
  RDF model: direct, unambiguous and decentralized
Why RDF and not just XML? (Contd…)
  Example 3-tuple: (index.html, creator, John Smith)
  Relationship: index.html has creator (author) John Smith
    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:exterms="http://www.example.org/terms/">
      <rdf:Description rdf:about="http://www.example.org/index.html">
        <exterms:creator>John Smith</exterms:creator>
      </rdf:Description>
    </rdf:RDF>
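To show that the RDF/XML form maps back to exactly one 3-tuple, the sketch below reads the example with Python's standard xml.etree.ElementTree. It handles only this simple shape (rdf:Description elements with literal-valued child properties), and the function name triples is an assumption for this example; a real application would use a full RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <exterms:creator>John Smith</exterms:creator>
  </rdf:Description>
</rdf:RDF>"""

def triples(rdf_xml):
    """Extract (subject, predicate, object) tuples from simple RDF/XML."""
    root = ET.fromstring(rdf_xml)
    found = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # "{namespace}local" becomes the property URI "namespace" + "local".
            predicate = prop.tag[1:].replace("}", "", 1)
            found.append((subject, predicate, (prop.text or "").strip()))
    return found

result = triples(RDF_XML)
```

The roles of subject, predicate and object are fixed by the RDF syntax itself, so no document-specific knowledge is needed to recover them.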
Why RDF and not just XML? (Contd…)
  Possible XML trees
    <author>
      <uri>Index.html</uri>
      <name>John Smith</name>
    </author>

    <document href="Index.html">
      <author>John Smith</author>
    </document>

    <document>
      <details>
        <uri>href="Index.html"</uri>
        <author>
          <name>John Smith</name>
        </author>
      </details>
    </document>

  or maybe
    <document>
      <author>
        <uri>href="Index.html"</uri>
        <details>
          <name>John Smith</name>
        </details>
      </author>
    </document>
RDF Schema (RDFS)
  A collection of classes authored for a specific purpose or domain
  Classes are organized in a hierarchy
  Describes inheritance hierarchies, class schemas, properties, and the domain, range and restrictions of properties
  Supports extensibility and reusability
  Allows multiple views of the same metadata
RDFS - Example
    <Class ID="Animal"/>
    <Class ID="Male">
      <subClassOf resource="#Animal"/>
    </Class>
    <Class ID="Female">
      <subClassOf resource="#Animal"/>
      <disjointFrom resource="#Male"/>
    </Class>
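The class hierarchy above can be sketched as plain Python data to show what a schema-aware application might infer from it. The dictionary layout, the function names and the extra class Bull are hypothetical additions for illustration; this is not how an RDFS reasoner is actually implemented.

```python
# subClassOf and disjointFrom statements from the example. "Bull" is a
# hypothetical extra class, added only to show inheritance over two levels.
SUBCLASS_OF = {"Male": {"Animal"}, "Female": {"Animal"}, "Bull": {"Male"}}
DISJOINT = [frozenset({"Male", "Female"})]

def superclasses(cls, subclass_of=SUBCLASS_OF):
    """All classes reachable by following subClassOf links upwards."""
    seen, stack = set(), [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def is_consistent(types, subclass_of=SUBCLASS_OF, disjoint=DISJOINT):
    """False if the given types imply membership in two disjoint classes."""
    implied = set(types)
    for t in types:
        implied |= superclasses(t, subclass_of)
    return not any(pair <= implied for pair in disjoint)
```

For instance, superclasses("Bull") follows two subClassOf links, and is_consistent({"Bull", "Female"}) fails because Bull implies Male, which is disjoint from Female.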
Web Ontology Language (OWL)
  Extends RDFS
  Specifies axioms about classes of entities, their properties and their relationships
  Allows inferences to be drawn from the axioms
OWL (Contd…)
  Source: Berners-Lee, T. Semantic Web Road map. Retrieved July 11, 2008 from http://www.w3.org/DesignIssues/Semantic.html
Challenges
  Standardizing the Semantic Web stack
  Developing ontologies
  Converting the existing WWW into the Semantic Web
  Capturing cultural semantics
  Interoperability issues
Some News…
  SPARQL Protocol
  Semantic search engines: Google, Yahoo, Intelliseek
  Jena Semantic Web Toolkit: HP
  Joseki Web API: HP
  Wilbur: Nokia
What is Cloud Computing?
  An emerging computing paradigm where data and services reside in massively scalable data centers and can be ubiquitously accessed from any connected device over the Internet
  Connected devices: 4+ billion phones by 2010 [Source: Nokia], Web 2.0-enabled PCs, TVs, etc.
Characteristics of Cloud Computing
  Virtual: physical location and underlying infrastructure details are transparent to users
  Scalable: able to break complex workloads into pieces to be served across an incrementally expandable infrastructure
  Efficient: Service Oriented Architecture for dynamic provisioning of shared compute resources
  Flexible: can serve a variety of workload types, both consumer and commercial
Cloud Computing Building Blocks
  A massively scalable and flexible computing platform of the future, built on IBM and open source software, for hosting Web 2.0 and SOA applications
  Business benefits
    Cost-efficient model for creating and acquiring information services
    Removes or reduces IT management complexity
    Increases business responsiveness with real-time capacity reallocation
    Powers rich Internet applications
  Enabling technologies
    Open source Linux platform
    Xen open source systems virtualization
    Automated provisioning of computing resources by Tivoli Provisioning Manager
    Systems management and monitoring by IBM Tivoli Monitoring
    Parallel computing clusters using Apache Hadoop
    Open source Eclipse-based development tools for parallel applications
Cloud Computing Architecture
  (Figure labels)
    Provisioning management stack: IBM Monitoring v.6, DB2, Provisioning Manager v.5.1, WebSphere Application Server
    Monitoring; provisioning of bare metal and Xen VMs
    Virtualized infrastructure based on open source Linux and Xen: virtual machines running open source Linux with Xen and the Tivoli Monitoring Agent
    Data center: System x
    Apache
Examples of Cloud Computing Workloads
  Web 2.0 applications
    Software to scan voluminous Wikipedia edits to identify spam
    Organizing global news articles by geographic location
  Data-intensive workloads based on scalable architectures
  Next-generation rich media, such as virtual worlds, streaming video, etc.
  New services can be created and published via a completely integrated Eclipse-based environment
Joint IBM Google Announcement
  Universities participating in the initial pilot
  Train the future workforce with next-generation computing skills
  University initiative to promote open standards and the emerging parallel computing model
  Jointly provide the compute platform of the future, including hardware, software, and services, to support new parallel computing curricula
  Three active "clouds": IBM Almaden Research, Google, U. of Washington
Web Mining
  Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services
Why is Web Information Retrieval Important?
  Research
  Health/Medicine
  Travel
  Business
  Entertainment
  Arts
Why is Web Information Retrieval Difficult?
  The abundance problem: hundreds of irrelevant documents are returned in response to a search query
  Limited coverage of the Web: the largest crawlers cover less than 18% of Web pages
  The Web is extremely dynamic: lots of pages are added, removed and changed every day
  Very high dimensionality (thousands of dimensions)
  Limited query interface based on keyword-oriented search
  Limited customization to individual users
Web Mining Taxonomy
  Web mining divides into Web usage mining, Web structure mining and Web content mining
Web Mining Taxonomy
  Web content mining: focuses on techniques for assisting a user in finding documents that meet a certain criterion (text mining)
  Web structure mining: aims at developing techniques to take advantage of the collective judgment of Web page quality that is available in the form of hyperlinks
  Web usage mining: focuses on techniques to study user behavior when navigating the Web (also known as Web log mining and clickstream analysis)
Web Content Mining
  Can be thought of as extending the work performed by basic search engines
  Search engines have crawlers to search the Web and gather information, indexing techniques to store the information, and query processing support to provide information to users
  Web content mining is the process of extracting knowledge from Web contents
Semi-Structured Data
  Content is, in general, semi-structured
  Example:
    Structured attribute/value pairs: Title, Author, Publication_Date, Length, Category
    Unstructured: Abstract, Content
Text Mining
  Document classification
  Document clustering
  Keyword-based association rules
Web Structure Mining
  Early days: keyword-based searches
    Keywords: "web mining"
    Retrieves documents containing "web" and "mining"
  Later on: cope with
    Synonymy problem
    Polysemy problem
    Stop words
  Modern search engines use link structure as an important source of information
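The early keyword-based approach can be sketched as an inverted index with AND semantics. The sample documents, the stop-word list and the function names below are assumptions made for this illustration; the sketch does nothing about synonymy or polysemy, which is the limitation the slide points out.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "of", "the"}  # illustrative, not a standard list

def build_index(docs):
    """Map each non-stop word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every keyword in the query."""
    terms = [w for w in query.lower().split() if w not in STOP_WORDS]
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

docs = {1: "Web mining overview", 2: "Data mining of the web", 3: "Web design"}
hits_for_query = search(build_index(docs), "web mining")
```

A document that says "internet" instead of "web" would be missed entirely, which is why later engines looked beyond keyword matching to link structure.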
Central Question
  Which useful information can be derived from the link structure of the Web?
Some Answers
  1. Structure of the Internet
  2. Google
  3. HITS: Hubs and Authorities
General Structure of the Web
Google
  Search engine that uses link structure to calculate a quality ranking (PageRank) for each page
  Intuition: PageRank can be seen as the probability that a "random surfer" visits a page
  Example: the keyword CMPE272 is entered by the user
    Select pages containing CMPE272 and pages that have in-links whose anchor text (caption) is CMPE272
  Font sizes of words in text: words in a larger or bolder font are assigned higher weights
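The random-surfer intuition can be sketched as a power iteration over a small link graph. This is a textbook-style simplification, not Google's production ranking; the damping factor of 0.85, the iteration count, the handling of pages without out-links and the four-page graph are all assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # the surfer starts anywhere with equal odds
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            # A page with no out-links is treated as linking to every page.
            targets = links.get(page) or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: C is linked to by A, B and D; nothing links to D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

In this graph C collects rank from three pages and ends up with the highest score, while D, which no page links to, keeps only the random-jump share.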
HITS (Hyperlink-Induced Topic Search)
  HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery
  Premise: sufficiently broad topics contain communities consisting of two types of hyperlinked pages:
    Authorities: highly referenced pages on a topic
    Hubs: pages that "point" to authorities
  A good authority is pointed to by many good hubs; a good hub points to many good authorities
Hubs and Authorities
  Hub pages contain interesting links that point to authorities (the relevant pages)
  Authorities are the targets of hub pages
Web Usage Mining
  Pages contain information
  Links are "roads"
  How do people navigate over the Internet? ⇒ Web usage mining (clickstream analysis)
  Information on navigation paths is logged
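The mutual reinforcement between hubs and authorities can be sketched as an iteration that alternates the two updates and normalizes the scores. The function name hits, the iteration count and the example graph of three hub pages and two authority pages are hypothetical choices for this illustration.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) scores for a graph {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # A good hub points to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub_scores, auth_scores = hits(graph)
```

Here a1 is the strongest authority because all three hubs point to it, and h1 and h2 are better hubs than h3 because they point to both authorities.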
Web Usage Analysis
Data Sources
Web Usage Mining Process
Data Preparation
  Data cleaning
    Irrelevant log entries are removed by checking the suffix of the URL name, for example all entries with filename suffixes such as gif, jpeg, etc.
  User identification
    If a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine
    Other heuristics use a combination of IP address, machine name, browser agent, and temporal information to identify users
  Transaction identification
    A transaction is all of the page references made by a user during a single visit to a site
    The size of a transaction can range from a single page reference to all of the page references
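The three preparation steps can be sketched on a toy access log. The log record layout, the list of ignored suffixes, the 30-minute visit timeout and the use of (IP address, browser agent) as the user key are assumptions for this example; they stand in for the heuristics listed above rather than reproduce a specific tool.

```python
from urllib.parse import urlsplit

IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")  # assumed list
VISIT_TIMEOUT = 30 * 60  # seconds of inactivity that end a visit (assumption)

def clean(entries):
    """Data cleaning: drop requests for images and other embedded files."""
    return [e for e in entries
            if not urlsplit(e["url"]).path.lower().endswith(IGNORED_SUFFIXES)]

def transactions(entries, timeout=VISIT_TIMEOUT):
    """Group page references into one list per user visit."""
    visits, latest = [], {}
    for e in sorted(entries, key=lambda e: e["time"]):
        user = (e["ip"], e["agent"])  # user identification heuristic
        previous = latest.get(user)
        if previous is None or e["time"] - previous[0] > timeout:
            current = []              # a new visit starts a new transaction
            visits.append(current)
        else:
            current = previous[1]
        current.append(e["url"])
        latest[user] = (e["time"], current)
    return visits

log = [
    {"ip": "10.0.0.1", "agent": "A", "time": 0, "url": "/index.html"},
    {"ip": "10.0.0.1", "agent": "A", "time": 5, "url": "/logo.gif"},
    {"ip": "10.0.0.1", "agent": "A", "time": 60, "url": "/about.html"},
    {"ip": "10.0.0.1", "agent": "B", "time": 70, "url": "/index.html"},
    {"ip": "10.0.0.1", "agent": "A", "time": 4000, "url": "/index.html"},
]
cleaned = clean(log)
visits = transactions(cleaned)
```

The same IP address yields three transactions here: two visits by the user with agent A, separated by a long pause, and one visit by a second user with agent B.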
References - Web
  Basham, B., Sierra, K., & Bates, B. (2008). Head First Servlets and JSP. O'Reilly & Associates Inc.
  Harkey, D., Orfali, R., & Edwards, J. Client/Server Survival Guide (3rd ed.). Wiley.
  OpenSocial. (2008). http://www.opensocial.org/
  Praveen, A. (2008). Job quest mashup. http://praveen.987mb.com/Projects/JobDashBoard/HTML/JobQuest.html
  Wikipedia. (2008). http://en.wikipedia.org/wiki/Main_Page
References - Semantic Web
  Berners-Lee, T. (1998, September). Semantic Web Road map. Retrieved July 11, 2008 from http://www.w3.org/DesignIssues/Semantic.html
  Berners-Lee, T. Semantic Web - XML2000. Retrieved July 11, 2008 from http://www.w3.org/2000/Talks/1206-xml2k-tbl/Overview.html
  Manola, Miller, McBride (2004, February). RDF Primer. W3C Recommendation.
  Berners-Lee, T. (1998, September). Why RDF model is different from the XML model. Retrieved July 11, 2008 from http://www.w3.org/DesignIssues/RDF-XML.html
  W3C. (1999, January). Resource Description Framework (RDF) Model and Syntax Specification. Retrieved July 11, 2008 from http://www.w3.org/TR/PR-rdf-syntax/
  Palmer, S. B. (1999, September). The Semantic Web: An introduction. Retrieved July 11, 2008 from http://infomesh.net/2001/swintro/#itWorks
References
  www.umass.edu/research/rld/iln/uploads/Cloud%20Computing%20Oct%2003%20Ext.ppt
  en.wikipedia.org/wiki/Cloud_computing
  infolab.stanford.edu/~ullman/mining/2006/lectureslides/web%20mining%20overview.pdf
  www.cs.uic.edu/~liub/WebContentMining.html
  en.wikipedia.org/wiki/Web_mining
