Apache Mahout and R - Mining Complex Data Objects
B. Sakthibala
I M.Sc. (CS)
Apache Mahout
Apache Mahout is a project of the Apache Software
Foundation to produce free implementations of distributed
or otherwise scalable machine learning algorithms focused
primarily on linear algebra.
In the past, many of the implementations used the Apache
Hadoop platform; today, however, the project is primarily
focused on Apache Spark.[3][4]
Mahout also provides Java/Scala libraries for common
math operations (focused on linear algebra and statistics)
and primitive Java collections. Mahout is a work in
progress; a number of algorithms have been implemented.[5]
Apache Mahout is an open source
project that is primarily used for creating
scalable machine learning algorithms. It
implements popular machine learning
techniques such as:
● Recommendation
● Classification
● Clustering
Apache Mahout started as a sub-project
of Apache Lucene in 2008. In 2010,
Mahout became a top-level project of
Apache.
Features of Mahout
The main features of Apache Mahout are listed below.
● Mahout's classic algorithms are written on top of Hadoop, so they work well in
a distributed environment. Mahout uses the Apache Hadoop library to scale
effectively in the cloud.
● Mahout offers the coder a ready-to-use framework for doing data mining tasks
on large volumes of data.
● Mahout lets applications analyze large sets of data effectively and
quickly.
● Includes several MapReduce-enabled clustering implementations such as
k-means, fuzzy k-means, Canopy, Dirichlet, and Mean-Shift.
● Supports Distributed Naive Bayes and Complementary
Naive Bayes classification implementations.
● Comes with distributed fitness function capabilities for
evolutionary programming.
● Includes matrix and vector libraries.
Applications of Mahout:
● Companies such as Adobe, Facebook,
LinkedIn, Foursquare, Twitter, and Yahoo
use Mahout internally.
● Foursquare helps you find places,
food, and entertainment available in a
particular area. It uses the recommender
engine of Mahout.
● Twitter uses Mahout for user interest
modelling.
● Yahoo! uses Mahout for pattern mining.
Mahout machine
learning:
Machine learning is a branch of
science that deals with
programming the systems in such a
way that they automatically learn
and improve with experience. Here,
learning means recognizing and
understanding the input data and
making wise decisions based on
the supplied data.
The developed algorithms form the basis
of various applications such as:
● Vision processing
● Language processing
● Forecasting (e.g., stock market trends)
● Pattern recognition
● Games
● Data mining
● Expert systems
● Robotics
Machine learning is a vast area, and
covering all of its features is beyond
the scope of this tutorial. There
are several ways to implement
machine learning techniques;
the most commonly used
ones are supervised and
unsupervised learning.
[Figure: supervised learning vs. unsupervised learning]
Supervised Learning:
Supervised learning deals with learning a function from
available training data. A supervised learning algorithm
analyzes the training data and produces an inferred
function, which can be used for mapping new examples.
Common examples of supervised learning include:
● classifying e-mails as spam,
● labeling webpages based on their
content, and
● voice recognition.
There are many supervised learning
algorithms, such as neural networks,
Support Vector Machines (SVMs), and
Naive Bayes classifiers. Mahout
implements the Naive Bayes classifier.
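To make "learning a function from available training data" concrete, here is a minimal sketch in plain Python. It assumes the simplest possible case, a straight-line fit by least squares, and the name `fit_line` and the sample data are invented for illustration. It is not how Mahout implements its algorithms.

```python
def fit_line(xs, ys):
    """Learn a line y = slope * x + intercept from labeled (x, y) pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares estimates of slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The "inferred function" that maps new examples to predictions.
    return lambda x: slope * x + intercept


# Hypothetical training data that follows y = 2x + 1.
predict = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(predict(5))
```

The training pairs play the role of labeled examples, and the returned function is what gets applied to inputs the algorithm has never seen.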
Unsupervised learning:
Unsupervised learning makes
sense of unlabeled data without
any predefined labeled dataset for
training. It
is an extremely powerful tool for
analyzing available data and looking
for patterns and trends. It is most
commonly used for clustering
similar input into logical groups.
Common approaches to
unsupervised learning include:
● k-means
● self-organizing maps, and
● hierarchical clustering
Recommendation:
Recommendation is a popular technique that suggests relevant items
based on user information such as previous purchases,
clicks, and ratings.
● Amazon uses this technique to display a list of recommended items
that you might be interested in, drawing information from your past
actions. There are recommender engines that work behind Amazon to
capture user behavior and recommend selected items based on your
earlier actions.
● Facebook uses the recommender technique to identify and recommend
the "People You May Know" list.
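One common way to build such a recommender is user-based collaborative filtering: find users with similar ratings and suggest what they liked. The sketch below is a plain-Python illustration under that assumption. The ratings, user names, and the functions `cosine` and `recommend` are all hypothetical, and the code does not use Mahout's recommender API.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"book": 5, "lamp": 3, "pen": 4},
    "bob": {"book": 5, "lamp": 3, "mug": 5},
    "carol": {"lamp": 5, "pen": 1, "mug": 2},
}


def cosine(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        if sim <= 0:
            continue
        for item, rating in their.items():
            if item in ratings[user]:
                continue  # only suggest items the user has not seen
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [item for _, item in ranked[:top_n]]


print(recommend("alice", ratings))
```

Here "previous purchases, clicks, and ratings" are reduced to a single rating per item; a production system would combine many such signals.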
Classification:
Classification, also known as categorization, is a machine
learning technique that uses known data to determine how
the new data should be classified into a set of existing
categories. Classification is a form of supervised learning.
● Mail service providers such as Yahoo! and Gmail use this technique to decide
whether a new mail should be classified as spam. The categorization
algorithm trains itself by analyzing how users mark certain mails as
spam. Based on that, the classifier decides whether a future mail should be
delivered to your inbox or to the spam folder.
● The iTunes application uses classification to prepare playlists.
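The spam example can be sketched with a small multinomial Naive Bayes classifier, the family of classifier the slides say Mahout implements. This is a plain-Python illustration only: the training messages, the `spam`/`ham` labels, and the functions `train` and `classify` are assumptions made for the example, not Mahout code.

```python
from collections import Counter
from math import log


def train(messages):
    """messages: list of (text, label). Returns word counts and label counts."""
    word_counts = {}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Pick the label with the highest log-probability for the text."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_msgs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        score = log(label_counts[label] / total_msgs)  # prior
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical mails that users have already marked.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train(training)
print(classify("free money offer", *model))
print(classify("team meeting at noon", *model))
```

The "known data" is the set of marked mails, and each new mail is assigned to whichever existing category makes its words most probable.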
Clustering:
Clustering is used to form groups or clusters of similar data based on common
characteristics. Clustering is a form of unsupervised learning.
● Search engines such as Google and Yahoo! use clustering techniques to
group data with similar characteristics.
● Newsgroups use clustering techniques to group various articles based on
related topics.
The clustering engine scans all of the input data and, based on the
characteristics of each item, decides which cluster it should be grouped under.
Take a look at the following example.
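A minimal sketch of this idea, assuming plain k-means on 2-D points with fixed starting centroids. The function `kmeans` and the sample points are invented for illustration; Mahout's own k-means is a distributed implementation and is not shown here.

```python
def kmeans(points, centroids, iterations=10):
    """Group 2-D points around the nearest centroid, then re-center."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
print(clusters)
```

No labels are supplied; the groups emerge only from the distances between points, which is what makes clustering a form of unsupervised learning.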
Mining Complex Data Objects
Mining Complex Types of Data
● Mining time-series and sequence data
● Mining the World-Wide Web
● Mining spatial databases
● Mining multimedia databases
● Summary
Mining Time-Series and Sequence Data
● Time-series database
  • Consists of sequences of values or events changing with time
  • Data is recorded at regular intervals
  • Characteristic time-series components: trend, cycle, seasonal, irregular
● Applications
  • Financial: stock price, inflation
  • Biomedical: blood pressure
  • Meteorological: precipitation
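The trend and seasonal components listed above can be separated with a centered moving average. The sketch below assumes an odd window equal to the seasonal period; the function `moving_average` and the sample series are hypothetical.

```python
def moving_average(series, window):
    """Centered moving average (odd window); None where the window does not fit."""
    half = window // 2
    out = []
    for i in range(len(series)):
        if i < half or i + half >= len(series):
            out.append(None)
        else:
            chunk = series[i - half:i + half + 1]
            out.append(sum(chunk) / len(chunk))
    return out


# Hypothetical series: linear trend 0, 1, 2, ... plus a seasonal
# pattern (+1, 0, -1) that repeats every 3 observations.
series = [1, 1, 1, 4, 4, 4, 7, 7, 7]
trend = moving_average(series, 3)
# What is left after removing the trend is the seasonal component.
seasonal = [None if t is None else s - t for s, t in zip(series, trend)]
print(trend)
print(seasonal)
```

Because the window spans exactly one season, the seasonal effects cancel inside each average and only the trend remains.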
Mining the World-Wide Web
● The WWW is a huge, widely distributed, global information service center for:
  • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
  • Hyperlink information
  • Access and usage information
● The WWW provides rich sources for data mining
● Challenges
  • Too huge for effective data warehousing and data mining
  • Too complex and heterogeneous: no standards and structure
Web search engines
● Index-based: search the Web, index Web pages, and build and store huge keyword-based indices
● Help locate sets of Web pages containing certain keywords
● Deficiencies
  • A topic of any breadth may easily contain hundreds of thousands of documents
  • Many documents that are highly relevant to a topic may not contain keywords defining them
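A keyword-based index of the kind described above can be sketched as an inverted index that maps each word to the pages containing it. The page names, page text, and the functions `build_index` and `search` are assumptions for illustration, not any real search engine's design.

```python
from collections import defaultdict


def build_index(pages):
    """Map each keyword to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages.
pages = {
    "a.html": "data mining on the web",
    "b.html": "web search engines index pages",
    "c.html": "mining time series data",
    "d.html": "used cars for sale",
}
index = build_index(pages)
print(search(index, "data mining"))
print(search(index, "automobile"))
```

The second query illustrates the deficiency noted above: `d.html` is relevant to "automobile" but is never returned, because the page does not contain that keyword.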
Web Mining: A more challenging task
● Searches for
  • Web access patterns
  • Web structures
  • Regularity and dynamics of Web contents
● Problems
  • The "abundance" problem
  • Limited coverage of the Web: hidden Web sources, majority of data in DBMS
  • Limited query interface based on keyword-oriented search
  • Limited customization to individual users
Web Mining Taxonomy
● Web Mining
  • Web Structure Mining
  • Web Content Mining
    ◦ Web Page Content Mining
    ◦ Search Result Mining
  • Web Usage Mining
    ◦ General Access Pattern Tracking
    ◦ Customized Usage Tracking
Mining the World-Wide Web: Web Page Content Mining
● Web page summarization
  • WebLog (Lakshmanan et al. 1996), WebOQL (Mendelzon et al. 1998), …: Web structuring query languages; can identify information within given web pages
  • Ahoy! (Etzioni et al. 1997): uses heuristics to distinguish personal home pages from other web pages
  • ShopBot (Etzioni et al. 1997): looks for product prices within web pages
Mining the World-Wide Web: Design of a Web Log Miner
● The Web log is filtered to generate a relational database
● A data cube is generated from the database
● OLAP is used to drill down and roll up in the cube
● OLAM (on-line analytical mining) is used for mining interesting knowledge
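The first three steps of this design can be sketched in plain Python. The log format (`date path status`), the sample lines, and the functions `filter_log`, `build_cube`, and `roll_up` are all hypothetical simplifications; a real miner would use a database and an OLAP engine, and the OLAM step is not shown.

```python
from collections import Counter

# Hypothetical simplified log lines: "date path status".
raw_log = [
    "2024-01-01 /products/pen 200",
    "2024-01-01 /products/mug 200",
    "2024-01-01 /favicon.ico 200",
    "2024-01-02 /products/pen 404",
    "2024-01-02 /help/faq 200",
]


def filter_log(lines):
    """Step 1: clean the log into relational-style rows (date, section, status)."""
    rows = []
    for line in lines:
        date, path, status = line.split()
        if path.endswith(".ico"):
            continue  # drop requests that carry no usage information
        section = path.strip("/").split("/")[0]
        rows.append((date, section, int(status)))
    return rows


def build_cube(rows):
    """Step 2: count hits per (date, section, status) cell."""
    return Counter(rows)


def roll_up(cube, keep):
    """Step 3: aggregate away every dimension whose index is not in `keep`."""
    out = Counter()
    for key, count in cube.items():
        out[tuple(key[i] for i in keep)] += count
    return out


cube = build_cube(filter_log(raw_log))
print(roll_up(cube, (1,)))    # rolled up to hits per section
print(roll_up(cube, (0, 1)))  # drilled down to hits per date and section
```

Rolling up keeps fewer dimensions and gives coarser totals; drilling down keeps more dimensions and exposes the detail behind a total.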