Data Mining with Big Data
A Project Dissertation
By
Sandip B. Tipayle Patil
Roll No.: MT2013216
Under the guidance of
Prof. Y. N. Patil
Department of Computer Engineering,
Dr. Babasaheb Ambedkar Technological University,
Lonere - 402103, Dist. Raigad, (M.S.) INDIA.
Outline
 Introduction
 What are Data Mining and Big Data?
 How Much Data Really Exists?
 Literature Review
 4 Vs of Big Data
 System
 System Architecture
 Big Data Mining Framework
 Hadoop Framework
 Big Data Challenges and Solutions
 Advantages
 Application Implementation
 Conclusion
Introduction
Interesting Facts
 The volume of business data worldwide, across all companies, doubles every
1.2 years (previously every 1.5 years)
 Every day, 2,500 quadrillion (2.5 quintillion) bytes of data are produced, and more
than 90 percent of all data was produced within the past two years
 An average person today processes more data in a day than a 16th-century
individual did in an entire lifetime
 In recent years, the cost of storage and processing power has dropped significantly
 Bad data or poor data quality costs US businesses $600 billion annually
 Facebook processes 10 TB of data every day; Twitter processes 7 TB
 Google has over 3 million servers processing over 2 trillion searches per year
in 2012 (only 22 million in 2000)
What is …… ?
 Data Mining
 Big Data
What is Data Mining?
 Discovery of useful, possibly unexpected, patterns in data
 Non-trivial extraction of implicit, previously unknown and potentially useful
information from data
 Exploration & analysis, by automatic or semi-automatic means, of large
quantities of data in order to discover meaningful patterns
Data Mining Tasks
 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Sequential Pattern Discovery [Descriptive]
 Regression [Predictive]
 Deviation Detection [Predictive]
 Collaborative Filtering [Predictive]
Decision Trees
Example:
• Conducted a survey to see which customers were interested in a new model car
• Want to select customers for an advertising campaign
Training set (table "sale"):
custId  car     age  city  newCar
c1      taurus  27   sf    yes
c2      van     35   la    yes
c3      van     40   sf    yes
c4      taurus  22   sf    yes
c5      merc    50   la    no
c6      taurus  25   la    no
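A decision tree can be grown from this training set by repeatedly splitting on the attribute with the highest information gain. The sketch below is a minimal ID3-style illustration, not the method used in the project. It assumes only the categorical attributes car and city are used (age is left out to avoid choosing a cut-off), and the function names are hypothetical.

```python
from collections import Counter
from math import log2

# The six survey rows from the training set; newCar is the class label.
ROWS = [
    {"custId": "c1", "car": "taurus", "age": 27, "city": "sf", "newCar": "yes"},
    {"custId": "c2", "car": "van", "age": 35, "city": "la", "newCar": "yes"},
    {"custId": "c3", "car": "van", "age": 40, "city": "sf", "newCar": "yes"},
    {"custId": "c4", "car": "taurus", "age": 22, "city": "sf", "newCar": "yes"},
    {"custId": "c5", "car": "merc", "age": 50, "city": "la", "newCar": "no"},
    {"custId": "c6", "car": "taurus", "age": 25, "city": "la", "newCar": "no"},
]


def entropy(rows, target="newCar"):
    """Shannon entropy (in bits) of the class label over the given rows."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(rows, attr, target="newCar"):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder


def build_tree(rows, attrs, target="newCar"):
    """Return a nested dict {attribute: {value: subtree}} or a leaf label."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {
        best: {
            value: build_tree([r for r in rows if r[best] == value], rest, target)
            for value in sorted({r[best] for r in rows})
        }
    }


def classify(tree, record):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][record[attr]]
    return tree


TREE = build_tree(ROWS, ["car", "city"])
print(TREE)
print(classify(TREE, {"car": "taurus", "city": "sf"}))
```

On this tiny table, car and city give the same information gain at the root, so the choice between them is arbitrary. A record with an attribute value never seen in training (for example a new car model) would raise a KeyError in this sketch.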
What is Big Data?
“Big Data is the frontier of a firm's ability to
store, process, and access (SPA) all the
data it needs to operate effectively, make
decisions, reduce risks, and serve
customers.”
-- Forrester
“Big data is the data characterized by 3
attributes: volume, variety and velocity.”
-- IBM
Big Data is not about the size of the data,
it’s about the value within the data.
What is Big Data?
 The term Big Data describes a massive volume of both structured and
unstructured data that is so large that it is difficult to process using
traditional database and software techniques
 Large data sets, on the order of terabytes and petabytes
 Complex, with different data types and formats
 ‘Big Data’ is similar to ‘small data’, but bigger
 …but bigger data requires different approaches:
 Techniques, tools and architecture
 …with the aim of solving new problems
 …or old problems in a better way
How much data exists?
 2.5 quintillion bytes of data are created EVERY DAY
 IBM: 90 percent of the data in the world today was produced
within the past two years
 Forms of data?
 Examples: Boeing jets, scientific data, sensor data, Internet
data
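To put the daily figure in more familiar units, the short calculation below converts 2.5 quintillion bytes into petabytes and exabytes. It is a back-of-the-envelope sketch that assumes decimal (SI) units, where 1 PB = 1000^5 bytes and 1 EB = 1000^6 bytes; the helper name is hypothetical.

```python
# 2.5 quintillion bytes per day, the figure quoted on the slide.
BYTES_PER_DAY = 2.5e18


def to_unit(n_bytes, power):
    """Convert bytes to a decimal unit: power 4 = TB, 5 = PB, 6 = EB."""
    return n_bytes / 1000 ** power


DAILY_PB = to_unit(BYTES_PER_DAY, 5)   # petabytes per day
DAILY_EB = to_unit(BYTES_PER_DAY, 6)   # exabytes per day
YEARLY_EB = DAILY_EB * 365             # exabytes per year at a constant rate

# 2,500 quadrillion and 2.5 quintillion are the same quantity.
SAME_QUANTITY = (2500 * 10**15 == 25 * 10**17)

print(DAILY_PB, DAILY_EB, YEARLY_EB, SAME_QUANTITY)
```

The yearly figure assumes the daily rate stays constant, which understates growth if the volume keeps doubling.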
Literature Review
 Data has grown tremendously.
 This volume of data is beyond what common software tools can
manage.
 Exploring large volumes of data and extracting useful
information and knowledge is a challenge, and sometimes it is
almost infeasible.
 Most people don’t know what to do with all the data that they
already have.
Structured vs Unstructured Data
Giant Elephant
 Huge data with heterogeneous and diverse dimensionality
‣ represents a huge volume of data
 Autonomous sources with distributed and decentralized control
‣ a main characteristic of Big Data
 Complex and evolving relationships
4 Vs of Big Data
Velocity
• Data Speed
How is Big Data actually used?
Better understand and target customers:
 Companies expand their traditional data sets with social media data,
browser logs, text analytics or sensor data to get a more complete picture
of their customers. The big objective, in many cases, is to create
predictive models.
 Using Big Data, telecom companies can now better predict customer
churn, retailers can predict which products will sell, and car insurance
companies can understand how well their customers actually drive.
Impact of Today
Activity Data
 Simple activities like listening to music or reading a book now
generate data. Digital music players and eBooks collect data on
our activities.
 Your smartphone collects data on how you use it, and your web browser
collects information on what you are searching for.
 Your credit card company collects data on where you shop, and your shop
collects data on what you buy. It is hard to imagine any activity that
does not generate data.
Impact of Today
Photo and Video Image Data
 Think of the pictures we take on our smartphones or digital cameras. We
upload and share hundreds of thousands of them on social media sites
every second.
 A growing number of CCTV cameras record video, and
we upload hundreds of hours of video to YouTube and
other sites every minute.
Need to Process Data:
Gap due to lack of analysis
“Classic” Data: e.g. Yahoo! User DNA
Categories: Registration, Campaign, Behavior, Unknown
 Male, age 32
 Lives in SF
 Lawyer
 Searched on from London last week
 Searched on: “Italian restaurant Palo Alto”
 Searched on: “Hillary Clinton”
 Checks Yahoo! Mail daily via PC & phone
 Has 25 IM buddies, moderates 3 Y! Groups, and hosts a 360 page viewed by
10k people
 Clicked on Sony Plasma TV SS ad
 Spends 10 hours/week on the internet
 Purchased Da Vinci Code from Amazon
How Data Explodes: really big
The same user profile, extended with:
 Social graph (FB)
 Likes & friends' likes
 Professional network - reputation
 Web searches on this person, hobbies, work, location
 Metadata on everything
 Blogs, publications, news, local papers, job info, accidents
System Description:
 Identifies relationships between different ideas
 Capable of handling huge volumes of data
 Uses distributed parallel computing with the help of Hadoop
 Provides a platform for processing data along different dimensions and
summarizing results
 The system architecture must be flexible enough that the components built on top
of it to express the various kinds of processing tasks can tune it to
run these different workloads efficiently
 The system will process this data within reasonable cost and time limits
System Architecture:
Hadoop framework :
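Hadoop's processing model, MapReduce, splits a job into a map step that emits key/value pairs, a shuffle that groups values by key, and a reduce step that combines each group. The sketch below imitates those three phases inside a single Python process to compute an average rating per book. It does not use any Hadoop API, and the records and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input records: (user, book, rating).
RECORDS = [
    ("u1", "book_a", 5),
    ("u2", "book_a", 3),
    ("u1", "book_b", 4),
    ("u3", "book_b", 2),
    ("u2", "book_c", 1),
]


def map_phase(record):
    """Emit one (key, value) pair per record: the book and its rating."""
    _user, book, rating = record
    yield book, rating


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the grouped values for one key into a single result."""
    return key, sum(values) / len(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


AVERAGES = run_job(RECORDS)
print(AVERAGES)
```

On a real cluster the map and reduce calls would run on many machines in parallel over blocks of a distributed file system; here they run one after another purely to show the data flow.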
Big Data Mining Framework
 Big Data Mining Platform
 Big Data Semantics and Application Knowledge
I. Information Sharing and Data Privacy
II. Domain and Application Knowledge
 Big Data Mining Algorithm
I. Local Learning and Model Fusion for Multiple
Information Sources
II. Mining from Sparse, Uncertain, and Incomplete Data
III. Mining Complex and Dynamic Data
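Local learning and model fusion can be illustrated with a small sketch: each information source fits a model on its own data, and only the fitted models are combined. The example below is an assumption-laden toy, not the framework's algorithm. Each hypothetical site learns a single threshold on one numeric feature, and fusion is a simple majority vote.

```python
from collections import Counter
from statistics import mean

# Hypothetical data held at three autonomous sites: (feature value, label).
SITES = [
    [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")],
    [(0.5, "low"), (3.0, "low"), (7.0, "high")],
    [(2.5, "low"), (6.0, "high"), (10.0, "high")],
]


def train_local(rows):
    """Fit a one-threshold model: the midpoint between the two class means."""
    low = mean(x for x, label in rows if label == "low")
    high = mean(x for x, label in rows if label == "high")
    return (low + high) / 2


def predict_local(threshold, x):
    """Classify a value with one site's threshold."""
    return "high" if x > threshold else "low"


def predict_fused(thresholds, x):
    """Fuse the local models by majority vote over their predictions."""
    votes = Counter(predict_local(t, x) for t in thresholds)
    return votes.most_common(1)[0][0]


# Only the thresholds leave each site; the raw rows stay where they are.
THRESHOLDS = [train_local(rows) for rows in SITES]
print(THRESHOLDS)
print(predict_fused(THRESHOLDS, 4.6))
```

Because only the thresholds are shared, the raw records never leave their source, which is one way to reconcile information sharing with data privacy.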
Big Data mining Framework
Challenges
 Location of Big Data sources: Big Data is commonly
stored in different locations
 Volume of Big Data: the size of the data grows
continuously
 Hardware resources: RAM capacity
 Privacy: medical reports, bank transactions
 Having domain knowledge
 Getting meaningful information
Solutions
 Parallel computing programming
 An efficient computing platform will not rely on
centralized data storage; instead, the data will be
distributed across large-scale storage
 Restricting access to the data
 New tools such as Hadoop, Flume, Sqoop, R and Pig
Application Implementation
Book Recommendation
Diagram labels: User, Admin, registration, upload, Data Storage, Report,
Hadoop parallel processing
Admin Data Flow:
OWNER: View books, Upload books, display, View ratings
Validates Data, then Data Storage (shown three times in the diagram)
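The slides do not say which recommendation method the application uses. As one plausible illustration, the sketch below applies user-based collaborative filtering: it measures how similar two users are from the books both have rated, then scores unseen books by a similarity-weighted average. The ratings, user names and function names are all hypothetical.

```python
from math import sqrt

# Hypothetical ratings: user -> {book: rating on a 1-5 scale}.
RATINGS = {
    "alice": {"book_a": 5, "book_b": 4, "book_c": 1},
    "bob": {"book_a": 4, "book_b": 5, "book_d": 5},
    "carol": {"book_a": 1, "book_c": 5, "book_d": 2, "book_e": 4},
}


def cosine(u, v):
    """Cosine similarity computed over the books both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[b] * v[b] for b in shared)
    norm_u = sqrt(sum(u[b] ** 2 for b in shared))
    norm_v = sqrt(sum(v[b] ** 2 for b in shared))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=1):
    """Rank books the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for book, rating in their.items():
            if book not in ratings[user]:
                scores[book] = scores.get(book, 0.0) + sim * rating
                weights[book] = weights.get(book, 0.0) + sim
    ranked = sorted(scores, key=lambda b: scores[b] / weights[b], reverse=True)
    return ranked[:top_n]


print(recommend("alice", RATINGS))
```

In a Hadoop setting, the pairwise similarity computation is the part that would typically be distributed, since it grows quickly with the number of users.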
Advantages:
 Fast response
 Extracts useful information
 Predicts the required data from a large amount of data
 Presents results more clearly in the form of visualizations
Conclusion
 We have entered an era of Big Data. Through better analysis of the large
volumes of data that are becoming available, there is the potential for
making faster advances in many scientific disciplines and for improving the
profitability and success of many enterprises by using technologies such as
Hadoop and Pig.
 This system will be serviceable across a large variety of application
domains; the challenges it targets are therefore not cost-effective to
address in the context of one domain alone.
 Furthermore, this system aims to provide transformative solutions that
carry over naturally to the next generation of industrial applications.
 We must support and encourage this framework in addressing the
technical challenges of unstructured data if we are to achieve the promised
benefits of Big Data.