Data Mining
(DM)
          Lecture 1: Introduction
Introduction
INSTRUCTOR, STUDENTS AND COURSE
Course Description
This course will provide a comprehensive introduction to the data
mining process; build theoretical and conceptual foundations of key
data mining tasks such as item set mining and clustering; discuss
analysis and implementation of algorithms; and introduce major
sub-areas such as text and web mining.
    Textbook(s)/Supplementary Readings
Data Mining: Concepts and Techniques,
   J. Han, M. Kamber, and J. Pei,
   Third Edition, Morgan Kaufmann Publishers, 2012.
References:
• Web Data Mining, B. Liu, Springer, 2006.
• Introduction to Information Retrieval, C. Manning et al., Cambridge
  University Press, 2008 (available online).
• Introduction to Data Mining, P.-N. Tan et al., Addison-Wesley, 2009.
  Tools and Technologies: Weka
     Grading Policy
     Instrument        Description                                          Weight
     Class Exercises   In-class exercises and evaluation                    20% (combined weight
     Assignments       Assigned during important stages of the course to    of Class Exercises,
                       apply and practice the learnt concepts               Assignments, Project,
     Project and       One group project                                    and Quizzes)
     presentation
     Quizzes           In-class (un)announced 15-minute tests
     Mid-Term Exam     A single 90-minute exam from the material            20%
                       covered during the first 6-7 weeks
     Final Exam        Will cover the entire course. At least 75% of the    60%
                       material will be post-mid-term.
Late Submission Policy: The late penalty is 10% per day, for a maximum of 2 days.
Let's Start!
WHAT IS DATA MINING AND WHY DO WE NEED IT?
  *Slides edited from Han and Kamber’s online lecture slides
Think deeply about this world of data
What is data?
What is a database?
What is a data warehouse?
What is Big Data? (3 V's: volume, velocity, variety)
What is information?
What is knowledge?
Why do we need knowledge?
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
 ◦ Data collection and data availability
   ◦ Automated data collection tools, database systems, Web,
     computerized society
 ◦ Major sources of abundant data
   ◦ Business: Web, e-commerce, transactions, stocks, …
   ◦ Science: Remote sensing, bioinformatics, scientific simulation, …
   ◦ Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets
What is Data Mining?
Definition
Data mining (knowledge discovery from data)
 ◦ Extraction of interesting (non-trivial, implicit, previously
   unknown and potentially useful) patterns or knowledge from
   huge amounts of data.
 ◦ Process of automatically or semi-automatically analyzing large
   databases to find patterns that are:
  ◦ valid: hold on new data with some certainty
  ◦ novel: non-obvious to the system
  ◦ useful: it should be possible to act on the pattern
  ◦ understandable: humans should be able to interpret the pattern
What Is Data Mining?
Alternative names
 ◦ Knowledge discovery (mining) in databases (KDD), knowledge
   extraction, data/pattern analysis, data archeology, data dredging,
   information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”? No. The following are not:
 ◦ Simple search and query processing
 ◦ (Deductive) expert systems
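To make the distinction concrete, here is a minimal sketch that contrasts a simple query with a tiny pattern search. The query only retrieves what we already know how to ask for; the pattern search surfaces item pairs nobody asked about. The transaction data and the helper names (`query_contains`, `frequent_pairs`) are invented for illustration and are not part of any library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data, made up for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]

def query_contains(data, item):
    """Simple search: return the transactions that contain a known item."""
    return [t for t in data if item in t]

def frequent_pairs(data, min_count):
    """Pattern discovery: count every item pair and keep the frequent ones."""
    counts = Counter()
    for t in data:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

beer_rows = query_contains(transactions, "beer")
pairs = frequent_pairs(transactions, min_count=3)
```

With this toy data, the query returns the three transactions that mention beer, while the pair search reports that beer and diapers occur together in three transactions, which is a pattern rather than a lookup.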
Key Properties of data mining
A. Automatic discovery of patterns
B. Prediction of likely outcomes
C. Creation of actionable information
D. Focus on large datasets and databases
Data Mining General Process
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing
communities.
Data mining plays an essential role in the knowledge discovery process.
Steps shown in the figure, from bottom to top:
 ◦ Databases
 ◦ Data Cleaning and Data Integration
 ◦ Data Warehouse
 ◦ Selection
 ◦ Task-relevant Data
 ◦ Data Mining
 ◦ Pattern Evaluation
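The KDD steps can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the records, the field names, and every function name (`clean`, `integrate`, `select`, `mine`, `evaluate`) are hypothetical, and the "mining" step is a plain per-item total rather than a real mining algorithm.

```python
# Hypothetical raw records from two "databases"; None marks a missing value.
sales_db = [
    {"customer": "a", "item": "pen", "amount": 2.0},
    {"customer": "b", "item": "ink", "amount": None},
]
web_db = [
    {"customer": "a", "item": "ink", "amount": 5.0},
    {"customer": "c", "item": "pen", "amount": 1.0},
]

def clean(records):
    """Data cleaning: drop records that have a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge several sources into one warehouse list."""
    return [r for source in sources for r in source]

def select(warehouse, fields):
    """Selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in warehouse]

def mine(data):
    """Data mining: here, just the total amount per item as a toy 'pattern'."""
    totals = {}
    for r in data:
        totals[r["item"]] = totals.get(r["item"], 0.0) + r["amount"]
    return totals

def evaluate(patterns, min_total):
    """Pattern evaluation: keep only patterns above an interest threshold."""
    return {k: v for k, v in patterns.items() if v >= min_total}

warehouse = integrate(clean(sales_db), clean(web_db))
task_data = select(warehouse, ["item", "amount"])
knowledge = evaluate(mine(task_data), min_total=5.0)
```

Each function corresponds to one step of the process, so swapping in a real cleaning rule or mining algorithm changes one function without touching the rest of the chain.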
Architecture of Data Mining
Data Mining in Business Intelligence
Layers, from top (greatest potential to support business decisions) to bottom:
 ◦ Decision Making (End User)
 ◦ Data Presentation: Visualization Techniques (Business Analyst)
 ◦ Data Mining: Information Discovery (Data Analyst)
 ◦ Data Exploration: Statistical Summary, Querying, and Reporting
 ◦ Data Preprocessing/Integration, Data Warehouses (DBA)
 ◦ Data Sources: Paper, Files, Web documents, Scientific experiments,
   Database Systems
KDD Process: A Typical View from ML and Statistics
Input Data → Data Pre-Processing → Data Mining → Post-Processing
 ◦ Data Pre-Processing: data integration, normalization, feature
   selection, dimension reduction
 ◦ Data Mining: pattern discovery, association & correlation,
   classification, clustering, outlier analysis, …
 ◦ Post-Processing: pattern evaluation, pattern selection, pattern
   interpretation, pattern visualization
This is a view from typical machine learning and statistics communities.
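Two of the steps named in this view, normalization and outlier analysis, are small enough to sketch directly. The version below assumes min-max scaling for normalization and a z-score threshold for outliers; both are common choices, not the only ones, and the data and function names are hypothetical.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score_outliers(values, threshold):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

ages = [20, 30, 40, 50, 60]            # hypothetical input data
incomes = [10, 12, 11, 13, 12, 100]    # hypothetical, with one extreme value

normalized_ages = min_max_normalize(ages)
outliers = z_score_outliers(incomes, threshold=2.0)
```

Here the ages map onto evenly spaced values between 0 and 1, and only the income of 100 is flagged. On very small samples the z-score of a single extreme value is bounded, so the threshold has to be chosen with the sample size in mind.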
Classification of Data Mining Systems
Data mining draws on several disciplines, and a data mining system can
be classified according to the disciplines it involves:
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization
• Other Disciplines
Some Other Classification Criteria:
• Classification according to the kind of databases mined: relational,
  transactional, object-relational, or data warehouse mining system.
• Classification according to the kind of knowledge mined: Characterization,
  Discrimination, Association and Correlation Analysis, Classification, Prediction,
  Clustering, Outlier Analysis, Evolution Analysis.
• Classification according to the kinds of techniques utilized: the degree
  of user interaction involved or the methods of analysis employed.
• Classification according to the applications adapted: Finance,
  Telecommunications, DNA, Stock Markets, E-mail.
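As an illustration of one kind of knowledge listed above, association and correlation analysis, the sketch below computes the three standard measures for a rule such as bread → milk: support, confidence, and lift. The transactions are invented for this example and the function names are hypothetical.

```python
# Hypothetical transactions, invented for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, data):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in data if itemset <= t) / len(data)

def confidence(antecedent, consequent, data):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, data) / support(antecedent, data)

def lift(antecedent, consequent, data):
    """Confidence relative to the consequent's baseline frequency.

    Values near 1 suggest the two sides are roughly independent.
    """
    return confidence(antecedent, consequent, data) / support(consequent, data)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
rule_lift = lift({"bread"}, {"milk"}, transactions)
```

On this data the rule has support 0.6 and confidence 0.75, but its lift is slightly below 1 because milk is already in 80% of the transactions. A rule can therefore look strong by confidence alone and still carry little information, which is why correlation measures are used alongside it.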
Major Issues In Data Mining:
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad hoc data mining
• Presentation and visualization of data mining results
• Handling noisy or incomplete data
• Pattern evaluation
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, and incremental mining algorithms
Data warehouse
A data warehouse is a subject-oriented, integrated,
time-variant and non-volatile collection of data in
support of management's decision making process.
Data warehouse Process
Data Warehouse Models
1. Enterprise warehouse: An enterprise warehouse collects all of the
information about subjects spanning the entire organization. It provides
corporate-wide data integration, usually from one or more operational
systems or external information providers, and is cross-functional in scope.
2. Data mart: A data mart contains a subset of corporate-wide data that is
of value to a specific group of users. The scope is confined to specific
selected subjects. For example, a marketing data mart may confine its
subjects to customer, item, and sales. The data contained in data marts
tend to be summarized.
3. Virtual warehouse: A virtual warehouse is a set of views over
operational databases. For efficient query processing, only some of
the possible summary views may be materialized.
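The difference between a virtual view and a materialized summary can be shown with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `sales` table and its contents are hypothetical, and because SQLite has no native materialized views, the materialized summary is emulated by storing the view's result in an ordinary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table, invented for this example.
    CREATE TABLE sales (region TEXT, item TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', 'pen', 2.0),
        ('north', 'ink', 5.0),
        ('south', 'pen', 3.0);

    -- Virtual warehouse: a view, recomputed from operational data on each query.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

    -- Emulated materialized summary: the view's result stored as a table.
    CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region;
""")

# A new operational row shows up in the view but not in the stored copy.
conn.execute("INSERT INTO sales VALUES ('south', 'ink', 4.0)")
view_rows = dict(conn.execute("SELECT region, total FROM sales_by_region"))
stored_rows = dict(conn.execute("SELECT region, total FROM sales_by_region_mat"))
```

The view always reflects the current operational data at the cost of recomputing the aggregate on every query, while the stored copy answers quickly but goes stale until it is refreshed. That trade-off is why only some summary views are materialized.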
Metadata Repository
•   Metadata are data about data. In a data warehouse, metadata
    are the data that define warehouse objects.
Must contain:
•   Data names and definitions, timestamps and the source of the
    extracted data, missing fields, the warehouse schema, views,
    dimensions, hierarchies, and the algorithms used for summarization
•   Operational metadata, which include data lineage (history of
    migrated data and the sequence of transformations applied
    to it), currency of data (active, archived, or purged), and
    monitoring information (warehouse usage statistics, error
    reports, and audit trails).
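One way to picture such a repository entry is as a small record type. The sketch below is hypothetical throughout: the class name `WarehouseObjectMetadata`, its fields, and the sample values are assumptions chosen to mirror the items listed above, not a schema taken from any warehouse product.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class WarehouseObjectMetadata:
    """Hypothetical metadata record for one warehouse object."""
    name: str
    definition: str
    source: str                # where the data was extracted from
    extracted_at: str          # timestamp, kept as an ISO-8601 string
    currency: str = "active"   # active, archived, or purged
    lineage: List[str] = field(default_factory=list)  # transformations, in order

    def record_transformation(self, step: str) -> None:
        """Append one transformation to the data lineage."""
        self.lineage.append(step)

meta = WarehouseObjectMetadata(
    name="sales_fact",
    definition="One row per sold item",
    source="orders_db",
    extracted_at="2012-01-01T00:00:00",
)
meta.record_transformation("dropped rows with missing amount")
meta.record_transformation("aggregated by day")
meta.currency = "archived"
```

Keeping the lineage as an ordered list preserves the sequence of transformations, which is what makes it possible to trace a summarized value back to its source.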