MEKELLE UNIVERSITY-MEKELLE INSTITUTE OF
TECHNOLOGY
DEPARTMENT OF INFORMATION TECHNOLOGY
 DATA MINING AND KNOWLEDGE DISCOVERY
                               Halefom Tekle
                     Friday, February 5, 2021
                        Chapter 1: Definition
 Non-trivial extraction of implicit, previously unknown and
 potentially useful information from data
 Exploration & analysis, by automatic or semi-automatic
 means, of large quantities of data in order to discover
 meaningful patterns
 What is not Data Mining?
    Look up a phone number in a phone directory
    Query a Web search engine for information about “Amazon”
 What is Data Mining?
    Certain names are more prevalent in certain US locations (O’Brien,
     O’Rourke, O’Reilly… in the Boston area)
    Group together similar documents returned by a search engine
     according to their context (e.g. Amazon rainforest, Amazon.com)
 Data mining is a technique for discovering interesting
 patterns from data
 Data mining is also known as knowledge discovery from data (KDD).
 It is a multi-disciplinary field involving
   Machine learning
   Statistics
   Databases
   Artificial intelligence
   Information retrieval, and
   Visualization
    1.1 Why Data Mining? Commercial view
 We live in a world where vast amounts of data are
 collected daily.
 Lots of data is being collected and warehoused
   Web data, e-commerce
   purchases at department/grocery stores
   Bank/Credit Card transactions
 Computers have become cheaper and more powerful
 Competitive Pressure is Strong
   Provide better, customized services for an edge (e.g. in Customer
    Relationship Management)
                       1.3 Motivation
 There is often information “hidden” in the data that is
 not readily evident
 Human analysts may take weeks to discover useful information
 Much of the data is never analyzed at all
 1.4 Data Mining as the Evolution of Information
                   Technology
 Data mining can be viewed as a result of the natural evolution of
 information technology.
 The main stages of this evolution are
    Data collection and database creation
    Database management systems
    Advanced database systems
    Advanced data analysis
 The early development of data collection and database creation
 mechanisms served as a prerequisite for the later development of
 effective mechanisms for data storage and retrieval, as well as query
 and transaction processing.
 Nowadays numerous database systems offer query and transaction
 processing as common practice.
 Advanced data analysis has naturally become the next step.
 This means the world is data rich but information poor.
 So we need tools that extract the valuable knowledge
 embedded in vast amounts of data to support decision
 makers’ intuition.
Data mining
 Is the process of discovering interesting patterns and
 knowledge from large amounts of data.
 Many people treat data mining as a synonym for another
 popularly used term, knowledge discovery from data, or
 KDD, while others view data mining as merely an
 essential step in the process of knowledge discovery.
The data sources can include databases, data warehouses,
 the Web, other information repositories, or data that are
 streamed into the system dynamically.
 The knowledge discovery process is an iterative sequence of steps.
 Pre-processing:
     The raw data is usually not suitable for mining, for example
     because it is noisy or inconsistent, so it must be prepared first.
 Data mining:
    The processed data is then fed to a data mining
     algorithm which will produce patterns or knowledge.
 Post-processing:
    In many applications, not all discovered patterns are
     useful. This step identifies those useful ones for
     applications. Various evaluation and visualization
     techniques are used to make the decision.
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: where multiple data sources may be combined
3. Data selection: where data relevant to the analysis task are
  retrieved from the database
4. Data transformation: where data are transformed and consolidated
  into forms appropriate for mining by performing summary or
  aggregation operations
5. Data mining: an essential process where intelligent methods are
  applied to extract data patterns
6. Pattern evaluation: to identify the truly interesting patterns
  representing knowledge based on interestingness measures
7. Knowledge presentation: where visualization and knowledge
  representation techniques are used to present mined knowledge to
  users
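
The seven steps above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the two record lists, the field names, and the "top-selling item" pattern are made up for the example and do not come from a real system.

```python
from collections import Counter

# Hypothetical raw records from two sources (all values are invented).
store_a = [
    {"cust": "c1", "item": "milk", "amount": "2.50"},
    {"cust": "c1", "item": "bread", "amount": "1.20"},
    {"cust": "c2", "item": "milk", "amount": None},     # missing value (noise)
]
store_b = [
    {"cust": "c3", "item": "Milk ", "amount": "2.40"},  # inconsistent formatting
    {"cust": "c3", "item": "bread", "amount": "1.30"},
]

def clean(records):
    """Step 1: drop records with a missing amount and normalize item names."""
    return [
        {**r, "item": r["item"].strip().lower()}
        for r in records
        if r["amount"] is not None
    ]

def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3: keep only the fields relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: convert amounts to numbers and aggregate total sales per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += float(r["amount"])
    return totals

def mine(totals):
    """Step 5: a deliberately trivial 'pattern': the item with the highest total."""
    return max(totals.items(), key=lambda kv: kv[1])

def evaluate(pattern, min_total):
    """Step 6: keep the pattern only if it passes an interestingness threshold."""
    return pattern if pattern[1] >= min_total else None

def present(pattern):
    """Step 7: present the result to the user."""
    if pattern is None:
        return "no interesting pattern found"
    return f"top item: {pattern[0]} (total sales {pattern[1]:.2f})"

def run_pipeline():
    records = integrate(clean(store_a), clean(store_b))
    records = select(records, ["item", "amount"])
    totals = transform(records)
    return present(evaluate(mine(totals), min_total=3.0))

print(run_pipeline())
```

In practice the process is iterative: a poor result at step 6 usually sends the analyst back to an earlier step.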
   1.5 What Kinds of Data Can Be Mined?
 Data mining can be applied to any kind of data as long as the data
 are meaningful for a target application.
 The most basic forms of data for mining applications are
   Database data
   Data warehouse data
   Transactional data
 Data mining can also be applied to other forms of data
    data streams
    ordered/sequence data
    graph or networked data
    text data
    multimedia data (audio, video, image)
    the World Wide Web (WWW)
1.5.1 Database data
 Consider a relational database for AllElectronics.
Customer: (cust_ID, name, address, age, occupation,
  annual_income, credit_information, category, . . .)
Item: (item_ID, brand, category, type, price, place_made,
  supplier, cost, . . . )
Employee: (empl_ID, name, category, group, salary,
  commission, . . . )
Branch: (branch_ID, name, address, . . . )
Purchases: (trans_ID, cust_ID, empl_ID, date, time, method_paid,
  amount)
Items_sold: (trans_ID, item_ID, qty)
Works_at: (empl_ID, branch_ID)
 Database data
      Relational data can be accessed by database queries written in a
       relational query language (e.g., SQL), as supported by systems
       such as PostgreSQL, or
      with the assistance of graphical user interfaces.
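
As an illustration of querying relational data, the sketch below uses Python's built-in sqlite3 module with a cut-down version of the Item and Items_sold tables. The rows, prices, and the function name sales_by_category are assumptions made for the example; only the table and column names follow the AllElectronics schema.

```python
import sqlite3

# In-memory database with a simplified subset of the AllElectronics schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (item_ID TEXT PRIMARY KEY, category TEXT, price REAL);
    CREATE TABLE items_sold (trans_ID TEXT, item_ID TEXT, qty INTEGER);

    -- Invented sample rows, for illustration only.
    INSERT INTO item VALUES
        ('I1', 'computer', 900.0),
        ('I2', 'printer', 150.0),
        ('I3', 'computer', 1200.0);
    INSERT INTO items_sold VALUES
        ('T100', 'I1', 1),
        ('T100', 'I2', 2),
        ('T200', 'I3', 1),
        ('T300', 'I2', 1);
""")

def sales_by_category(connection):
    """Return (category, units sold, revenue) per item category."""
    return connection.execute("""
        SELECT i.category,
               SUM(s.qty)           AS units,
               SUM(s.qty * i.price) AS revenue
        FROM items_sold AS s
        JOIN item AS i ON i.item_ID = s.item_ID
        GROUP BY i.category
        ORDER BY i.category
    """).fetchall()

for category, units, revenue in sales_by_category(conn):
    print(category, units, revenue)
```

A query like this answers a question the analyst already knows how to ask. Data mining goes further by searching for patterns that were not specified in advance.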
 Mining tasks fall into two groups
    Prediction methods
      Use some variables to predict unknown or future values of
       other variables.
      Predict the credit risk of new customers
      Detect deviations, that is, items with sales that are far from
       those expected in comparison with the previous year
    Description methods
      Find human-interpretable patterns that describe the data.
 Predictive
    Classification
    Regression
    Deviation Detection
 Descriptive
    Clustering
    Association Rule Discovery
    Sequential Pattern Discovery
1.5.2 Data warehouse
 A data warehouse is a repository of information collected from
  multiple heterogeneous data sources, organized under a unified
  schema at a single site to facilitate management decision making.
 Data warehouse technology includes data cleaning, data
 integration, and online analytical processing (OLAP).
 OLAP is a set of analysis techniques with functionalities such
 as summarization, consolidation, and aggregation, as well
 as the ability to view information from different angles.
 Although OLAP tools support multidimensional analysis and
 decision making, additional data analysis tools are required
 for in-depth analysis—for example, data mining tools that
 provide data classification, clustering, outlier/anomaly
 detection, and the characterization of changes in data over
 time.
 A data warehouse is usually modeled by a multidimensional
 data structure, called a data cube, in which each dimension
 corresponds to an attribute or a set of attributes in the schema,
 and each cell stores the value of some aggregate measure such
 as count or sum (sales_amount).
 A data cube provides a multidimensional view of data and
 allows the precomputation and fast access of summarized data.
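
A minimal sketch of the data cube idea, assuming three dimensions (time, item, location) and sum(sales_amount) as the aggregate measure. The fact rows and the function name build_cube are invented for illustration. Real OLAP engines use far more efficient storage and computation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (time, item, location, sales_amount).
facts = [
    ("Q1", "computer", "Vancouver", 800),
    ("Q1", "printer", "Vancouver", 200),
    ("Q1", "computer", "Toronto", 600),
    ("Q2", "computer", "Vancouver", 900),
    ("Q2", "printer", "Toronto", 100),
]
DIMENSIONS = ("time", "item", "location")

def build_cube(rows):
    """Precompute sum(sales_amount) for every subset of the dimensions.

    Each subset of dimensions is one cuboid; each cell of a cuboid stores
    the aggregate for one combination of dimension values.
    """
    cube = {}
    for k in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), k):
            cuboid = defaultdict(int)
            for row in rows:
                cell = tuple(row[d] for d in dims)
                cuboid[cell] += row[3]
            cube[tuple(DIMENSIONS[d] for d in dims)] = dict(cuboid)
    return cube

cube = build_cube(facts)

# Fast access to summarized data: each lookup is a dictionary read.
print(cube[()])                      # grand total over all dimensions
print(cube[("item",)])               # total sales per item
print(cube[("time", "location")])    # total sales per quarter and city
```

With three dimensions there are 2^3 = 8 cuboids. Precomputing them trades storage for query speed, which is the point of the data cube.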
 Suppose AllElectronics has a data warehouse.
1.5.3 Transactional Data
 Each record in a transactional database captures a transaction,
  such as a customer’s purchase, a flight booking, or a user’s clicks
  on a web page.
 A transaction typically includes
    a unique transaction identity number (trans_ID) and
    a list of the items making up the transaction, such as the items
     purchased in the transaction.
 A transactional database may have additional tables, which
  contain other information related to the transactions
    such as item description,
    information about the salesperson or the branch, and so on.
1.6 What Kinds of Patterns Can Be Mined?
 There are a number of data mining functionalities. These include
      Characterization and discrimination
      Mining of frequent patterns, associations, and correlations
      Classification and regression
      Clustering analysis
      Outlier analysis
 Data mining functionalities are used to specify the kinds of patterns to
 be found in data mining tasks.
 Such tasks can be classified into two categories:
     Descriptive and
     Predictive.
 Descriptive mining tasks characterize properties of the data in a target
 data set.
 Predictive mining tasks perform induction on the current data in order
 to make predictions.
1.6.1 Class/Concept Description: Characterization and Discrimination
 Data entries can be associated with classes or concepts.
 For example, in the AllElectronics store, classes of items for sale
  include computers and printers, and concepts of customers include
  bigSpenders and budgetSpenders.
 It can be useful to describe individual classes and concepts in
  summarized, concise, and yet precise terms.
 Such descriptions of a class or a concept are called class/concept
  descriptions.
 These descriptions can be derived using
   Data characterization, by summarizing the data of the class under study
    (often called the target class) in general terms
   Data discrimination, by comparison of the target class with one or a set of
    comparative classes (often called the contrasting classes) or
   both data characterization and discrimination.
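
A rough sketch of both ideas, assuming a handful of invented customer records and using the attribute mean as the summary. The record fields and the function names characterize and discriminate are hypothetical.

```python
from statistics import mean

# Invented customer records labeled with the concepts from the example.
customers = [
    {"age": 45, "purchases_per_year": 30, "concept": "bigSpenders"},
    {"age": 50, "purchases_per_year": 26, "concept": "bigSpenders"},
    {"age": 22, "purchases_per_year": 4, "concept": "budgetSpenders"},
    {"age": 28, "purchases_per_year": 6, "concept": "budgetSpenders"},
]

def characterize(rows, concept, attributes):
    """Data characterization: summarize the target class in general terms."""
    members = [r for r in rows if r["concept"] == concept]
    return {a: mean(r[a] for r in members) for a in attributes}

def discriminate(rows, target, contrasting, attributes):
    """Data discrimination: compare the target class with a contrasting class."""
    t = characterize(rows, target, attributes)
    c = characterize(rows, contrasting, attributes)
    return {a: t[a] - c[a] for a in attributes}

attrs = ["age", "purchases_per_year"]
print(characterize(customers, "bigSpenders", attrs))
print(discriminate(customers, "bigSpenders", "budgetSpenders", attrs))
```

Here the summary is a simple average. In practice the output is often a set of generalized rules, charts, or cross-tabulations.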
1.6.2 Mining Frequent Patterns, Associations, and
  Correlations
 Frequent patterns, as the name suggests, are patterns that
  occur frequently in data.
 There are many kinds of frequent patterns
    Frequent itemsets
        a set of items that often appear together in a transactional data set
         (e.g., milk and bread)
    Frequent subsequences (also known as sequential patterns)
        e.g., customers tend to purchase first a laptop, followed by a digital
         camera, and then a memory card
    Frequent substructures
        can refer to different structural forms (e.g., graphs, trees, or lattices) that
         may be combined with itemsets or subsequences.
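
Frequent itemsets can be found, for small data, by simply counting every itemset of a given size. The sketch below does exactly that. It is a brute-force illustration, not an efficient algorithm such as Apriori, and the transactions and the function name frequent_itemsets are invented.

```python
from collections import Counter
from itertools import combinations

# Invented market-basket transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
]

def frequent_itemsets(transactions, size, min_count):
    """Count every itemset of the given size and keep the frequent ones."""
    counts = Counter()
    for t in transactions:
        # Sort so that the same itemset always yields the same tuple.
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}

print(frequent_itemsets(transactions, size=2, min_count=3))
```

Only the pair (bread, milk) appears in at least three of the five transactions, so it is the only frequent 2-itemset here.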
 Mining frequent patterns leads to the discovery of interesting
 associations and correlations within data.
 Association analysis.
    Suppose that, as a marketing manager at AllElectronics, you want to
    know which items are frequently purchased together (i.e., within the
    same transaction).
    buys(X, “computer”) ⇒ buys(X, “software”)
    [support = 1%, confidence = 50%]
        a single-dimensional association rule (one predicate: buys).
    age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”)
    [support = 2%, confidence = 60%]
        a multidimensional association rule (predicates: age, income, buys).
 Typically, association rules are discarded as uninteresting if they
  do not satisfy both a minimum support threshold and a minimum
  confidence threshold.
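
Support and confidence can be computed directly from their definitions: the support of a rule A ⇒ B is the fraction of transactions containing both A and B, and its confidence is support(A and B) / support(A). The sketch below assumes five invented transactions. The function names and thresholds are illustrative.

```python
# Invented transactions for the rule buys(computer) => buys(software).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "printer"},
]

def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

def is_interesting(transactions, antecedent, consequent,
                   min_support, min_confidence):
    """A rule is kept only if it meets both minimum thresholds."""
    s = support(transactions, antecedent | consequent)
    c = confidence(transactions, antecedent, consequent)
    return s >= min_support and c >= min_confidence

a, b = {"computer"}, {"software"}
print(support(transactions, a | b))      # 2 of 5 transactions
print(confidence(transactions, a, b))    # 2 of the 4 computer transactions
print(is_interesting(transactions, a, b, min_support=0.3, min_confidence=0.6))
```

With these numbers the rule has 40% support and 50% confidence, so it is discarded when the minimum confidence is 60%.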
1.6.3 Classification and Regression for Predictive Analysis
 Classification (e.g., naïve Bayesian, SVM, and KNN)
    Is the process of finding a model (or function) that describes and
    distinguishes data classes or concepts.
    The model is derived based on the analysis of a set of training
    data (i.e., data objects for which the class labels are known).
    The model is used to predict the class label of objects for which
    the class label is unknown.
    It predicts categorical (discrete, unordered) labels.
 Regression analysis
   Is a statistical methodology that is most often used for
    numeric prediction.
   It predicts continuous (numeric) values rather than discrete
    class labels.
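
The contrast between the two can be sketched with a k-nearest-neighbor classifier (predicts a label) and a least-squares line fit (predicts a number). Both are minimal illustrations on invented data. The feature choice (age, income in thousands) and the function names are assumptions. Requires Python 3.8+ for math.dist.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """Classification: majority label among the k nearest training objects."""
    nearest = sorted(training, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented training data: (age, income in thousands) -> credit risk label.
training = [
    ((25, 30), "risky"), ((30, 35), "risky"), ((28, 32), "risky"),
    ((45, 90), "safe"), ((50, 100), "safe"), ((48, 95), "safe"),
]
print(knn_predict(training, (47, 92)))   # a categorical label

# Invented numeric data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope * 5 + intercept)             # a continuous value
```

Both models are built from training data with known outcomes. They differ only in the kind of value they predict.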
1.6.4 Cluster Analysis
 Classification and regression analyze class-labeled (training)
  data sets, whereas clustering analyzes data objects without
  consulting class labels.
 In many cases, class-labeled data may simply not exist at the
  beginning.
 Clustering can be used to generate class labels for a group of
  data.
 The objects are clustered or grouped based on the principle of
  maximizing the intraclass similarity and minimizing the
  interclass similarity.
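
One common way to apply this principle is k-means, sketched below. This is an assumption for illustration, since the slides do not name a specific clustering algorithm. The points and the fixed starting centroids are invented, and Python 3.8+ is needed for math.dist.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid          # keep an empty cluster's centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Invented unlabeled points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])

# The cluster index can now serve as a generated class label.
for label, cluster in enumerate(clusters):
    print(label, cluster)
```

No class labels were supplied. The grouping comes only from the distances between the objects.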
1.6.5 Outlier Analysis
 A data set may contain objects that do not comply with the
  general behavior or model of the data.
 These data objects are outliers.
 Many data mining methods discard outliers as noise or
  exceptions.
 However, in some applications (e.g., fraud detection) the rare
  events can be more interesting than the more regularly
  occurring ones
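
A simple statistical way to flag such objects is the z-score: how many standard deviations a value lies from the mean. The sketch below assumes invented transaction amounts and a threshold of 2. Both are illustrative choices, and other outlier tests exist.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)          # population standard deviation
    if sigma == 0:
        return []                   # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Invented card transaction amounts; one is far from the usual behavior.
amounts = [100, 102, 98, 101, 99, 500]
print(find_outliers(amounts))
```

In fraud detection the flagged value is the interesting result, not noise to be thrown away.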
1.7 Which Technologies Are Used?
 A statistical model
       Is a set of mathematical functions that describe the behavior of the
        objects in a target class in terms of random variables and their
        associated probability distributions.
 Machine Learning
     Machine learning investigates how computers can learn (or improve
      their performance) based on data.
     A main research area is for computer programs to automatically learn
      to recognize complex patterns and make intelligent decisions based on
      data.
     Learning methods
           Supervised
           Unsupervised
           Semi-supervised
           Reinforcement
    Which Kinds of Applications Are Targeted?
 Business Intelligence
    Helps an organization understand its commercial context
        customers, the market, supply and resources, and
        competitors
    Provides historical, current, and predictive views of business
     operations
 Web Search Engines
    Have to deal with
        a huge and ever-growing amount of data
        online data
        queries that are asked only a very small number of times
 Bioinformatics and health informatics
 Finance, digital libraries, and digital governments.
            1.8 Major Issues in Data Mining
 Mining Methodology
     Mining various and new kinds of knowledge
     Mining knowledge in multidimensional space
     Data mining—an interdisciplinary effort
     Boosting the power of discovery in a networked environment
 User Interaction
     Interactive mining
     Incorporation of background knowledge
     Ad hoc data mining and data mining query languages
     Presentation and visualization of data mining results
 Efficiency and Scalability
     Efficiency, scalability, performance, optimization, ability to execute in real time
     Parallel, distributed, and incremental mining algorithms
 Diversity of Database Types
     Handling complex types of data
     Mining dynamic, networked, and global data repositories
 Data Mining and Society
     Social impacts of data mining
     Privacy-preserving data mining
     Invisible data mining
                       Exercises
 How is a data warehouse different from a database? How are
 they similar?
 What are the major challenges of mining a huge amount of
 data (e.g., billions of tuples) in comparison with mining a
 small amount of data (e.g., a data set of a few hundred tuples)?
 Define each of the following data mining functionalities:
 characterization, discrimination, association and correlation
 analysis, classification, regression, clustering, and outlier
 analysis. Give examples of each data mining functionality,
 using a real-life database that you are familiar with.