Unit-I Data Mining
Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer: data mining would have been more appropriately named
knowledge mining, which emphasizes mining knowledge from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store
scanner data — and mining a mountain for a vein of valuable ore. Both processes require either
sifting through an immense amount of material, or intelligently probing it to find exactly where
the value resides. Given databases of sufficient size and quality, data mining technology can
generate new business opportunities by providing these capabilities:
Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-
on analysis can now be answered directly from the data — quickly. A typical example of a
predictive problem is targeted marketing. Data mining uses data on past promotional mailings to
identify the targets most likely to maximize return on investment in future mailings. Other
predictive problems include forecasting bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is
the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.
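The retail example above can be illustrated with a minimal Python sketch that counts how often two products appear in the same basket. The transactions and the function name count_pairs are invented for illustration; a production system would use a dedicated association-rule algorithm rather than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data: each set is one customer's purchase.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "diapers", "beer"},
    {"bread", "butter", "beer"},
]

def count_pairs(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a canonical (alphabetical) order
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

pair_counts = count_pairs(transactions)
most_common_pair, support = pair_counts.most_common(1)[0]
print(most_common_pair, support)
```

Pairs with a high count are candidates for "often purchased together"; whether a count is high enough to be interesting depends on a threshold chosen by the analyst.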
       Classification – the task of generalizing known structure to apply to new data. For
       example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
       "spam".
       Regression – attempts to find a function that models the data with the least error.
       Summarization – provides a more compact representation of the data set, including
       visualization and report generation.
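As a small illustration of the regression task, the sketch below fits a straight line by ordinary least squares, one common meaning of "least error". The function name fit_line and the data points are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With noisy data the fitted line would not pass through every point; it would be the line that minimizes the sum of squared vertical distances.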
A typical data mining system may have the following major components.
1. Knowledge Base:
   This is the domain knowledge that is used to guide the search or evaluate the
   interestingness of resulting patterns. Such knowledge can include concept hierarchies,
   used to organize attributes or attribute values into different levels of abstraction.
   Knowledge such as user beliefs, which can be used to assess a pattern’s
   interestingness based on its unexpectedness, may also be included. Other examples of
   domain knowledge are additional interestingness constraints or thresholds, and
   metadata (e.g., describing data from multiple heterogeneous sources).
2. Data Mining Engine:
   This is essential to the data mining system and ideally consists of a set of functional
   modules for tasks such as characterization, association and correlation analysis,
   classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
3. Pattern Evaluation Module:
   This component typically employs interestingness measures and interacts with the data
   mining modules so as to focus the search toward interesting patterns. It may use
   interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
   evaluation module may be integrated with the mining module, depending on the
   implementation of the data mining method used. For efficient data mining, it is highly
   recommended to push the evaluation of pattern interestingness as deep as possible into
   the mining process so as to confine the search to only the interesting patterns.
4. User interface:
   This module communicates between users and the data mining system, allowing the
   user to interact with the system by specifying a data mining query or task, providing
   information to help focus the search, and performing exploratory data mining based on
   the intermediate data mining results. In addition, this component allows the user to
   browse database and data warehouse schemas or data structures, evaluate mined
   patterns, and visualize the patterns in different forms.
Data Mining Process:
Data Mining is a process of discovering various models, summaries, and derived values from a
given collection of data.
The general experimental procedure adapted to data-mining problems involves the following
steps:
1. State the problem and formulate the hypothesis
2. Collect the data
This step is concerned with how the data are generated and collected. In general, there are
two distinct possibilities. The first is when the data-generation process is under the
control of an expert (modeler): this approach is known as a designed experiment. The
second possibility is when the expert cannot influence the data-generation process: this is
known as the observational approach. An observational setting, namely, random data
generation, is assumed in most data-mining applications. Typically, the sampling
distribution is completely unknown after data are collected, or it is partially and implicitly
given in the data-collection procedure. It is very important, however, to understand how
data collection affects its theoretical distribution, since such a priori knowledge can be
very useful for modeling and, later, for the final interpretation of results. Also, it is
important to make sure that the data used for estimating a model and the data used later
for testing and applying a model come from the same, unknown, sampling distribution. If
this is not the case, the estimated model cannot be successfully used in a final application
of the results.
3. Preprocess the data
In the observational setting, data are usually "collected" from existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common
tasks:
   1. Outlier detection (and removal) – Outliers are unusual data values that are not
      consistent with most observations. Commonly, outliers result from measurement
      errors or coding and recording errors; sometimes they are natural, abnormal values.
      Such nonrepresentative samples can seriously affect the model produced later. There
      are two strategies for dealing with outliers:
   2. Scaling, encoding, and selecting features – Data preprocessing includes several steps
      such as variable scaling and different types of encoding. For example, one feature with
      the range [0, 1] and another with the range [-100, 1000] will not have the same weight
      in the applied technique; they will also influence the final data-mining results differently.
      Therefore, it is recommended to scale them and bring both features to the same weight
      for further analysis. Also, application-specific encoding methods usually achieve
      dimensionality reduction by providing a smaller number of informative features for
      subsequent data modeling.
These two classes of preprocessing tasks are only illustrative examples of a large
spectrum of preprocessing activities in a data-mining process.
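The two preprocessing tasks can be sketched in a few lines of Python. This is only one possible approach: the z-score rule, the cutoff of 2.0, the function names, and the sample readings are all assumptions chosen for illustration, and a z-score test is crude on very small samples.

```python
from statistics import mean, pstdev

def remove_outliers(values, max_z=2.0):
    """Drop values whose z-score magnitude exceeds max_z (an assumed cutoff)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)          # all values identical: nothing to remove
    return [v for v in values if abs(v - mu) / sigma <= max_z]

def min_max_scale(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical sensor readings; 500 looks like a recording error.
readings = [10, 12, 11, 13, 12, 11, 10, 500]
cleaned = remove_outliers(readings)
scaled = min_max_scale(cleaned)
print(cleaned)
print(scaled)
```

Scaling after outlier removal matters here: if 500 were kept, min-max scaling would squeeze all ordinary readings into a tiny interval near 0.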
Data-preprocessing steps should not be considered completely independent from other
data-mining phases. In every iteration of the data-mining process, all activities, together,
could define new and improved data sets for subsequent iterations. Generally, a good
preprocessing method provides an optimal representation for a data-mining technique by
incorporating a priori knowledge in the form of application-specific scaling and
encoding.
4. Estimate the model
The selection and implementation of the appropriate data-mining technique is the main
task in this phase. This process is not straightforward; usually, in practice, the
implementation is based on several models, and selecting the best one is an additional
task. The basic principles of learning and discovery from data are given in Chapter 4 of
this book. Later, Chapters 5 through 13 explain and analyze specific techniques that are
applied to perform a successful learning process from data and to develop an appropriate
model.
5. Interpret the model and draw conclusions
In most cases, data-mining models should help in decision making. Hence, such models
need to be interpretable in order to be useful, because humans are not likely to base their
decisions on complex "black-box" models. Note that the goals of accuracy of the model
and accuracy of its interpretation are somewhat contradictory. Usually, simple models are
more interpretable, but they are also less accurate. Modern data-mining methods are
expected to yield highly accurate results using high-dimensional models. The problem of
interpreting these models, also very important, is considered a separate task, with specific
techniques to validate the results. A user does not want hundreds of pages of numeric
results. The user does not understand them and cannot summarize, interpret, and use them for
successful decision making.
The data mining system can be classified according to the following criteria:
       Database Technology
       Statistics
       Machine Learning
       Information Science
       Visualization
       Other Disciplines
Some Other Classification Criteria:
We can classify a data mining system according to the kind of databases mined. Database systems
can be classified according to different criteria such as data models, types of data, etc., and the
data mining system can be classified accordingly. For example, if we classify the database
according to data model, then we may have a relational, transactional, object-relational, or data
warehouse mining system.
We can classify a data mining system according to the kind of knowledge mined. This means data
mining systems are classified on the basis of functionalities such as:
       Characterization
       Discrimination
       Association and Correlation Analysis
       Classification
       Prediction
       Clustering
       Outlier Analysis
       Evolution Analysis
Classification according to kinds of techniques utilized
We can classify a data mining system according to the kind of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.
We can classify a data mining system according to the application it is adapted to. These applications are
as follows:
       Finance
       Telecommunications
       DNA
       Stock Markets
       E-mail
Interactive mining of knowledge at multiple levels of abstraction. - The data mining process
needs to be interactive so that users can focus the search for patterns, providing and
refining data mining requests based on returned results.
Presentation and visualization of data mining results. - Once patterns are discovered, they
need to be expressed in high-level languages and visual representations. These representations should
be easily understandable by the users.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle
noise and incomplete objects while mining the data regularities. Without data cleaning methods,
the accuracy of the discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of the discovered patterns. Patterns need
to be evaluated because many of them represent common knowledge or lack novelty, and are
therefore uninteresting.
Efficiency and scalability of data mining algorithms. - In order to effectively extract
information from the huge amount of data in databases, data mining algorithms must be efficient
and scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the
data into partitions, which are then processed in parallel. The results from the partitions are then
merged. Incremental algorithms incorporate database updates without having to mine the data again
from scratch.
Knowledge Discovery in Databases (KDD)
Some people treat data mining as a synonym for knowledge discovery, while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in
the knowledge discovery process:
       Data Cleaning - In this step, noise and inconsistent data are removed.
       Data Integration - In this step, multiple data sources are combined.
       Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
       Data Transformation - In this step, data are transformed or consolidated into forms
       appropriate for mining by performing summary or aggregation operations.
       Data Mining - In this step, intelligent methods are applied in order to extract data
       patterns.
       Pattern Evaluation - In this step, data patterns are evaluated.
       Knowledge Presentation - In this step, knowledge is represented.
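The steps listed above can be sketched as a chain of small Python functions. Everything here is a toy assumption: the two sources, the record layout (customer, region, amount), the function names, and the threshold are invented, and the "mining" step is deliberately trivial.

```python
from collections import defaultdict

# Two hypothetical sources with (customer, region, amount) records.
store_a = [("ann", "north", 120.0), ("bob", "south", None), ("cy", "north", 80.0)]
store_b = [("dee", "south", 200.0), ("eve", "north", 40.0)]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r[2] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """Data selection: keep only the fields relevant to the task."""
    return [(region, amount) for _, region, amount in records]

def transform(pairs):
    """Data transformation: aggregate amounts per region."""
    totals = defaultdict(float)
    for region, amount in pairs:
        totals[region] += amount
    return dict(totals)

def mine(totals):
    """Data mining (trivially simple here): rank regions by total sales."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(ranked, min_total):
    """Pattern evaluation: keep only regions at or above a threshold."""
    return [(region, total) for region, total in ranked if total >= min_total]

data = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(data))), min_total=210.0)
for region, total in patterns:          # knowledge presentation
    print(f"{region}: {total:.2f}")
```

In practice each stage is far more involved, and the process is usually iterative rather than a single pass.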
The following diagram shows the knowledge discovery process:
Architecture of KDD
Data Warehouse:
Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, or 12 months ago, or even older data, from a data warehouse. This contrasts
with a transaction system, where often only the most recent data is kept. For example, a
transaction system may hold the most recent address of a customer, whereas a data warehouse can
hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
     The top-down approach starts with the overall design and planning. It is useful in cases
    where the technology is mature and well known, and where the business problems that must
    be solved are clear and well understood.
     The bottom-up approach starts with experiments and prototypes. This is useful in the early
    stage of business modeling and technology development. It allows an organization to move
    forward at considerably less expense and to evaluate the benefits of the technology before
    making significant commitments.
     In the combined approach, an organization can exploit the planned and strategic nature of
    the top-down approach while retaining the rapid implementation and opportunistic
    application of the bottom-up approach.
The warehouse design process consists of the following steps:
     Choose a business process to model, for example, orders, invoices, shipments, inventory,
    account administration, sales, or the general ledger. If the business process is organizational
    and involves multiple complex object collections, a data warehouse model should be
    followed. However, if the process is departmental and focuses on the analysis of one kind of
    business process, a data mart model should be chosen.
    Choose the grain of the business process. The grain is the fundamental, atomic level of data
    to be represented in the fact table for this process, for example, individual transactions,
    individual daily snapshots, and so on.
     Choose the dimensions that will apply to each fact table record. Typical dimensions are
    time, item, customer, supplier, warehouse, transaction type, and status.
    Choose the measures that will populate each fact table record. Typical measures are numeric
    additive quantities like dollars sold and units sold.
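The design choices above (grain, dimensions, measures) can be made concrete with a small Python sketch of a fact-table row. The class name SalesFact, the field names, and the sample rows are assumptions for illustration and do not correspond to any particular warehouse product.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class SalesFact:
    """One fact-table row at the grain of an individual transaction line."""
    # Dimension keys (hypothetical names)
    date: str
    item: str
    customer: str
    # Measures: numeric and additive
    dollars_sold: float
    units_sold: int

facts = [
    SalesFact("2024-01-05", "pen", "c1", 3.0, 2),
    SalesFact("2024-01-05", "ink", "c2", 7.5, 1),
    SalesFact("2024-01-06", "pen", "c2", 4.5, 3),
]

def total_by(facts, dimension):
    """Sum the additive measures over one dimension."""
    totals = defaultdict(lambda: [0.0, 0])
    for f in facts:
        key = getattr(f, dimension)
        totals[key][0] += f.dollars_sold
        totals[key][1] += f.units_sold
    return {k: tuple(v) for k, v in totals.items()}

by_item = total_by(facts, "item")
print(by_item)
```

Because the measures are additive, the same rows can be summed along any dimension (item, date, customer) without changing the stored data.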
A Three Tier Data Warehouse Architecture:
Tier-1:
       The bottom tier is a warehouse database server that is almost always a relational database
       system. Back-end tools and utilities are used to feed data into the bottom tier from
       operational databases or other external sources (such as customer profile information
       provided by external consultants). These tools and utilities perform data extraction,
       cleaning, and transformation (e.g., to merge similar data from different sources into a
       unified format), as well as load and refresh functions to update the data warehouse. The
       data are extracted using application program interfaces known as gateways. A gateway is
       supported by the underlying DBMS and allows client programs to generate SQL code to
       be executed at a server.
       Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object
       Linking and Embedding, Database) by Microsoft, and JDBC (Java Database
       Connectivity).
       This tier also contains a metadata repository, which stores information about the data
       warehouse and its contents.
Tier-2:
       The middle tier is an OLAP server that is typically implemented using either a relational
       OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
Tier-3:
       The top tier is a front-end client layer, which contains query and reporting tools,
       analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data Warehouse Models:
1. Enterprise warehouse:
       An enterprise warehouse collects all of the information about subjects spanning the entire
       organization.
       It provides corporate-wide data integration, usually from one or more operational systems
       or external information providers, and is cross-functional in scope.
       It typically contains detailed data as well as summarized data, and can range in size from
       a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
       An enterprise data warehouse may be implemented on traditional mainframes, computer
       superservers, or parallel architecture platforms. It requires extensive business modeling
       and may take years to design and build.
2. Data mart:
       A data mart contains a subset of corporate-wide data that is of value to a specific group of
       users. The scope is confined to specific selected subjects. For example, a marketing data
       mart may confine its subjects to customer, item, and sales. The data contained in data
       marts tend to be summarized.
3. Virtual warehouse:
Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for timestamping any extracted data,
the source of the extracted data, and missing fields that have been added by data cleaning or
integration processes.
       A description of the structure of the data warehouse, which includes the warehouse
       schema, views, dimensions, hierarchies, and derived data definitions, as well as data mart
       locations and contents.
       Operational metadata, which include data lineage (history of migrated data and the
       sequence of transformations applied to it), currency of data (active, archived, or purged),
       and monitoring information (warehouse usage statistics, error reports, and audit trails).
       The algorithms used for summarization, which include measure and dimension
       definition algorithms, data on granularity, partitions, subject areas, aggregation,
       summarization, and predefined queries and reports.
       The mapping from the operational environment to the data warehouse, which
       includes source databases and their contents, gateway descriptions, data partitions, data
       extraction, cleaning, transformation rules and defaults, data refresh and purging rules,
       and security (user authorization and access control).
       Data related to system performance, which include indices and profiles that improve data
       access and retrieval performance, in addition to rules for the timing and scheduling of
       refresh, update, and replication cycles.
    Consolidation (Roll-Up)
    Drill-Down
    Slicing and Dicing
    Consolidation involves the aggregation of data that can be accumulated and computed in
    one or more dimensions. For example, all sales offices are rolled up to the sales
    department or sales division to anticipate sales trends.
    The drill-down is a technique that allows users to navigate through the details. For
    instance, users can view the sales by individual products that make up a region’s sales.
    Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data
    of the OLAP cube and view (dicing) the slices from different viewpoints.
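The three operations can be sketched on a tiny in-memory "cube" held in a Python dictionary. The cell layout (region, office, product), the function names, and the numbers are assumptions for illustration; real OLAP servers expose these operations through their own query interfaces.

```python
from collections import defaultdict

# Hypothetical cube cells: (region, office, product) -> sales amount
cells = {
    ("east", "boston", "pen"): 10,
    ("east", "boston", "ink"): 5,
    ("east", "nyc", "pen"): 20,
    ("west", "seattle", "pen"): 8,
    ("west", "seattle", "ink"): 12,
}

def roll_up(cells):
    """Consolidation: aggregate offices and products up to the region level."""
    totals = defaultdict(int)
    for (region, _office, _product), amount in cells.items():
        totals[region] += amount
    return dict(totals)

def drill_down(cells, region):
    """Drill-down: break one region's total into its per-product details."""
    detail = defaultdict(int)
    for (r, _office, product), amount in cells.items():
        if r == region:
            detail[product] += amount
    return dict(detail)

def slice_cube(cells, product):
    """Slicing: fix one dimension (product) and keep the remaining two."""
    return {(r, o): amt for (r, o, p), amt in cells.items() if p == product}

print(roll_up(cells))
print(drill_down(cells, "east"))
print(slice_cube(cells, "pen"))
```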
Types of OLAP:
1. Relational OLAP (ROLAP):
     ROLAP works directly with relational databases. The base data and the dimension
     tables are stored as relational tables, and new tables are created to hold the aggregated
     information. It depends on a specialized schema design.
     This methodology relies on manipulating the data stored in the relational database to
     give the appearance of traditional OLAP's slicing and dicing functionality. In essence,
     each action of slicing and dicing is equivalent to adding a "WHERE" clause in the
     SQL statement.
     ROLAP tools do not use pre-calculated data cubes but instead pose the query to the
     standard relational database and its tables in order to bring back the data required to
     answer the question.
     ROLAP tools feature the ability to ask any question because the methodology is
     not limited to the contents of a cube. ROLAP also has the ability to drill down to the
     lowest level of detail in the database.
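The remark that slicing amounts to adding a WHERE clause can be demonstrated with Python's built-in sqlite3 module. The table name sales, its columns, and the rows are invented for this sketch; SQLite merely stands in for whatever relational database a ROLAP tool would query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "pen", 10.0), ("east", "ink", 5.0),
     ("west", "pen", 8.0), ("west", "ink", 12.0)],
)

# Aggregation is computed from the base table at query time
# (no cube is pre-computed).
all_regions = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Slicing on product = 'pen' is just an extra WHERE clause.
pen_slice = conn.execute(
    "SELECT region, SUM(amount) FROM sales WHERE product = ? "
    "GROUP BY region ORDER BY region",
    ("pen",),
).fetchall()
conn.close()

print(all_regions)
print(pen_slice)
```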
2. Multidimensional OLAP (MOLAP):
MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
     MOLAP stores this data in an optimized multi-dimensional array storage, rather than
     in a relational database. Therefore it requires the pre-computation and storage of
     information in the cube - the operation known as processing.
     MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
     The data cube contains all the possible answers to a given range of questions.
     MOLAP tools have a very fast response time and the ability to quickly write back
     data into the data set.
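The idea of pre-computing a cube can be sketched as follows: every combination of dimensions is aggregated once, so later queries become simple lookups. The dimension names, the rows, and the use of "*" to mean "all values" are assumptions of this sketch, and a plain dictionary is only a stand-in for the optimized array storage a MOLAP server would use.

```python
from itertools import combinations
from collections import defaultdict

DIMENSIONS = ("region", "product")          # hypothetical dimension names
rows = [
    {"region": "east", "product": "pen", "amount": 10},
    {"region": "east", "product": "ink", "amount": 5},
    {"region": "west", "product": "pen", "amount": 8},
]

def build_cube(rows):
    """Pre-compute SUM(amount) for every subset of the dimensions."""
    cube = defaultdict(int)
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            for row in rows:
                # "*" marks a dimension that has been aggregated away
                key = tuple(row[d] if d in dims else "*" for d in DIMENSIONS)
                cube[key] += row["amount"]
    return dict(cube)

cube = build_cube(rows)     # the "processing" step
# Queries are now dictionary lookups rather than scans of the base data.
print(cube[("east", "*")], cube[("*", "pen")], cube[("*", "*")])
```

The trade-off is visible even at this scale: the cube holds more entries than the base data, and it must be rebuilt or updated when the data change.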
3. Hybrid OLAP (HOLAP):
     There is no clear agreement across the industry as to what constitutes Hybrid OLAP,
     except that a database will divide data between relational and specialized storage.
     For example, for some vendors, a HOLAP database will use relational tables to hold
     the larger quantities of detailed data, and use specialized storage for at least some
     aspects of the smaller quantities of more-aggregate or less-detailed data.
     HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the
     capabilities of both approaches.
     HOLAP tools can utilize both pre-calculated cubes and relational data sources.
Data Preprocessing:
Data Integration:
It combines data from multiple sources into a coherent data store, as in data warehousing. These
sources may include multiple databases, data cubes, or flat files.
       How can the data analyst or the computer be sure that customer id in one database and
       customer number in another refer to the same attribute?
2. Redundancy:
For the same real-world entity, attribute values from different sources may differ.
Data Transformation:
In data transformation, the data are transformed or consolidated into forms appropriate for
mining.
       Smoothing, which works to remove noise from the data. Such techniques
       include binning, regression, and clustering.
       Aggregation, where summary or aggregation operations are applied to the data. For
       example, the daily sales data may be aggregated so as to compute monthly and
       annual total amounts. This step is typically used in constructing a data cube for analysis of
       the data at multiple granularities.
       Generalization of the data, where low-level or "primitive" (raw) data are
       replaced by higher-level concepts through the use of concept hierarchies. For
       example, categorical attributes, like street, can be generalized to higher-level
       concepts, like city or country.
       Normalization, where the attribute data are scaled so as to fall within a small
       specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
       Attribute construction (or feature construction), where new attributes are
       constructed and added from the given set of attributes to help the mining process.
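Two of the transformations listed above, smoothing by binning and generalization through a concept hierarchy, can be sketched in Python. The bin size, the price values, the street-to-city mapping, and the function names are all assumptions made for the example; smoothing by bin means is only one of several binning variants.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split into equal-frequency bins, replace each by its bin mean.

    The result is returned in sorted order, not in the original order.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical concept hierarchy: street -> city
STREET_TO_CITY = {
    "Main St": "Springfield",
    "Oak Ave": "Springfield",
    "Elm Rd": "Shelbyville",
}

def generalize(streets):
    """Replace low-level street values with the higher-level city concept."""
    return [STREET_TO_CITY.get(s, "unknown") for s in streets]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
cities = generalize(["Main St", "Elm Rd"])
print(smoothed)
print(cities)
```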
Data Reduction:
Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data. That is,
mining on the reduced data set should be more efficient yet produce the same (or almost the
same) analytical results.
Strategies for data reduction include the following:
       Data cube aggregation, where aggregation operations are applied to the data in
       the construction of a data cube.
       Attribute subset selection, where irrelevant, weakly relevant, or redundant
       attributes or dimensions may be detected and removed.
       Dimensionality reduction, where encoding mechanisms are used to reduce the
       data set size.
       Numerosity reduction, where the data are replaced or estimated by alternative,
       smaller data representations such as parametric models (which need store only the
       model parameters instead of the actual data) or nonparametric methods such as
       clustering, sampling, and the use of histograms.
       Discretization and concept hierarchy generation, where raw data values for
       attributes are replaced by ranges or higher conceptual levels. Data discretization is a
       form of numerosity reduction that is very useful for the automatic generation of
       concept hierarchies. Discretization and concept hierarchy generation are powerful
       tools for data mining, in that they allow the mining of data at multiple levels of
       abstraction.
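Discretization and histogram-based numerosity reduction can be illustrated with a short Python sketch. Equal-width binning is only one discretization method, and the ages, the number of bins, the concept labels, and the function name are assumptions chosen for the example.

```python
from collections import Counter

def equal_width_bins(values, n_bins):
    """Map each value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # min(..., n_bins - 1) puts the maximum value in the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 22, 25, 31, 38, 45, 52, 60, 67, 78]
labels = ["young", "middle-aged", "senior"]     # hypothetical concept labels
bins = equal_width_bins(ages, n_bins=3)
concepts = [labels[b] for b in bins]
histogram = Counter(concepts)                    # compact summary of the data
print(concepts)
print(histogram)
```

Ten raw values are reduced to three labelled ranges with counts, which is the kind of smaller representation the strategies above aim for; the cost is that the exact ages can no longer be recovered.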