Intro 2

The document outlines the data mining process, emphasizing that it is an iterative and planned approach rather than a random application of tools. It details the steps involved, including problem formulation, data collection, preprocessing, model estimation, and interpretation. Additionally, it introduces the CRISP-DM framework, which standardizes the data mining lifecycle into six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Uploaded by

islaamam55

CIT- 652. DATA MINING.

COURSE INSTRUCTOR : Sheza Naeem

Lecture#2

Data mining process:

Data mining is the process of discovering models, summaries, and derived values from a given
collection of data.
The word "process" is very important here. Even in some professional environments, there is a belief that
data mining simply consists of picking and applying a computer-based tool to match the presented problem and
automatically obtaining a solution. This is a misconception based on an artificial idealization of the world. There are
several reasons why it is incorrect. First, data mining is not just a collection of isolated tools. Second, the
misconception lies in the notion of matching a problem to a technique: it rarely happens that a problem matches a
single technique exactly. In fact, data mining is an iterative process. We examine the problem, decide to apply some
tools and techniques, and modify them when needed. If the result still does not fit, we have to go back to the
beginning and restart the process.
Data mining is not a random application of statistical, machine learning, or other tools. It is not a
random walk through the space of analytic techniques, but a carefully planned and considered process of deciding
what will be most useful, promising, and revealing.
The general experimental procedure adapted to data-mining problems involves the following steps:
1. State the problem and formulate the hypothesis.
2. Collect the data.
3. Preprocess the data.
4. Estimate the model.
5. Interpret the model and draw conclusions.

1. State the problem and formulate the hypothesis:

In this step, a modeler (expert) usually specifies a set of variables for the unknown
dependency and, if possible, a general form of this dependency as an initial hypothesis. Several hypotheses
may be formulated for one problem at this stage. This first step requires the combined expertise
of an application domain and a data-mining model. In practice, it usually means close interaction
between the data-mining expert and the application expert. In successful data-mining applications, this cooperation
does not stop after the initial phase. It continues during the whole data-mining process.
2. Collect data:
This step is concerned with how the data is generated and collected. Generally, there are two
distinct possibilities. The first is when the data-generation process is under the control of an expert (modeler).
This is known as the designed approach. The second possibility is when the expert cannot influence
the data-generation process. This is known as the observational approach. An observational setting,
namely, random data generation, is assumed in most data-mining applications. It is important to make sure
that the data used for estimating a model and the data used later for testing and applying the model
come from the same, unknown, sampling distribution. If this is not the case, the estimated model cannot
be successfully used in a final application of the results.
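
To illustrate why the estimation and test data should come from the same sampling distribution, here is a minimal Python sketch. The data, the helper names random_split and mean_shift, and the 25% test fraction are all hypothetical choices made for illustration. The "shift" score is only a crude comparison of means, not a formal statistical test.

```python
import random
import statistics


def random_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records, then cut it into (estimation, test) parts."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_shift(estimation, test):
    """Gap between the two means, in units of the estimation-set standard deviation."""
    gap = abs(statistics.mean(estimation) - statistics.mean(test))
    return gap / statistics.stdev(estimation)


# Hypothetical observations whose level drifts upward over time.
rng = random.Random(42)
values = [0.1 * i + rng.gauss(0.0, 1.0) for i in range(400)]

# Splitting in time order: the test part comes from a shifted distribution.
ordered_shift = mean_shift(values[:300], values[300:])

# Splitting at random: both parts are drawn from the same pool of records.
estimation, test = random_split(values)
random_shift = mean_shift(estimation, test)

print(f"ordered split shift: {ordered_shift:.2f}")
print(f"random split shift:  {random_shift:.2f}")
```

With drifting data like this, the ordered split should show a much larger shift than the random split. A model estimated on the first part would then be tested on data it was never representative of.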

3. Data preprocessing:
In the observational setting, data is usually “collected” from existing databases, data warehouses, and
data marts. Data preprocessing usually includes at least two common tasks:

a. Outlier detection (and removal): Outliers are unusual data values that are not consistent with
most observations. Commonly, outliers result from measurement errors, coding and recording
errors, and, sometimes, are natural, abnormal values. Such non-representative samples can
seriously affect the model produced later. There are two strategies for handling outliers:
1. Detect and eventually remove outliers.
2. Develop robust modeling methods that are insensitive to outliers.
b. Scaling, encoding, and selecting features: Data preprocessing includes several steps such as
variable scaling and different types of encoding. For instance, one feature with the range [0, 1] and
another with the range [100, 1000] will not have the same weight in the applied technique. They
will also influence the final data-mining results differently. Therefore, it is
recommended to scale them and bring both features to the same weight for further
analysis. Also, application-specific encoding methods usually achieve dimensionality reduction
by providing a smaller number of informative features for subsequent data modeling.
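
The two preprocessing tasks above can be sketched in a few lines of Python. This is only one possible illustration. The interquartile-range (IQR) rule with k = 1.5 is a common convention assumed here, not a method named in these notes. The function names and sample values are hypothetical.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule of thumb)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical measurements with one recording error (95).
readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
cleaned = remove_outliers_iqr(readings)

# Two hypothetical features on very different ranges, as in the text.
feature_a = [0.0, 0.25, 1.0]         # already in [0, 1]
feature_b = [100.0, 550.0, 1000.0]   # in [100, 1000]
scaled_b = min_max_scale(feature_b)

print(cleaned)
print(scaled_b)
```

After scaling, both features lie in [0, 1], so neither dominates a distance-based or gradient-based technique simply because of its units.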
4. Estimate the model:
The selection and implementation of the appropriate data-mining technique is the main task in this
phase. This process is not straightforward. Usually, in practice, the implementation is based on several
models, and selecting the best one is an additional task.
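
As a rough illustration of selecting the best of several models, the following Python sketch fits two candidates on estimation data and compares them on held-out validation data. The two candidates (a constant mean model and a least-squares line), the mean-squared-error criterion, and the sample data are all assumptions made for the example, not choices prescribed by these notes.

```python
def fit_mean(xs, ys):
    """Candidate 1: always predict the average of the observed targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate 2: ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_model(candidates, train, valid):
    """Fit every candidate on train, score it on valid, return the best name."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mse(model, *valid)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical data with a roughly linear dependency.
train = ([0, 1, 2, 3, 4, 5], [1.0, 3.1, 4.9, 7.2, 9.0, 11.1])
valid = ([6, 7], [13.0, 15.1])

best, scores = select_model({"mean": fit_mean, "line": fit_line}, train, valid)
print(best, scores)
```

Scoring on data that was not used for fitting keeps the comparison honest. A candidate that merely memorizes the estimation data gains no advantage.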

5. Interpret the model and draw conclusions:

In most cases, data-mining models should help in decision making. Hence, such models need to be interpretable
in order to be useful, because humans are not likely to base their decisions on complex “black-box” models.
Note that the goals of model accuracy and accuracy of interpretation are somewhat contradictory.
Usually, simple models are more interpretable, but they are also less accurate. Modern data-mining
methods are expected to yield highly accurate results using high-dimensional models. The problem of
interpreting these models, also very important, is considered a separate task, with specific techniques to
validate the results.

CRISP-DM conceptual model:

In 1999, several large companies, including automaker Daimler-Benz, insurance provider OHRA,
hardware and software manufacturer NCR Corp., and statistical software maker SPSS Inc., worked to formalize and
standardize an approach to the data mining process. The result of this work was CRISP-DM (Cross-Industry Standard
Process for Data Mining).

As a methodology, it includes descriptions of the typical phases of a project, the tasks involved in
each phase, and an explanation of the relationships between these tasks.

As a process model, CRISP-DM provides an overview of the data mining life cycle.

The life cycle model consists of six phases, with arrows indicating the most important and frequent dependencies
between phases. The sequence of the phases is not strict. In fact, most projects move back and forth between phases as
necessary.

1. Business understanding:
In this step, the goals of the business are set and the important factors that will help in
achieving those goals are identified.
2. Data understanding:
This step collects all of the data and loads it into the tool. The data is
listed with its source, location, how it was acquired, and any issues encountered. The data is then visualized and
queried to check its completeness.
3. Data preparation:
This step involves selecting appropriate data, cleaning it, constructing attributes from the data, and
integrating data from multiple databases.
4. Modeling:
This step covers selecting a data mining technique (such as a decision tree), generating a test design for
evaluating the selected model, building models from the dataset, and assessing the built model with experts
to discuss the results.
5. Evaluation:
This step determines the degree to which the resulting model meets the business
requirements. Evaluation can be done by testing the model on real applications. The model is reviewed for
any mistakes or steps that should be repeated.
6. Deployment:
In this step, a deployment plan is made, and a strategy is formed to monitor and maintain the data mining
model's results and check their usefulness. Final reports are made, and the whole process is reviewed
to check for any mistakes and to see whether any step should be repeated.
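
The six phases and their typical back-and-forth movement can be modeled as a small data structure. The Python sketch below is illustrative only. The arrows in TYPICAL_NEXT follow the commonly reproduced CRISP-DM diagram, which is not included in these notes, so they should be read as an assumption. The names Phase, TYPICAL_NEXT, and is_typical_path are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Most frequent dependencies between phases (assumed from the usual diagram).
# The sequence is not strict, so this describes typical moves, not hard rules.
TYPICAL_NEXT = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def is_typical_path(path):
    """Return True if every consecutive move in the path follows a typical arrow."""
    return all(nxt in TYPICAL_NEXT[cur] for cur, nxt in zip(path, path[1:]))


# A hypothetical project that returns once from modeling to data preparation.
project = [
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.EVALUATION,
    Phase.DEPLOYMENT,
]

print(is_typical_path(project))
print(is_typical_path([Phase.DATA_UNDERSTANDING, Phase.DEPLOYMENT]))
```

Representing the arrows as a mapping makes the iterative nature explicit: a path may revisit a phase as often as needed, as long as each individual move is a plausible one.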
