Definition of data mining:
Data mining is a non-trivial process of extracting interesting, previously
unknown, and potentially useful patterns or knowledge from huge amounts of data.
What Is Data Mining?
Data mining is a collection of technologies, processes and analytical
approaches brought together to discover insights in business data that can be
used to make better decisions.
It combines statistics, artificial intelligence and machine learning to find
patterns, relationships and anomalies in large data sets.
Data mining refers to extracting, or mining, knowledge from large amounts of
data. The term is actually a misnomer: what is mined is knowledge, not the
data itself. Data mining would have been more appropriately named knowledge
mining, a name that emphasizes extracting knowledge from large amounts of data.
An organization can mine its data to improve many aspects of its business.
Data mining can be used to find relationships and patterns in current data and
then apply those to new data to predict future trends or detect anomalies,
such as fraud.
The overall goal of the data mining process is to extract information from a
data set and transform it into an understandable structure for further use.
The key properties of data mining are:
• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large datasets and databases
Data mining parameters:
Data mining parameters include:
▪ Association - looking for patterns where one event is connected to another event
▪ Sequence or path analysis - looking for patterns where one event leads to another
later event
▪ Classification - generalizing known structure to apply to new data. For example,
an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
▪ Clustering - discovering groups and structures in the data that are in some way
"similar", without using known structures in the data.
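As an illustration of the association parameter, the sketch below computes two
commonly used measures, support and confidence, for a rule such as "customers who
buy bread also buy milk". The transactions and the function names are
hypothetical and exist only for this example.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = support(set(antecedent) | set(consequent), transactions)
    base = support(antecedent, transactions)
    return both / base if base else 0.0

# Rule under test: bread -> milk
bread_milk_support = support({"bread", "milk"}, transactions)
bread_to_milk_conf = confidence({"bread"}, {"milk"}, transactions)
print(bread_milk_support, round(bread_to_milk_conf, 3))
```

A rule is usually kept only if both measures exceed thresholds chosen by the
analyst; the thresholds themselves are a judgment call and are not fixed by the
definitions above.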
How Does Data Mining Work?
Data mining works through the concept of predictive modeling.
Data mining is the process of understanding data through cleaning raw data,
finding patterns, creating models, and testing those models.
It includes statistics, machine learning, and database systems.
Suppose an organization wants to achieve a particular result.
By analyzing a dataset where that result is known, data mining techniques
can, for example, build a software model that analyzes new data to predict
the likelihood of similar results. Here’s an overview:
1. Start with historical data
Let’s say a company wants to know the best customer prospects in a
new marketing database. It starts by examining its own customers.
2. Analyze the historical data
Software scans the collected data using a combination of
algorithms from statistics, artificial intelligence and machine
learning, looking for patterns and relationships in the data.
3. Write rules
Once the patterns and relationships are uncovered, the software
expresses them as rules. A rule might be that most customers ages
51 to 65 shop twice a week and fill their baskets with fresh foods,
while customers ages 21 to 50 tend to shop once a week and buy
more packaged food.
4. Apply the rules
Here, the data mining model is applied to a new marketing
database. If the company is a packaged food provider, it will be
looking for 21- to 50-year-olds.
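The four steps can be sketched in a few lines of Python. This is a minimal
illustration, not a real mining algorithm: the customer records, the prospect
list and the helper names (band_of, learn_rule, apply_rule) are all made up for
the example, and the "rule" is simply the age band with the highest share of
packaged-food buyers.

```python
# Step 1: hypothetical historical customer records (illustrative data only).
customers = [
    {"age": 25, "buys_packaged": True},
    {"age": 34, "buys_packaged": True},
    {"age": 47, "buys_packaged": False},
    {"age": 55, "buys_packaged": False},
    {"age": 62, "buys_packaged": False},
    {"age": 58, "buys_packaged": True},
]

# Age bands taken from the example in the text.
BANDS = {"21-50": (21, 50), "51-65": (51, 65)}

def band_of(age):
    """Return the label of the age band containing age, or None."""
    for label, (low, high) in BANDS.items():
        if low <= age <= high:
            return label
    return None

def learn_rule(records):
    """Steps 2 and 3: find the band with the highest packaged-food rate."""
    totals, buyers = {}, {}
    for r in records:
        band = band_of(r["age"])
        if band is None:
            continue
        totals[band] = totals.get(band, 0) + 1
        buyers[band] = buyers.get(band, 0) + int(r["buys_packaged"])
    rates = {band: buyers[band] / totals[band] for band in totals}
    return max(rates, key=rates.get), rates

def apply_rule(target_band, prospects):
    """Step 4: keep prospects whose age falls in the learned band."""
    return [p for p in prospects if band_of(p["age"]) == target_band]

target_band, rates = learn_rule(customers)
prospects = [
    {"name": "A", "age": 30},
    {"name": "B", "age": 60},
    {"name": "C", "age": 45},
]
selected = apply_rule(target_band, prospects)
print(target_band, [p["name"] for p in selected])
```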
KDD Process:
The Knowledge Discovery from Data (KDD) process is a sequence of
the following steps:
Data Cleaning: In this step, noise and inconsistent data are removed.
Data Integration: In this step, multiple data sources are combined.
Together, the data cleaning and data integration steps form the
preprocessing of data.
The preprocessed data is then stored in the data warehouse.
Data Selection: In this step, data relevant to the analysis task
are selected and retrieved from the database.
Data Transformation: In this step, various data aggregation and
data summary techniques are applied to transform the data into a
useful form for mining.
Data Mining: In this step, intelligent methods are applied in order to
extract data patterns.
Pattern Evaluation: In this step, the extracted data patterns are
evaluated, and the interesting ones are identified according to
interestingness measures.
Knowledge Representation: Visualization and knowledge
representation techniques are used to present the mined knowledge
to the users.
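The KDD steps can be read as a pipeline in which each step feeds the next. The
sketch below is a toy version under stated assumptions: the two data sources,
the field names and the threshold are hypothetical, and the "mining" step is
deliberately trivial (a mean per region) so that the flow of the steps stays
visible.

```python
from collections import defaultdict
from statistics import mean

# Two hypothetical sources with noisy records (illustrative data only).
store_sales = [
    {"customer": "c1", "region": "north", "amount": 120.0},
    {"customer": "c2", "region": "north", "amount": None},   # missing value
    {"customer": "c3", "region": "south", "amount": 80.0},
]
web_sales = [
    {"customer": "c1", "region": "north", "amount": 100.0},
    {"customer": "c4", "region": "south", "amount": -5.0},   # inconsistent value
    {"customer": "c5", "region": "south", "amount": 100.0},
]

def clean(records):
    """Data cleaning: drop records with missing or negative amounts."""
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: group amounts by region."""
    by_region = defaultdict(list)
    for r in records:
        by_region[r["region"]].append(r["amount"])
    return dict(by_region)

def mine(by_region):
    """Data mining: here, simply the mean amount per region."""
    return {region: mean(amounts) for region, amounts in by_region.items()}

def evaluate(patterns, threshold):
    """Pattern evaluation: keep patterns at or above the threshold."""
    return {k: v for k, v in patterns.items() if v >= threshold}

data = integrate(clean(store_sales), clean(web_sales))
data = select(data, ["region", "amount"])
patterns = mine(transform(data))
interesting = evaluate(patterns, threshold=100.0)

# Knowledge representation: present the surviving patterns to the user.
for region, avg in sorted(interesting.items()):
    print(f"{region}: average sale {avg:.2f}")
```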
Knowledge Discovery in Databases (KDD):
Some people treat data mining as a synonym for knowledge discovery, while
others view data mining as an essential step in the process of knowledge discovery.
Here is the list of steps involved in the knowledge discovery process:
• Data Cleaning - In this step, noise and inconsistent data are removed.
• Data Integration - In this step, multiple data sources are combined.
• Data Selection - In this step, data relevant to the analysis task are retrieved
from the database.
• Data Transformation - In this step, data are transformed or consolidated into
forms appropriate for mining by performing summary or aggregation operations.
• Data Mining - In this step, intelligent methods are applied in order to extract data
patterns.
• Pattern Evaluation - In this step, data patterns are evaluated.
• Knowledge Presentation - In this step, knowledge is represented.
Functionalities of BI:
Data mining systems are classified on the basis of
functionalities such as:
• Data Characterization: It is a summarization of the
general characteristics or features of the class of data that is
under study.
For example: To study the characteristics of software products
whose sales increased by 15% two years ago, one can
collect the data related to such products by running
SQL queries.
• Data Discrimination:
It compares the general features of the class under study (the target
class) with those of one or more contrasting classes.
The output of this process can be represented in many forms,
e.g., bar charts, curves and pie charts.
• Association and Correlation Analysis: Association analysis looks for
patterns where one event is connected to another event.
• Correlation Analysis:
Correlation is a mathematical technique that can show
whether and how strongly pairs of attributes are related to
each other. For example, taller people tend to weigh
more.
• Classification: Classification is the technique of categorizing
elements in a collection into predefined classes based on their
features and properties. Once built, the model can classify new instances
whose class is unknown. These methods can be used
to label future data.
• Prediction: It predicts unavailable data
values or upcoming trends. It can be a prediction of missing
numerical values or of increase/decrease trends in time-related
information.
• Clustering: It is similar to classification, but the classes are not
predefined. The classes are derived from the data attributes. It is
unsupervised learning. The objects are clustered, or grouped,
on the principle of maximizing the intra-class similarity and
minimizing the inter-class similarity.
• Outlier Analysis: Outliers are data elements that cannot be
grouped in a given class or cluster.
These are data objects whose behaviour differs from the
general behaviour of other data objects. The analysis of this type of
data can be essential to mine the knowledge.
• Evolution Analysis: It describes the trends for objects whose
behaviour changes over time.
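Two of the functionalities above, correlation analysis and outlier analysis,
are easy to illustrate numerically. The sketch below computes the Pearson
correlation coefficient for a small height/weight sample and flags outliers
with a simple z-score rule. The z-score rule is only one of many possible
outlier tests and is an assumption of this example, as are the data values and
the threshold of 2.0.

```python
from math import sqrt
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def z_score_outliers(values, threshold=2.0):
    """Return values more than threshold standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sample: taller people tend to weigh more.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]
r = pearson(heights_cm, weights_kg)

# Hypothetical purchase amounts with one value far from the rest.
purchases = [20, 22, 19, 21, 23, 20, 250]
outliers = z_score_outliers(purchases)

print(f"correlation: {r:.3f}")
print(f"outliers: {outliers}")
```

A coefficient close to +1 indicates a strong positive relationship, close to -1
a strong negative one, and close to 0 little linear relationship.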
Various risks in Data Mining:
Data Privacy
While data mining on its own doesn’t pose any ethical concerns,
leaked data and unprotected data can cause data privacy
concerns.
Examples:
Very personal information like intimate photos, credit scores, or bank
account log-in details have been leaked and caused real-life distress
to users. People can lose reputations, their life savings, and maybe
even their peace of mind in the process.
Ethical Dilemmas:
Information such as medical records, location tracking, or even
search history can be used to manipulate users into buying things.
Inaccurate Data:
At any given time, there are two main kinds of data available to data
miners: bad data and good data.
When companies don’t sift through data properly, they’re prone to
using incomplete, duplicated, or outdated data. That adds no value
to their businesses and wastes money in the
process.
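A basic quality check before mining can catch the three problems named above.
The sketch below is a minimal example under stated assumptions: the record
layout (id, email, updated), the cutoff date and the audit function are
hypothetical, and real pipelines usually need richer rules.

```python
from datetime import date

# Hypothetical customer records (illustrative data only).
records = [
    {"id": 1, "email": "a@example.com", "updated": date(2024, 5, 1)},
    {"id": 2, "email": None, "updated": date(2024, 4, 1)},             # incomplete
    {"id": 1, "email": "a@example.com", "updated": date(2024, 5, 1)},  # duplicated
    {"id": 3, "email": "c@example.com", "updated": date(2015, 1, 1)},  # outdated
]

def audit(records, cutoff):
    """Return the usable records and a count of each quality problem."""
    seen, good = set(), []
    problems = {"incomplete": 0, "duplicated": 0, "outdated": 0}
    for r in records:
        if any(value is None for value in r.values()):
            problems["incomplete"] += 1
        elif r["id"] in seen:
            problems["duplicated"] += 1
        elif r["updated"] < cutoff:
            problems["outdated"] += 1
        else:
            good.append(r)
        seen.add(r["id"])
    return good, problems

good, problems = audit(records, cutoff=date(2020, 1, 1))
print(len(good), problems)
```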
Overvaluing the Output:
There is a temptation to excuse unethical behavior when the outcome is good.
Advantages and disadvantages of data mining
Advantages | Disadvantages
Helps to make informed decisions | Rising privacy concerns
It helps detect risks and fraud | Data mining requires large databases
Helps to understand behaviours, trends and discover hidden patterns | Expensive