Data Mining Complete Notes
Data mining is one of the most useful techniques that help entrepreneurs, researchers, and
individuals to extract valuable information from huge sets of data. Data mining is also
called Knowledge Discovery in Database (KDD). The knowledge discovery process includes
Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern
evaluation, and Knowledge presentation.
The process of extracting information from huge sets of data to identify patterns, trends, and useful insights that allow a business to make data-driven decisions is called Data Mining.
In other words, Data Mining is the process of investigating hidden patterns of information from various perspectives and categorizing it into useful data. This data is collected and assembled in areas such as data warehouses, analysed efficiently with data mining algorithms, and used to support decision-making and other data requirements, ultimately cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information for trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment data and evaluate the probability of future events. Data Mining is also called Knowledge Discovery in Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases to
solve business problems. It primarily turns raw data into useful information.
Definition
I) Finding hidden information in the database
II) Also called exploratory data analysis, data-driven discovery, and deductive learning
III) Extracting meaningful information from the database.
The term "Data Mining" was introduced in the 1990s, but data mining is the evolution of a field with an extensive history.
Early techniques for identifying patterns in data include Bayes' theorem (1700s) and the evolution of regression (1800s). The generation and growing power of computers have boosted data collection, storage, and manipulation as data sets have grown in size and complexity. Explicit, hands-on data investigation has progressively been augmented with indirect, automatic data processing and other computer science discoveries such as neural networks, clustering, and genetic algorithms (1950s), decision trees (1960s), and support vector machines (1990s).
Data mining origins are traced back to three family lines: Classical statistics, Artificial
intelligence, and Machine learning.
Classical statistics:
Statistics form the basis of most of the technologies on which data mining is built, such as regression analysis, standard deviation, standard distribution, standard variance, discriminant analysis, cluster analysis, and confidence intervals. All of these are used to analyze data and data relationships.
Artificial Intelligence:
Artificial intelligence is built on heuristics rather than pure statistics; it attempts to apply human-thought-like processing to statistical problems.
Machine Learning:
Machine learning combines statistics and artificial intelligence, allowing programs to learn from the data they study rather than relying only on fixed rules.
Inductive learning:
Inductive learning, also known as discovery learning, is a process where the learner
discovers rules by observing examples. This is different from deductive learning, where
students are given rules that they then need to apply.
Data Mining vs. Machine Learning
• Meaning: Data mining extracts information from a huge amount of data; machine learning introduces new information from data as well as previous experience.
• History: The term data mining emerged in the 1990s, growing out of knowledge discovery in databases (KDD); the first machine learning program, Samuel's checker-playing program, was developed in the 1950s.
• Responsibility: Data mining is used to obtain the rules from the existing data; machine learning teaches the computer how to learn and comprehend the rules.
• Abstraction: Data mining abstracts from the data warehouse; machine learning reads from machines.
• Techniques involved: Data mining is more of a research activity using techniques like machine learning; machine learning is a self-learned and trained system that does the task precisely.
Data Warehouse
Used to develop insights and guide decision-making via business intelligence (BI), data
warehouses often contain a combination of both current and historical data that has been
extracted, transformed, and loaded (ETL) from several sources, including internal and external
databases. Typically, a data warehouse acts as a business’s single source of truth (SSOT) by
centralizing data within a non-volatile and standardized system accessible to relevant
employees. Designed to facilitate online analytical processing (OLAP) and used for quick and efficient multidimensional data analysis, data warehouses contain large stores of summarized data that can sometimes be many petabytes in size.
OR
A data warehouse is a centralized repository for storing and managing large amounts of
data from various sources for analysis and reporting. It is optimized for fast querying and
analysis, enabling organizations to make informed decisions by providing a single source of
truth for data. Data warehousing typically involves transforming and integrating data from
multiple sources into a unified, organized, and consistent format.
Data warehouses provide many benefits to businesses. Some of the most common benefits
include:
• Provide a stable, centralized repository for large amounts of historical data
• Improve business processes and decision-making with actionable insights
• Increase a business’s overall return on investment (ROI)
• Improve data quality
• Enhance BI performance and capabilities by drawing on multiple sources
• Provide access to historical data business-wide
• Use AI and machine learning to improve business analytics
Data warehouse example
As data becomes more integral to the services that power our world, so too do warehouses
capable of housing and analysing large volumes of data. Whether you have realized it or not,
you likely use many of these services every day.
Here are some of the most common real-world examples of data warehouses being used today:
Health care
In recent decades, the health care industry has increasingly turned to data analytics to improve
patient care, efficiently manage operations, and reach business goals. As a result, data
scientists, data analysts, and health informatics professionals rely on data warehouses to store
and process large amounts of relevant health care data.
Banking
Open up a banking statement and you’ll likely see a long list of transactions: ATM
withdrawals, purchases, bill payments, and on and on. While the list of transactions might be
long for a single individual, they’re much longer for the many millions of customers who rely
on banking services every day. Rather than simply sitting on this wealth of data, banks use data
warehouses to store and analyze this data to develop actionable insights and improve their
service offerings.
Retail
Retailers – whether online or in-person – are always concerned about how much product
they’re buying, selling, and stocking. Today, data warehouses allow retailers to store large
amounts of transactional and customer information to help them improve their decision-making
when purchasing inventory and marketing products to their target market.
The data stored in a data warehouse generally falls into three categories:
• Historical data
• Derived data
• Metadata
Historical Data
A data warehouse typically contains several years of historical data. The amount of data that
you decide to make available depends on available disk space and the types of analysis that
you want to support. This data can come from your transactional database archives or other
sources.
Some applications might perform analyses that require data at lower levels than users
typically view it. You will need to check with the application builder or the application's
documentation for those types of data requirements.
Derived Data
Derived data is generated from existing data using a mathematical operation or a data
transformation. It can be created as part of a database maintenance operation or generated at
run-time in response to a query.
Metadata
Metadata is data that describes the data and schema objects and is used by applications to
fetch and compute the data correctly.
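For illustration, a warehouse application might keep such descriptive information as a small record that it consults before querying the data. This is only a sketch; the table name, columns, source system, and refresh date below are invented examples, not a standard format.

```python
# A minimal sketch of table-level metadata for a warehouse table.
# The table name, columns, source system, and refresh date are hypothetical.
sales_fact_metadata = {
    "table": "sales_fact",
    "columns": {
        "sale_id": "INTEGER",
        "region": "TEXT",
        "amount": "REAL",
        "sale_date": "DATE",
    },
    "source_system": "retail_oltp",   # where the data was extracted from
    "last_refreshed": "2024-01-01",   # when the ETL job last loaded it
}

def describe(metadata):
    """Print a human-readable summary that a reporting tool could show."""
    cols = ", ".join(f"{name} ({dtype})" for name, dtype in metadata["columns"].items())
    print(f"{metadata['table']}: {cols}")
    print(f"loaded from {metadata['source_system']} on {metadata['last_refreshed']}")

describe(sales_fact_metadata)
```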
Data is gathered from various sources such as hospitals, banks, organizations, and many
more and goes through a process called ETL (Extract, Transform, Load).
• Extract: This process reads the data from the database of various sources.
• Transform: It transforms the data stored inside the databases into data cubes so
that it can be loaded into the warehouse.
• Load: It is a process of writing the transformed data into the data warehouse.
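Below is a minimal, self-contained sketch of these three steps in Python, using invented transaction records and an in-memory SQLite table as a stand-in warehouse; the table and column names are illustrative only.

```python
import sqlite3

# Extract: in practice this reads from source databases or flat files;
# here a few hand-made transaction records stand in for the source system.
source_rows = [
    {"region": "North", "amount": "120.50", "date": "2024-01-03"},
    {"region": "South", "amount": "80.00",  "date": "2024-01-03"},
    {"region": "North", "amount": "35.25",  "date": "2024-01-04"},
]

# Transform: clean the types and summarize by (region, date), a very small "cube".
summary = {}
for row in source_rows:
    key = (row["region"], row["date"])
    summary[key] = summary.get(key, 0.0) + float(row["amount"])

# Load: write the transformed records into a warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_summary (region TEXT, date TEXT, total REAL)")
warehouse.executemany(
    "INSERT INTO sales_summary VALUES (?, ?, ?)",
    [(region, date, total) for (region, date), total in summary.items()],
)
warehouse.commit()

print(warehouse.execute("SELECT * FROM sales_summary ORDER BY region").fetchall())
```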
Operational System
An operational system is a method used in data warehousing to refer to a system that is used
to process the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the
system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata summarizes necessary information about data, which can make finding and working with instances of data easier. For example, the author, date created, date modified, and file size are examples of very basic document metadata.
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized records are updated continuously as new data is loaded into the warehouse.
The principal purpose of a data warehouse is to provide information to business managers for
strategic decision-making. These customers interact with the warehouse using end-client
access tools.
We must clean and process the operational data before putting it into the warehouse.
We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).
A data warehouse staging area is a temporary location where records from the source systems are copied.
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups within our
organization.
We can do this by adding data marts. A data mart is a segment of a data warehouse that can
provide information for reporting and analysis on a section, unit, department, or operation in
the company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In this
example, a financial analyst wants to analyse historical data for purchases and sales or mine
historical information to make predictions about customer behaviour.
The problems associated with developing and managing a data warehouse are as follows:
Underestimation of data loading time
Sometimes we underestimate the time required to extract, clean, and load the data into the warehouse. It can take a significant proportion of the total development time, although some tools exist to reduce the time and effort spent on this process.
Hidden problems with source systems
Sometimes hidden problems associated with the source systems feeding the data warehouse are identified only after years of going undetected. For example, when entering the details of a new property, certain fields may allow nulls, which may result in staff entering incomplete property data even when it is available and applicable.
Required data not captured
In some cases the required data is not captured by the source systems, even though it may be very important for the data warehouse's purpose. For example, the date of registration for a property may not be used in the source system, but it may be very important for analysis purposes.
Increased end-user demands
After satisfying some end-user queries, requests for support from staff may increase rather than decrease. This is caused by users' growing awareness of the capabilities and value of the data warehouse. Another reason for increasing demand is that once a data warehouse is online, the number of users and queries often increases, together with requests for answers to more and more complex queries.
Data homogenization
The concept of a data warehouse requires similar data formats across different data sources. This can result in the loss of some important value in the data.
Data ownership
Data warehousing may change end-users' attitudes toward the ownership of data. Sensitive data owned by one department has to be loaded into the data warehouse for decision-making purposes. Sometimes this results in reluctance from that department, because it may hesitate to share the data with others.
High maintenance
Data warehouses are high-maintenance systems. Any reorganization of the business processes or the source systems may affect the data warehouse, and this results in high maintenance costs.
Long-duration projects
Building a warehouse can take up to three years, which is why some organizations are reluctant to invest in a data warehouse. Sometimes only the historical data of a particular department is captured, resulting in data marts. Data marts support only the requirements of a particular department and limit the functionality to that department or area.
Complexity of integration
The most important area for the management of a data warehouse is the integration capabilities.
An organization must spend a significant amount of time determining how well the various
different data warehousing tools can be integrated into the overall solution that is needed. This
can be a very difficult task, as there are a number of tools for every operation of the data
warehouse.
Benefits of a data mart:
• A data mart helps to enhance users' response time due to the reduced volume of data.
• It provides easy access to frequently requested data.
• Data marts are simpler to implement than a corporate data warehouse. At the same time, the cost of implementing a data mart is certainly lower than implementing a full data warehouse.
• Compared to a data warehouse, a data mart is agile. In case of a change in the model, a data mart can be rebuilt more quickly due to its smaller size.
• Data can be segmented and stored on different hardware/software platforms.
Types of data marts:
1. Dependent: Dependent data marts are created by drawing data from an existing central data warehouse.
2. Independent: An independent data mart is a standalone system created without a central data warehouse, drawing data directly from operational sources, external sources, or both.
3. Hybrid: This type of data mart can take data from both data warehouses and operational systems.
UNIT 3
Data Mining Functions
Classification
Classification in data mining is a common technique that separates data points into different
classes. It allows you to organize data sets of all sorts, including complex and large datasets
as well as small and simple ones.
It primarily involves using algorithms that you can easily modify to improve data quality. This is a big reason why supervised learning is particularly common with classification techniques in data mining. The primary goal of classification is to connect a variable of interest with the required variables. The variable of interest should be qualitative.
There are multiple types of classification algorithms, each with its unique functionality and
application. All those algorithms are used to extract data from a dataset. Which application
you use for a particular task depends on the goal of the task and the kind of data you need to
extract.
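The sketch below illustrates the basic idea with a deliberately simple nearest-centroid rule on invented pet-measurement data; it is not any particular library's algorithm, only a toy example of assigning new points to one of a fixed set of classes.

```python
# Toy classification: assign a point to the class whose centroid is closest.
# The features (length_cm, weight_kg) and labels are made-up illustrative data.
training = {
    "cat": [(25, 4.0), (24, 3.5), (26, 4.5)],
    "dog": [(50, 20.0), (55, 25.0), (48, 18.0)],
}

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

centroids = {label: centroid(pts) for label, pts in training.items()}

def classify(point):
    """Return the label of the nearest class centroid (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(point, centroids[label]))

print(classify((27, 5.0)))   # expected: "cat"
print(classify((52, 22.0)))  # expected: "dog"
```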
Associations
Association is a data mining technique that discovers the probability of the co-occurrence of
items in a collection. The relationships between co-occurring items are expressed as
Association Rules.
This data mining technique helps to discover a link between two or more items. It finds a hidden
pattern in the data set.
Association rules are if-then statements that help show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to discover sales correlations in transactional data or in medical data sets.
The way the algorithm works is that you start with a body of data, for example a list of grocery items that you have been buying for the last six months. It then calculates the percentage of items being purchased together.
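A rough sketch of that co-occurrence calculation, using invented basket data, might look like this:

```python
from itertools import combinations

# Made-up market baskets; each inner list is one shopping trip.
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "milk", "butter"],
]

# Support: the percentage of baskets in which a pair of items appears together.
pair_counts = {}
for basket in baskets:
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] = pair_counts.get(pair, 0) + 1

for pair, count in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
    print(pair, f"appears together in {100 * count / len(baskets):.0f}% of baskets")
```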
Sequential Patterns
Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns in data where the values are delivered in a sequence. It is usually presumed that the values are discrete; time series mining is therefore closely related, but usually considered a different activity.
The sequential pattern technique is specialized for evaluating sequential data to discover sequential patterns. It consists of finding interesting subsequences in a set of sequences, where the value of a subsequence can be measured in terms of different criteria such as length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar patterns in
transaction data over some time.
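As a minimal illustration (with invented purchase sequences), the support of a candidate sequential pattern can be counted as the fraction of sequences that contain its items in order, with gaps allowed:

```python
# Each customer history is an ordered sequence of purchases (invented data).
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone", "charger"],
    ["phone", "headphones"],
]

def contains(sequence, pattern):
    """True if `pattern` appears in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

pattern = ["phone", "charger"]
support = sum(contains(seq, pattern) for seq in sequences) / len(sequences)
print(f"{pattern} occurs in {support:.0%} of sequences")  # expected: 75%
```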
Clustering
Clustering uses machine learning (ML) algorithms to identify similarities in customer data.
The algorithms review your customer data, note similarities humans might have missed, and
put customers in clusters based on patterns in their behaviour.
Clustering analysis is a data mining technique used to identify similar data. This technique helps to recognize the differences and similarities between data. Clustering is similar to classification, but it involves grouping chunks of data together based on their similarities.
Segmentation
When a marketer chooses to pull certain groups from a large body of data, that’s
segmentation. Put another way, it’s when you look at your customer data and pick out
specific criteria to target a group.
UNIT 4
Data Mining Techniques
1. Cluster Analysis: Cluster analysis is a technique used to group similar data points
together based on their characteristics. It is commonly used in customer segmentation,
market research, and image processing.
2. Induction: Induction is a technique used to learn rules or patterns from data. It involves
analyzing a set of training examples to build a model that can be used to predict
outcomes for new data points.
3. Decision Trees: Decision trees are a type of model that can be used for both
classification and regression tasks. They involve recursively splitting the data based on
the most informative features to make decisions about the target variable.
4. Rule Induction: Rule induction is a technique used to learn rules from data. It involves
analyzing a set of training examples to identify common patterns or rules that can be
used to make predictions.
5. Neural Networks: Neural networks are a type of machine learning algorithm that are
inspired by the structure and function of the human brain. They can be used for both
supervised and unsupervised learning tasks, and are particularly effective at tasks
involving image and speech recognition.
These techniques are all commonly used in data mining and machine learning. The choice of
technique will depend on the specific problem being addressed and the nature of the data.
Cluster Analysis: Cluster analysis is a technique used to group similar data points together
based on their characteristics.
The goal is to find groups or clusters of data points that are like each other but
different from those in other clusters.
Cluster analysis can be performed using different methods, such as hierarchical
clustering or k-means clustering.
In hierarchical clustering, data points are grouped together based on their similarity,
and the groups are combined into larger clusters until all data points are in a single
cluster.
In k-means clustering, the number of clusters is pre-defined, and the algorithm
assigns each data point to the nearest cluster centroid.
Cluster analysis is often used in customer segmentation, market research, and image
processing.
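The following sketch assumes scikit-learn is installed and runs its KMeans implementation on a handful of invented customer records; the data and the choice of k = 2 are purely illustrative.

```python
from sklearn.cluster import KMeans

# Invented customer data: (age, annual spend in thousands).
points = [
    [22, 2.0], [25, 2.5], [23, 1.8],    # younger, low-spend group
    [48, 9.5], [52, 11.0], [50, 10.2],  # older, high-spend group
]

# The number of clusters is pre-defined (k = 2); each point is assigned to the
# nearest cluster centroid, and the centroids are re-estimated iteratively.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(model.labels_)           # cluster index assigned to each point
print(model.cluster_centers_)  # the two centroids found
```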
Decision Trees: Decision trees are a type of model that can be used for both classification
and regression tasks.
Classification: A classification problem is when the output variable is a category, such as
“Red” or “blue”, “disease” or “no disease”.
Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
They involve recursively splitting the data based on the most informative features
to make decisions about the target variable.
Each internal node of the tree represents a decision based on a feature, and each leaf
node represents a classification or regression outcome.
Decision trees are easy to interpret and can handle both categorical and continuous
features.
However, they can be prone to overfitting and may not perform well on complex
datasets.
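A small sketch, assuming scikit-learn is available, showing a shallow decision tree on invented weather-style data; export_text is used only to display the learned splits, which is what makes trees easy to interpret.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: [temperature_C, humidity_%] -> play tennis or not.
X = [[30, 85], [27, 90], [21, 70], [18, 65], [25, 80], [19, 60]]
y = ["no", "no", "yes", "yes", "no", "yes"]

# Limiting the depth keeps the tree small and helps reduce overfitting.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each internal node is a test on a feature; each leaf is a predicted class.
print(export_text(tree, feature_names=["temperature", "humidity"]))
print(tree.predict([[20, 68]]))  # expected: ['yes']
```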
Rule Induction: Rule induction is a technique used to learn rules from data.
It involves analysing a set of training examples to identify common patterns or rules
that can be used to make predictions.
Rule induction can be performed using different methods, such as association rule
mining or decision rule learning.
In association rule mining, the algorithm discovers relationships between different
variables, such as items frequently purchased together in a market basket analysis.
In decision rule learning, the algorithm learns a set of if-then rules that can be used
to classify new instances based on their features.
(For example: if I raise the price of this pen, what will happen to its purchase rate?)
Rule induction is commonly used in data mining and predictive modelling.
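One simple flavour of decision rule learning is to build one if-then rule per value of a single feature and keep the feature whose rules make the fewest mistakes on the training examples. The sketch below is a OneR-style illustration on invented data, not a specific library's algorithm.

```python
from collections import Counter, defaultdict

# Invented examples: each record maps feature names to values, plus a label.
examples = [
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "no"},  "play"),
]

def one_r(examples):
    """Return (feature, {value: predicted_label}) with the fewest training errors."""
    best = None
    for feature in examples[0][0]:
        votes = defaultdict(Counter)
        for record, label in examples:
            votes[record[feature]][label] += 1
        # For each feature value, predict the majority label seen with that value.
        rule = {value: counts.most_common(1)[0][0] for value, counts in votes.items()}
        errors = sum(rule[record[feature]] != label for record, label in examples)
        if best is None or errors < best[0]:
            best = (errors, feature, rule)
    return best[1], best[2]

feature, rule = one_r(examples)
print(f"best single-feature rule uses '{feature}':", rule)
# expected: best single-feature rule uses 'windy': {'no': 'play', 'yes': 'stay'}
```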
Neural Networks: Neural networks are a type of machine learning algorithm that are
inspired by the structure and function of the human brain.
They can be used for both supervised and unsupervised learning tasks and are
particularly effective at tasks involving image and speech recognition.
A neural network is composed of multiple layers of interconnected nodes or
neurons, and each neuron performs a simple mathematical operation on its inputs.
The weights and biases of the neurons are learned from the training data, allowing
the network to make predictions for new instances.
Neural networks are highly flexible and can handle complex datasets, but they can
be difficult to interpret and may require a large amount of training data.
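A minimal single-neuron sketch on an invented dataset shows the core idea of learned weights and biases; real networks stack many such neurons into layers, but the weight-update loop is the same in spirit.

```python
# A single artificial neuron: weighted sum of inputs, then a step activation.
# Trained here on the logical AND function as a tiny, fully worked example.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights = [0.0, 0.0]
bias = 0.0
learning_rate = 0.1

def predict(x):
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > 0 else 0

# Repeatedly nudge the weights toward examples the neuron gets wrong.
for _ in range(20):
    for x, target in data:
        error = target - predict(x)
        weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
        bias += learning_rate * error

print([predict(x) for x, _ in data])  # expected: [0, 0, 0, 1]
```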
4.2 OLAP and OLTP
OLAP (Online Analytical Processing) is a technology that allows for complex analysis of
large amounts of data. It is typically used in business intelligence and data warehousing
applications to enable decision-makers to explore and analyse data in a multidimensional way.
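A crude illustration of the multidimensional idea in plain Python: invented sales facts are rolled up along two dimensions (region and quarter), and a slice then fixes one dimension.

```python
from collections import defaultdict

# Invented sales facts with two dimensions (region, quarter) and one measure.
facts = [
    ("North", "Q1", 100), ("North", "Q2", 150),
    ("South", "Q1", 80),  ("South", "Q2", 90),
    ("North", "Q1", 40),
]

# Roll-up: aggregate the measure over every (region, quarter) cell of the cube.
cube = defaultdict(int)
for region, quarter, amount in facts:
    cube[(region, quarter)] += amount
print(dict(cube))  # {('North', 'Q1'): 140, ('North', 'Q2'): 150, ...}

# Slice: fix one dimension (quarter = "Q1") and look at the remaining one.
q1_by_region = {region: total for (region, quarter), total in cube.items() if quarter == "Q1"}
print(q1_by_region)  # {'North': 140, 'South': 80}
```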
OLTP (Online Transaction Processing), by contrast, manages large numbers of short, real-time transactions. Common examples include:
1. Online banking: OLTP is used to process transactions such as deposits, withdrawals, and transfers in real-time.
2. E-commerce: OLTP is used to process online orders, inventory updates, and shipping
information in real-time.
3. Healthcare: OLTP is used to manage patient records, appointments, and billing information
in real-time.
The key difference between OLAP and OLTP is that OLAP is used for complex analysis of
large amounts of data, while OLTP is used for real-time transaction processing.
OLAP vs. OLTP
• Definition: OLAP is well known as an online database query management system; OLTP is well known as an online database modifying system.
• Task: OLAP provides a multi-dimensional view of different business tasks; OLTP reveals a snapshot of present business tasks.
• Processing time: In OLAP, the processing of complex queries can take a lengthy time; OLTP is comparatively fast in processing because of its simple and straightforward queries.
In summary, OLAP and OLTP are two different technologies used for different purposes, and
data visualization is a useful tool for analyzing and exploring data in both contexts.
Data mining is the process of extracting valuable insights and knowledge from large amounts
of data. It involves the use of statistical and machine learning algorithms to discover patterns
and relationships in the data.
UNIT 5
Data Mining Applications
2. Fraud detection: Data mining can be used to identify patterns of fraudulent behavior in
financial transactions, insurance claims, and other areas.
3. Healthcare: Data mining can be used to identify patterns and trends in medical data to
improve patient outcomes, reduce costs, and identify potential health risks.
4. Social media analysis: Data mining can be used to analyze social media data to understand
customer sentiment, identify trends, and improve social media marketing strategies.
Recent trends in data mining include the following:
1. Deep learning: Deep learning uses multi-layered neural networks to learn complex patterns from very large datasets and has become a major trend in data mining.
2. Big data analytics: With the increasing volume, velocity, and variety of data, big data analytics has become a major trend in data mining. Big data analytics involves the use of distributed computing and parallel processing to analyze large amounts of data.
3. Explainable AI: Explainable AI is a new trend in machine learning that focuses on making
AI models more transparent and understandable. This is particularly important in applications
where decisions have a significant impact on human lives, such as healthcare and finance.
4. Edge computing: Edge computing involves the processing of data at the edge of the
network, closer to the data source. This can help to reduce latency and improve the efficiency
of data processing.
In summary, data mining is a powerful technology that has numerous applications in various
industries. Recent trends in data mining include deep learning, big data analytics, explainable
AI, and edge computing.