Data Mining Complete Notes

Data mining is a technique for extracting valuable information from large datasets, involving processes like data cleaning, integration, and pattern evaluation. It has historical roots in statistics, artificial intelligence, and machine learning, and is distinct from machine learning in its focus on extracting information rather than teaching algorithms. Data warehousing serves as a centralized repository for data analysis and reporting, enhancing decision-making through structured data management and analytics.

UNIT 1

Introduction to Data Mining

Data mining is one of the most useful techniques for helping entrepreneurs, researchers, and
individuals extract valuable information from huge sets of data. Data mining is also
called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes
data cleaning, data integration, data selection, data transformation, data mining, pattern
evaluation, and knowledge presentation.

Data mining is the process of extracting information from huge sets of data to identify patterns,
trends, and useful insights that allow a business to make data-driven decisions.

In other words, data mining is the process of investigating hidden patterns in data from various
perspectives and categorizing it into useful information. This information is collected and
assembled in areas such as data warehouses, where it supports efficient analysis and data mining
algorithms, helps decision-making, and ultimately aids cost-cutting and revenue generation.

Data mining is the act of automatically searching large stores of information for trends
and patterns that go beyond simple analysis procedures. Data mining uses complex
mathematical algorithms to segment the data and evaluate the probability of future events.

Data Mining is a process used by organizations to extract specific data from huge databases to
solve business problems. It primarily turns raw data into useful information.
Definition
I) Finding hidden information in a database.
II) Also called exploratory data analysis, data-driven discovery, and deductive learning.
III) Extracting meaningful information from a database.

Background of Data Mining

The term "data mining" was introduced in the 1990s, but data mining is the evolution of a
field with an extensive history.

Early techniques for identifying patterns in data include the Bayes theorem (1700s) and the
evolution of regression (1800s). The growing power of computer technology has
boosted data collection, storage, and manipulation as data sets have grown in size and
complexity. Explicit, hands-on data investigation has progressively been augmented with
indirect, automatic data processing and other computer science discoveries such as neural
networks, clustering, genetic algorithms (1950s), decision trees (1960s), and support vector
machines (1990s).

Data mining origins are traced back to three family lines: Classical statistics, Artificial
intelligence, and Machine learning.

Classical statistics:

Statistics is the basis of most of the technology on which data mining is built, such as regression
analysis, standard deviation, standard distribution, standard variance, discriminant analysis,
cluster analysis, and confidence intervals. All of these are used to analyze data and data
relationships.

(A regression is a statistical technique that relates a dependent variable to one or more
independent (explanatory) variables. A regression model can show whether changes
observed in the dependent variable are associated with changes in one or more of the
explanatory variables.)
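As a rough illustration of the idea, the short sketch below fits a simple linear regression with scikit-learn; the numbers (a made-up spend-versus-sales example) are purely hypothetical.

    # Minimal linear regression sketch (hypothetical data, for illustration only).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # independent (explanatory) variable
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])             # dependent variable

    model = LinearRegression().fit(X, y)
    print("slope:", model.coef_[0], "intercept:", model.intercept_)
    print("prediction for x = 6:", model.predict([[6.0]])[0])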

Artificial Intelligence:

AI, or artificial intelligence, is based on heuristics as opposed to statistics. It tries to apply
human-thought-like processing to statistical problems. Certain AI concepts were adopted by
some high-end commercial products, such as query optimization modules for Relational
Database Management Systems (RDBMS).

Machine Learning:

Machine learning is a combination of statistics and AI. It might be considered an evolution
of AI because it mixes AI heuristics with complex statistical analysis. Machine learning tries
to enable computer programs to learn about the data they are studying so that programs can make
decisions based on the characteristics of the data examined. It uses statistics for its basic
concepts and adds AI heuristics and algorithms to accomplish its target.

Inductive learning:

Inductive learning, also known as discovery learning, is a process where the learner
discovers rules by observing examples. This is different from deductive learning, where
students are given rules that they then need to apply.

Difference between data mining and machine learning

1. Data Mining: Extracts useful information from a large amount of data.
   Machine Learning: Introduces algorithms that learn from data as well as from past experience.

2. Data Mining: Used to understand the data flow.
   Machine Learning: Teaches the computer to learn and understand from the data flow.

3. Data Mining: Works on huge databases with unstructured data.
   Machine Learning: Works with existing data as well as algorithms.

4. Data Mining: Models can be developed using data mining techniques.
   Machine Learning: Machine learning algorithms can be used in decision trees, neural networks, and some other areas of artificial intelligence.

5. Data Mining: Involves more human interference.
   Machine Learning: No human effort is required after the design.

6. Data Mining: Used in cluster analysis.
   Machine Learning: Used in web search, spam filtering, fraud detection, and computer design.

7. Data Mining: Abstracts information from the data warehouse.
   Machine Learning: Reads from machines (existing data and algorithms).

8. Data Mining: More of a research activity using methods like machine learning.
   Machine Learning: A self-learned system that trains itself to do intelligent tasks.

9. Data Mining: Applied in limited areas.
   Machine Learning: Can be used in a vast area.


Factor-wise comparison of data mining and machine learning:

Origin
  Data Mining: Traditional databases with unstructured data.
  Machine Learning: Existing algorithms and data.

Meaning
  Data Mining: Extracting information from a huge amount of data.
  Machine Learning: Introducing new information from data as well as previous experience.

History
  Data Mining: In 1930, it was known as knowledge discovery in databases (KDD).
  Machine Learning: The first program, i.e., Samuel's checker-playing program, was established in 1950.

Responsibility
  Data Mining: Data mining is used to obtain rules from the existing data.
  Machine Learning: Machine learning teaches the computer how to learn and comprehend the rules.

Abstraction
  Data Mining: Data mining abstracts from the data warehouse.
  Machine Learning: Machine learning reads from machines.

Applications
  Data Mining: Compared to machine learning, data mining can produce outcomes on a smaller volume of data. It is also used in cluster analysis.
  Machine Learning: It needs a large amount of data to obtain accurate results. It has various applications and is used in web search, spam filtering, credit scoring, computer design, etc.

Nature
  Data Mining: It involves more human interference and tends towards manual work.
  Machine Learning: It is automated; once designed and implemented, there is no need for human effort.

Techniques involved
  Data Mining: Data mining is more of a research activity using techniques like machine learning.
  Machine Learning: It is a self-learned and trained system that does the task precisely.

Scope
  Data Mining: Applied in limited fields.
  Machine Learning: It can be used in a vast area.


UNIT 2
Introduction to Data Warehousing

Concept and benefits of data warehousing


Data warehouses store and process large amounts of data from various sources within a business.
An integral component of business intelligence (BI), data warehouses help businesses make
better, more informed decisions by applying data analytics to large volumes of information.

A data warehouse, or “enterprise data warehouse” (EDW), is a central repository system in
which businesses store valuable information, such as customer and sales data, for analytics and
reporting purposes.

Used to develop insights and guide decision-making via business intelligence (BI), data
warehouses often contain a combination of both current and historical data that has been
extracted, transformed, and loaded (ETL) from several sources, including internal and external
databases. Typically, a data warehouse acts as a business’s single source of truth (SSOT) by
centralizing data within a non-volatile and standardized system accessible to relevant
employees. Designed to facilitate online analytical processing (OLAP), and used for quick and
efficient multidimensional data analysis, data warehouses contain large stores of summarized
data that can sometimes be many petabytes in size.

OR

A data warehouse is a centralized repository for storing and managing large amounts of
data from various sources for analysis and reporting. It is optimized for fast querying and
analysis, enabling organizations to make informed decisions by providing a single source of
truth for data. Data warehousing typically involves transforming and integrating data from
multiple sources into a unified, organized, and consistent format.

Data warehouse benefits

Data warehouses provide many benefits to businesses. Some of the most common benefits
include:
• Provide a stable, centralized repository for large amounts of historical data
• Improve business processes and decision-making with actionable insights
• Increase a business’s overall return on investment (ROI)
• Improve data quality
• Enhance BI performance and capabilities by drawing on multiple sources
• Provide access to historical data business-wide
• Use AI and machine learning to improve business analytics
Data warehouse example

As data becomes more integral to the services that power our world, so too do warehouses
capable of housing and analysing large volumes of data. Whether you have realized it or not,
you likely use many of these services every day.

Here are some of the most common real-world examples of data warehouses being used today:

Health care

In recent decades, the health care industry has increasingly turned to data analytics to improve
patient care, efficiently manage operations, and reach business goals. As a result, data
scientists, data analysts, and health informatics professionals rely on data warehouses to store
and process large amounts of relevant health care data.

Banking

Open up a banking statement and you’ll likely see a long list of transactions: ATM
withdrawals, purchases, bill payments, and on and on. While the list of transactions might be
long for a single individual, they’re much longer for the many millions of customers who rely
on banking services every day. Rather than simply sitting on this wealth of data, banks use data
warehouses to store and analyze this data to develop actionable insights and improve their
service offerings.

Retail

Retailers – whether online or in-person – are always concerned about how much product
they’re buying, selling, and stocking. Today, data warehouses allow retailers to store large
amounts of transactional and customer information to help them improve their decision-making
when purchasing inventory and marketing products to their target market.

Types of Data Stored in a Data Warehouse

A data warehouse typically stores the following types of data:

• Historical data
• Derived data
• Metadata

These types of data are discussed individually.


Historical Data

A data warehouse typically contains several years of historical data. The amount of data that
you decide to make available depends on available disk space and the types of analysis that
you want to support. This data can come from your transactional database archives or other
sources.

Some applications might perform analyses that require data at lower levels than users
typically view it. You will need to check with the application builder or the application's
documentation for those types of data requirements.

Derived Data

Derived data is generated from existing data using a mathematical operation or a data
transformation. It can be created as part of a database maintenance operation or generated at
run-time in response to a query.
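A tiny sketch of derived data (the column names are hypothetical): a profit column computed from existing revenue and cost columns, either during loading or at query time.

    import pandas as pd

    # Hypothetical rows extracted from a transactional system.
    sales = pd.DataFrame({"revenue": [1200.0, 950.0, 1430.0],
                          "cost":    [800.0, 700.0, 1010.0]})

    # Derived data: a new column produced by a simple mathematical operation.
    sales["profit"] = sales["revenue"] - sales["cost"]
    print(sales)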

Metadata

Metadata is data that describes the data and schema objects and is used by applications to
fetch and compute the data correctly.

Characteristics of Data Warehousing


1. Subject-oriented – A data warehouse is always subject-oriented, as it delivers
   information about a theme rather than an organization’s current operations. It is
   organized around specific themes such as sales, distribution, or marketing, so the
   data warehousing process is designed to handle a well-defined subject.
2. Integrated – Integration is closely related to subject orientation: data from different
   source databases is brought into a consistent, reliable format. Integration means
   establishing shared conventions (naming, units, encoding) so that all similar data from
   the different databases can be merged and stored in the data warehouse in a commonly
   accepted manner.
3. Time-variant – Data is maintained across different intervals of time, such as
   weekly, monthly, or annually. The warehouse keeps data for defined time horizons
   that are much longer than those held in online transaction processing (OLTP) systems.
4. Non-volatile – As the name suggests, data residing in the data warehouse is
   permanent. This means that data is not erased or deleted when new data is inserted.

Processes in Data Warehousing


Data warehousing and data mining are closely related processes that are used to extract
valuable insights from large amounts of data. The data warehouse process is a multi-step
process that involves the following steps:
1. Data Extraction: The first step in the data warehouse process is to extract data
from various sources such as transactional systems, spreadsheets, and flat files.
2. Data Cleaning: After the data is extracted, it is cleaned to remove any
inconsistencies, errors, or duplicates. This step also includes data validation to
ensure that the data is accurate and complete.
3. Data Transformation: In this step, the extracted and cleaned data is
transformed into a format that is suitable for loading into the data warehouse.
This may involve converting data types, combining data from multiple sources,
or creating new data fields.
4. Data Loading: After the data is transformed, it is loaded into the data
warehouse. This step involves creating the physical data structures and loading
the data into the warehouse.
5. Data Indexing: After the data is loaded into the data warehouse, it is indexed to
make it easy to search and retrieve the data. This step also involves creating
summary tables and materialized views to improve query performance.
6. Data Maintenance: The final step in the data warehouse process is to maintain
the data and ensure that it is accurate and up-to-date. This may involve
periodically refreshing the data, archiving old data, and monitoring the data for
errors or inconsistencies.

Data is gathered from various sources such as hospitals, banks, organizations, and many
more and goes through a process called ETL (Extract, Transform, Load).
• Extract: This process reads the data from the database of various sources.
• Transform: It transforms the data stored inside the databases into data cubes so
that it can be loaded into the warehouse.

• Load: It is a process of writing the transformed data into the data warehouse.
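A minimal sketch of this ETL flow in Python with pandas, assuming a hypothetical transactions.csv source file with amount, order_date, and product columns, and using a local SQLite file to stand in for the warehouse.

    import sqlite3
    import pandas as pd

    # Extract: read raw records from a source system (file name is hypothetical).
    raw = pd.read_csv("transactions.csv")

    # Transform: clean the data and summarise it into a warehouse-friendly shape.
    raw = raw.drop_duplicates()                              # remove duplicate rows
    raw["amount"] = raw["amount"].fillna(0.0)                # handle missing values
    raw["order_date"] = pd.to_datetime(raw["order_date"])    # normalise data types
    monthly = (raw.assign(month=raw["order_date"].dt.to_period("M").astype(str))
                  .groupby(["month", "product"], as_index=False)["amount"]
                  .sum())                                    # lightly summarised data

    # Load: write the transformed data into the warehouse (SQLite stands in here).
    with sqlite3.connect("warehouse.db") as conn:
        monthly.to_sql("monthly_sales", conn, if_exists="replace", index=False)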

Online Transaction Processing (OLTP): Online transaction processing is a technique
used for the detailed, day-to-day transactions that occur continuously in a business. We can
describe OLTP as a large number of short, online transactions, in which detailed, current data
is stored in a transaction database, typically in Third Normal Form (3NF). It typically uses a
traditional database that handles insertion, deletion, and update operations while also
supporting query requirements.
Difference between Data Warehousing and Online-Transaction processing (OLTP):

Data Warehousing (DWH): It is a technique that gathers or collects data from different sources into a central repository.
OLTP: It is a technique used for detailed day-to-day transaction data, which keeps changing every day.

DWH: It is designed for the decision-making process.
OLTP: It is designed for the business transaction process.

DWH: It stores a large amount of data, or historical data.
OLTP: It holds current data.

DWH: It is used for analysing the business.
OLTP: It is used for running the business.

DWH: The size of the database is around 100 GB to 2 TB.
OLTP: The size of the database is around 10 MB to 100 GB.

DWH: Denormalized data is present.
OLTP: Normalized data is present.

DWH: It uses query processing.
OLTP: It uses transaction processing.

DWH: It is subject-oriented.
OLTP: It is application-oriented.

DWH: Data redundancy is present.
OLTP: There is no data redundancy.

Data warehouses and their architectures vary depending on the specifics of an organization's situation.

Three common architectures are:

o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic

Operational System

In data warehousing, an operational system is the term used for a system that processes the
day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and every file in the
system must have a different name.

Meta Data

A set of data that defines and gives information about other data.

Metadata is used in a data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make finding and working
with instances of data easier. For example, author, date built, date modified, and file size are
examples of very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.


Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance. The summarized
records are updated continuously as new information is loaded into the warehouse.

End-User Access Tools

The principal purpose of a data warehouse is to provide information to business managers for
strategic decision-making. These users interact with the warehouse using end-client
access tools.

Data Warehouse Architecture: With Staging Area

We must clean and process operational information before putting it into the warehouse.

We can do this programmatically, although most data warehouses use a staging area (a place
where data is processed before entering the warehouse).

A data warehouse staging area is a temporary location where records from source systems
are copied.
Data Warehouse Architecture: With Staging Area and Data Marts

We may want to customize our warehouse's architecture for multiple groups within our
organization.

We can do this by adding data marts. A data mart is a segment of a data warehouse that can
provide information for reporting and analysis on a section, unit, department, or operation in
the company, e.g., sales, payroll, production, etc.

For example, purchasing, sales, and stock data may be separated into their own data marts. A
financial analyst can then analyse historical data for purchases and sales, or mine historical
information to make predictions about customer behaviour.

Problems of Data Warehousing

The problems associated with developing and managing a data warehousing are as follows:

Underestimation of resources for data loading

Sometimes we underestimate the time required to extract, clean, and load the data into the
warehouse. It may take a significant proportion of the total development time, although tools
exist that reduce the time and effort spent on this process.
Hidden problems with source systems

Sometimes hidden problems associated with the source systems feeding the data warehouse
may be identified after years of being undetected. For example, when entering the details of a
new property, certain fields may allow nulls which may result in staff entering incomplete
property data, even when available and applicable.

Required data not captured

In some cases, data that is very important for the data warehouse is not captured by the source
systems. For example, the date of registration of a property may not be used in the source
system, but it may be very important for analysis purposes.

Increased end-user demands

After some end-user queries have been satisfied, requests for support from staff may increase
rather than decrease. This is caused by an increasing awareness among users of the capabilities
and value of the data warehouse. Another reason for the increasing demand is that once a data
warehouse is online, the number of users and queries typically grows, together with requests
for answers to more and more complex queries.

Data homogenization

Building a data warehouse requires making data formats similar across different data sources
(homogenization), which can result in the loss of some of the data's important value.

High demand for resources

The data warehouse can require large amounts of disk space and processing resources.

Data ownership

Data warehousing may change end-users' attitude toward the ownership of data. Sensitive data
owned by one department has to be loaded into the data warehouse for decision-making
purposes, but the department may be reluctant to do so because it hesitates to share its data
with others.

High maintenance

Data warehouses are high-maintenance systems. Any reorganization of the business processes
or the source systems may affect the data warehouse, resulting in a high maintenance cost.

Long-duration projects

Building a warehouse can take up to three years, which is why some organizations are
reluctant to invest in a data warehouse. Sometimes only the historical data of a particular
department is captured, resulting in data marts. Data marts support only the requirements of a
particular department and limit the functionality to that department or area.

Complexity of integration

The most important area for the management of a data warehouse is the integration capabilities.
An organization must spend a significant amount of time determining how well the various
different data warehousing tools can be integrated into the overall solution that is needed. This
can be a very difficult task, as there are a number of tools for every operation of the data
warehouse.

Why do we need Data Mart?

• A data mart helps to improve users' response time due to the reduction in the volume of data.
• It provides easy access to frequently requested data.
• Data marts are simpler to implement than a corporate data warehouse, and the cost of
  implementing a data mart is certainly lower than that of implementing a full data warehouse.
• Compared to a data warehouse, a data mart is agile: if the model changes, a data mart can be
  rebuilt more quickly because of its smaller size.
• Data can be segmented and stored on different hardware/software platforms.

Types of Data Mart


There are three main types of data mart:

1. Dependent: Dependent data marts are created by drawing data from an existing central
   data warehouse.
2. Independent: An independent data mart is created without the use of a central data
   warehouse, drawing data from operational or external sources, or both.
3. Hybrid: This type of data mart can take data from data warehouses or operational
   systems.
UNIT 3
Data Mining Functions

Classification
Classification in data mining is a common technique that separates data points into different
classes. It allows you to organize data sets of all sorts, including complex and large datasets
as well as small and simple ones.
It primarily involves using algorithms that you can easily modify to improve the data quality.
This is a big reason why supervised learning is particularly common in classification
techniques in data mining. The primary goal of classification is to connect a variable of
interest with the required predictor variables. The variable of interest should be qualitative
(categorical).
There are multiple types of classification algorithms, each with its own functionality and
application. All of those algorithms are used to extract information from a dataset. Which
algorithm you use for a particular task depends on the goal of the task and the kind of data
you need to extract.

Types of Classification Techniques in Data Mining


We can divide the classification algorithms into two categories:
1. Generative
2. Discriminative
Generative
A generative classification algorithm models the distribution of individual classes. It tries to
learn the model which creates the data through the estimation of distributions and
assumptions of the model. We can use generative algorithms to predict unseen data.
A prominent generative algorithm is the Naive Bayes Classifier.
Discriminative
A discriminative classification algorithm directly determines a class for a row of data. It models
using the observed data and depends on the quality of the data rather than on its distributions.
Logistic regression is an excellent example of a discriminative classifier.
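The sketch below is a small illustration (not part of the original notes' examples) contrasting a generative classifier (Gaussian Naive Bayes) with a discriminative one (logistic regression) on a synthetic dataset generated by scikit-learn.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB            # generative classifier
    from sklearn.linear_model import LogisticRegression   # discriminative classifier

    # Synthetic two-class data, purely for demonstration.
    X, y = make_classification(n_samples=300, n_features=4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for name, clf in [("Naive Bayes", GaussianNB()),
                      ("Logistic regression", LogisticRegression(max_iter=1000))]:
        clf.fit(X_train, y_train)
        print(name, "accuracy:", round(clf.score(X_test, y_test), 3))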

Associations

Association is a data mining technique that discovers the probability of the co-occurrence of
items in a collection. The relationships between co-occurring items are expressed as
Association Rules.

This data mining technique helps to discover a link between two or more items. It finds a hidden
pattern in the data set.

Association rules are if-then statements that help show the probability of interactions
between data items within large data sets in different types of databases. Association rule
mining has several applications and is commonly used to discover sales correlations in
transactional data or in medical data sets.
The way the algorithm works is that you start with a body of data, for example, a list of grocery
items you have been buying for the last six months, and it calculates the percentage of items
being purchased together.
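A minimal sketch of this idea in plain Python: the five grocery transactions below are invented, and the code computes the support and confidence of the rule {bread} -> {milk} by counting co-occurrences.

    from collections import Counter
    from itertools import combinations

    # Hypothetical market-basket data: each set is one customer transaction.
    transactions = [{"bread", "milk"},
                    {"bread", "butter", "milk"},
                    {"bread", "butter"},
                    {"milk", "butter"},
                    {"bread", "milk", "butter"}]

    item_counts, pair_counts = Counter(), Counter()
    for t in transactions:
        item_counts.update(t)
        pair_counts.update(combinations(sorted(t), 2))

    n = len(transactions)
    support = pair_counts[("bread", "milk")] / n                        # P(bread and milk)
    confidence = pair_counts[("bread", "milk")] / item_counts["bread"]  # P(milk | bread)
    print(f"{{bread}} -> {{milk}}: support={support:.2f}, confidence={confidence:.2f}")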

Sequential pattern mining in data mining

Sequential pattern mining is a topic of data mining concerned with finding statistically relevant
patterns between data examples where the values are delivered in a sequence. It is usually
presumed that the values are discrete, and thus time series mining is closely related, but usually
considered a different activity.

The sequential pattern is a data mining technique specialized for evaluating sequential data to
discover sequential patterns. It comprises finding interesting subsequences in a set of
sequences, where the value of a sequence can be measured in terms of criteria such as
length, occurrence frequency, etc.

In other words, this technique of data mining helps to discover or recognize similar patterns in
transaction data over some time.
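As a toy sketch (the purchase sequences are invented), the code below counts how many customer sequences contain a given pattern as an in-order subsequence, which is the basic support measure used in sequential pattern mining.

    def contains(sequence, pattern):
        """True if `pattern` occurs in `sequence` in the same order (gaps allowed)."""
        it = iter(sequence)
        return all(item in it for item in pattern)

    # Hypothetical purchase sequences, one per customer.
    sequences = [["phone", "case", "charger"],
                 ["phone", "headphones", "charger"],
                 ["laptop", "mouse"]]

    pattern = ["phone", "charger"]
    support = sum(contains(s, pattern) for s in sequences) / len(sequences)
    print("support of", pattern, "=", round(support, 2))   # 2 of 3 sequences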

Clustering

Clustering uses machine learning (ML) algorithms to identify similarities in customer data.
The algorithms review your customer data, note similarities humans might have missed, and
put customers in clusters based on patterns in their behaviour.

Clustering analysis is a data mining technique for identifying similar data. This technique helps
to recognize the differences and similarities between the data. Clustering is very similar to
classification, but it involves grouping chunks of data together based on their similarities.

Segmentation

When a marketer chooses to pull certain groups from a large body of data, that’s
segmentation. Put another way, it’s when you look at your customer data and pick out
specific criteria to target a group.
UNIT 4
Data Mining Techniques

1. Cluster Analysis: Cluster analysis is a technique used to group similar data points
together based on their characteristics. It is commonly used in customer segmentation,
market research, and image processing.

2. Induction: Induction is a technique used to learn rules or patterns from data. It involves
analyzing a set of training examples to build a model that can be used to predict
outcomes for new data points.

3. Decision Trees: Decision trees are a type of model that can be used for both
classification and regression tasks. They involve recursively splitting the data based on
the most informative features to make decisions about the target variable.

4. Rule Induction: Rule induction is a technique used to learn rules from data. It involves
analyzing a set of training examples to identify common patterns or rules that can be
used to make predictions.

5. Neural Networks: Neural networks are a type of machine learning algorithm that are
inspired by the structure and function of the human brain. They can be used for both
supervised and unsupervised learning tasks, and are particularly effective at tasks
involving image and speech recognition.

6. Online Analytical Processing: Online analytical processing (OLAP) is a technique
   used for interactive analysis of large datasets. It involves storing data in a
   multidimensional format that can be easily queried and visualized using tools such as
   pivot tables and charts. OLAP is commonly used for business intelligence and
   reporting.

These techniques are all commonly used in data mining and machine learning. The choice of
technique will depend on the specific problem being addressed and the nature of the data.
Cluster Analysis: Cluster analysis is a technique used to group similar data points together
based on their characteristics.
The goal is to find groups or clusters of data points that are like each other but
different from those in other clusters.
Cluster analysis can be performed using different methods, such as hierarchical
clustering or k-means clustering.
In hierarchical clustering, data points are grouped together based on their similarity,
and the groups are combined into larger clusters until all data points are in a single
cluster.
In k-means clustering, the number of clusters is pre-defined, and the algorithm
assigns each data point to the nearest cluster centroid.
Cluster analysis is often used in customer segmentation, market research, and image
processing.
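A minimal k-means sketch with scikit-learn; the customer features (annual spend, visits per month) are hypothetical.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer data: [annual spend, visits per month].
    X = np.array([[200, 2], [220, 3], [800, 10], [750, 12], [400, 5], [420, 6]])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("cluster labels:", kmeans.labels_)        # cluster assigned to each customer
    print("cluster centroids:\n", kmeans.cluster_centers_)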

Induction: Induction is a technique used to learn rules or patterns from data.


It involves analysing a set of training examples to build a model that can be used to
predict outcomes for new data points.
Induction can be performed using different methods, such as decision trees or rule
induction.
In decision trees, the data is recursively split based on the most informative features
to make decisions about the target variable.
In rule induction, the algorithm learns a set of rules that can be used to classify new
instances based on their features.
Induction is commonly used in machine learning and predictive modelling.

Decision Trees: Decision trees are a type of model that can be used for both classification
and regression tasks.
Classification: A classification problem is when the output variable is a category, such as
“Red” or “blue”, “disease” or “no disease”.
Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
They involve recursively splitting the data based on the most informative features
to make decisions about the target variable.
Each internal node of the tree represents a decision based on a feature, and each leaf
node represents a classification or regression outcome.
Decision trees are easy to interpret and can handle both categorical and continuous
features.
However, they can be prone to overfitting and may not perform well on complex
datasets.
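A short decision-tree sketch using scikit-learn's built-in iris dataset; the printed text shows each internal node as a decision on a feature and each leaf as a class outcome.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

    # Each internal node below is a split on a feature; each leaf is a class.
    print(export_text(tree, feature_names=iris.feature_names))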

Rule Induction: Rule induction is a technique used to learn rules from data.
It involves analysing a set of training examples to identify common patterns or rules
that can be used to make predictions.
Rule induction can be performed using different methods, such as association rule
mining or decision rule learning.
In association rule mining, the algorithm discovers relationships between different
variables, such as items frequently purchased together in a market basket analysis.
In decision rule learning, the algorithm learns a set of if-then rules that can be used
to classify new instances based on their features (for example, to answer a what-if
question such as: if I raise the price of this pen, what will its purchase rate be?).
Rule induction is commonly used in data mining and predictive modelling.

Neural Networks: Neural networks are a type of machine learning algorithm that are
inspired by the structure and function of the human brain.
They can be used for both supervised and unsupervised learning tasks and are
particularly effective at tasks involving image and speech recognition.
A neural network is composed of multiple layers of interconnected nodes or
neurons, and each neuron performs a simple mathematical operation on its inputs.
The weights and biases of the neurons are learned from the training data, allowing
the network to make predictions for new instances.
Neural networks are highly flexible and can handle complex datasets, but they can
be difficult to interpret and may require a large amount of training data.
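A tiny neural-network sketch using scikit-learn's multi-layer perceptron on a synthetic "two moons" dataset; all data here is generated for illustration only.

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Synthetic, non-linearly separable data.
    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Two hidden layers of 16 neurons each; weights and biases are learned from training data.
    mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
    mlp.fit(X_train, y_train)
    print("test accuracy:", round(mlp.score(X_test, y_test), 3))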

Online Analytical Processing: Online analytical processing (OLAP) is a technique used
for the interactive analysis of large datasets.
It involves storing data in a multidimensional format that can be easily queried and
visualized using tools such as pivot tables and charts.
OLAP is commonly used for business intelligence and reporting and can provide
insights into trends and patterns in the data.
OLAP allows users to slice and dice the data in different ways, such as by time,
geography, or product category, to gain a better understanding of the underlying
trends and relationships in the data.
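A small sketch of the "slice and dice" idea using a pandas pivot table; the sales records (year, region, product, amount) are hypothetical.

    import pandas as pd

    # Hypothetical sales records with time, geography, and product dimensions.
    sales = pd.DataFrame({
        "year":    [2023, 2023, 2023, 2024, 2024, 2024],
        "region":  ["North", "South", "North", "South", "North", "South"],
        "product": ["pens", "pens", "paper", "paper", "pens", "paper"],
        "amount":  [120, 90, 200, 150, 130, 170],
    })

    # Total sales by year and region, split by product (a simple OLAP-style cube view).
    cube = sales.pivot_table(values="amount", index=["year", "region"],
                             columns="product", aggfunc="sum", fill_value=0)
    print(cube)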

4.2

OLAP (Online Analytical Processing) is a technology that allows for complex analysis of
large amounts of data. It is typically used in business intelligence and data warehousing
applications to enable decision-makers to explore and analyse data in a multidimensional way.

Some examples of OLAP include:


1. Sales analysis: OLAP can be used to analyze sales data by product, region, time period, and
other dimensions to identify trends, opportunities, and areas for improvement.
2. Financial analysis: OLAP can be used to analyze financial data such as revenue, expenses,
and profit by business unit, product line, and other dimensions to identify areas for cost
reduction and revenue growth.
3. Inventory analysis: OLAP can be used to analyze inventory data by product, location, and
other dimensions to optimize inventory levels and improve supply chain management.
OLTP (Online Transaction Processing) is a technology used to manage and process
day-to-day business transactions in real-time. It is typically used in operational systems such
as banking, e-commerce, and order processing.

Some examples of OLTP include:

1. Online banking: OLTP is used to process transactions such as deposits, withdrawals, and
transfers in real-time.

2. E-commerce: OLTP is used to process online orders, inventory updates, and shipping
information in real-time.

3. Healthcare: OLTP is used to manage patient records, appointments, and billing information
in real-time.
The key difference between OLAP and OLTP is that OLAP is used for complex analysis of
large amounts of data, while OLTP is used for real-time transaction processing.

Comparisons of OLAP vs OLTP:


1. Definition
   OLAP: It is well known as an online database query management system.
   OLTP: It is well known as an online database modifying system.

2. Data source
   OLAP: Consists of historical data from various databases.
   OLTP: Consists only of operational, current data.

3. Method used
   OLAP: It makes use of a data warehouse.
   OLTP: It makes use of a standard database management system (DBMS).

4. Application
   OLAP: It is subject-oriented. Used for data mining, analytics, decision making, etc.
   OLTP: It is application-oriented. Used for business tasks.

5. Normalization
   OLAP: In an OLAP database, tables are not normalized.
   OLTP: In an OLTP database, tables are normalized (3NF).

6. Usage of data
   OLAP: The data is used in planning, problem-solving, and decision-making.
   OLTP: The data is used to perform day-to-day fundamental operations.

7. Task
   OLAP: It provides a multi-dimensional view of different business tasks.
   OLTP: It reveals a snapshot of present business tasks.

8. Purpose
   OLAP: It serves the purpose of extracting information for analysis and decision-making.
   OLTP: It serves the purpose of inserting, updating, and deleting information from the database.

9. Volume of data
   OLAP: A large amount of data is stored, typically in TB or PB.
   OLTP: The size of the data is relatively small, as the historical data is archived; for example, MB to GB.

10. Queries
    OLAP: Relatively slow, as the amount of data involved is large; queries may take hours.
    OLTP: Very fast, as the queries operate on only about 5% of the data.

11. Update
    OLAP: The OLAP database is not often updated. As a result, data integrity is unaffected.
    OLTP: The data integrity constraint must be maintained in an OLTP database.

12. Backup and recovery
    OLAP: It only needs backup from time to time as compared to OLTP.
    OLTP: The backup and recovery process is maintained rigorously.

13. Processing time
    OLAP: The processing of complex queries can take a lengthy time.
    OLTP: It is comparatively fast in processing because of simple and straightforward queries.

14. Types of users
    OLAP: This data is generally used by CEOs, MDs, and GMs.
    OLTP: This data is managed by clerks and managers.

15. Operations
    OLAP: Only read and rarely write operations.
    OLTP: Both read and write operations.

16. Updates
    OLAP: Data is refreshed on a regular basis with lengthy, scheduled batch operations.
    OLTP: The user initiates data updates, which are brief and quick.

17. Nature of audience
    OLAP: The process is market-oriented (focused on analysis).
    OLTP: The process is customer-oriented (focused on transactions).

18. Database design
    OLAP: Design with a focus on the subject.
    OLTP: Design with a focus on the application.

19. Productivity
    OLAP: Improves the efficiency of business analysts.
    OLTP: Enhances the end user's productivity.
Data visualization is the process of representing data graphically to help users understand and
analyse it. Data visualization tools are often used in conjunction with OLAP and OLTP systems
to enable decision-makers to explore and analyse data in a visual way.

Some examples of data visualization tools include:


1. Tableau: A data visualization tool that allows users to create interactive dashboards and
reports.
2. Microsoft Power BI: A business intelligence tool that enables users to visualize and analyze
data from multiple sources.
3. Google Data Studio: A free data visualization tool that allows users to create interactive
reports and dashboards.

In summary, OLAP and OLTP are two different technologies used for different purposes, and
data visualization is a useful tool for analyzing and exploring data in both contexts.

Data mining is the process of extracting valuable insights and knowledge from large amounts
of data. It involves the use of statistical and machine learning algorithms to discover patterns
and relationships in the data.
UNIT 5
Data Mining Applications

Some applications of data mining include:


1. Customer segmentation: Data mining can be used to segment customers into groups based
on their buying habits, demographics, and other factors. This can help businesses personalize
their marketing campaigns and improve customer retention.

2. Fraud detection: Data mining can be used to identify patterns of fraudulent behavior in
financial transactions, insurance claims, and other areas.

3. Healthcare: Data mining can be used to identify patterns and trends in medical data to
improve patient outcomes, reduce costs, and identify potential health risks.

4. Social media analysis: Data mining can be used to analyze social media data to understand
customer sentiment, identify trends, and improve social media marketing strategies.

Recent trends in data mining include:


1. Deep learning: Deep learning is a subset of machine learning that uses neural networks to
extract features from large amounts of data. It has been used in a variety of applications,
including image recognition, speech recognition, and natural language processing.

2. Big data analytics: With the increasing volume, velocity, and variety of data, big data
analytics has become a major trend in data mining. Big data analytics involves the use of
distributed computing and parallel processing to analyze large amounts of data.

3. Explainable AI: Explainable AI is a new trend in machine learning that focuses on making
AI models more transparent and understandable. This is particularly important in applications
where decisions have a significant impact on human lives, such as healthcare and finance.

4. Edge computing: Edge computing involves the processing of data at the edge of the
network, closer to the data source. This can help to reduce latency and improve the efficiency
of data processing.

In summary, data mining is a powerful technology that has numerous applications in various
industries. Recent trends in data mining include deep learning, big data analytics, explainable
AI, and edge computing.
