FDS Unit 1 Notes
UNIT-1
DATA MINING AND DATA WAREHOUSE
Chapter 1: Data Mining
Data science is the process of using data to understand patterns, make decisions, and solve
problems. It combines skills from math, statistics, computer science, and domain knowledge to
analyze large sets of data.
Data scientists clean and organize data, find trends, and use algorithms or models to make
predictions or recommendations.
For example, in a food delivery app, data science can analyze your past orders to suggest meals you
might like. It helps businesses make smarter decisions by understanding and using data effectively.
Introduction:
We live in a world where vast amounts of data are collected daily. Analysing such data is an
important need.
“We are living in the information age” is a popular saying; however, we are actually living in the data
age. Terabytes (1TB = 1,024 GB) or petabytes (1 PB = 1,024 TB) of data pour into our computer
networks, the World Wide Web (WWW), and various data storage devices every day from business,
society, science and engineering, medicine, and almost every other aspect of daily life.
There is a huge amount of data available in the Information Industry. This data is of no use until it is
converted into useful information. It is necessary to analyze this huge amount of data and extract
useful information from it.
Data mining is one of the most useful techniques that help entrepreneurs, researchers, and
individuals to extract valuable information from huge sets of data.
Figure 1.2: We are data rich, but information poor.
Figure 1.3: Data mining—searching for knowledge (interesting patterns) in your data.
Data mining is the process of extracting or “mining” knowledge (Information) from large amounts of
data or datasets using techniques such as machine learning and statistical analysis. The data can be
structured, semi-structured or unstructured, and can be stored in various forms such as databases
and data warehouses.
The goal of data mining is to extract useful information from large datasets and use it to make
predictions or inform decision-making. This involves exploring the data using various techniques such
as clustering, classification, regression analysis, association rule mining, and anomaly detection
Data mining has a wide range of applications across various industries, including marketing,
finance, healthcare, and telecommunications. Data mining is also called Knowledge Discovery from
Data (KDD). The knowledge discovery process includes Data cleaning, Data integration, Data
selection, Data transformation, Data mining, Pattern evaluation, and Knowledge presentation
KDD (Knowledge Discovery in Databases) is the process of discovering valid, novel, and useful
patterns in large datasets. It involves multiple steps like data selection, cleaning, transformation,
mining, evaluation, and interpretation to extract valuable insights that can guide decision-making.
Data mining is a stage in the KDD process.
1. Data cleaning: Data cleaning is defined as the removal of noisy data (errors in the data) and
irrelevant data from the collection.
2. Data integration: Data integration is defined as combining heterogeneous data from multiple
sources into a common source (Data Warehouse). Data integration is performed using Data
Migration tools (transferring data from one system to another) and Data Synchronization tools
(the ongoing process of keeping data consistent between two or more devices).
3. Data selection: Data selection is defined as the process where data relevant to the analysis is
decided and retrieved from the data collection. For this we can use Decision Trees, Clustering,
and Regression (predict numerical values) methods.
4. Data transformation: Data Transformation is defined as the process of transforming data
into the appropriate form required by the mining procedure.
5. Data mining: Data mining is defined as the application of techniques to extract potentially
useful patterns. It transforms task-relevant data into patterns; it is an essential process where
intelligent methods are applied in order to extract data patterns.
6. Pattern evaluation: The process of evaluating the quality of discovered patterns and identifying
the truly interesting patterns representing knowledge, based on some interestingness measures.
7. Knowledge presentation: Where visualization and knowledge representation techniques are
used to present the mined knowledge to the user
Steps 1 to 4 are different forms of data pre-processing, where the data are prepared for mining. The
data mining step may interact with the user or a knowledge base. The interesting patterns are
presented to the user
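A minimal sketch of how the pre-processing steps 1-4 might look in practice, assuming pandas and two hypothetical CSV files (customers.csv and sales.csv) with made-up column names:

```python
import pandas as pd

# Hypothetical sources: two CSV files with customer and sales records.
customers = pd.read_csv("customers.csv")   # assumed columns: id, name, age, income
sales = pd.read_csv("sales.csv")           # assumed columns: id, amount, date

# 1. Data cleaning: drop rows with missing values and obviously noisy ages.
customers = customers.dropna()
customers = customers[customers["age"].between(0, 120)]

# 2. Data integration: combine the two sources on a common key.
data = customers.merge(sales, on="id")

# 3. Data selection: keep only the attributes relevant to the analysis.
data = data[["age", "income", "amount"]]

# 4. Data transformation: scale income into the [0, 1] range (min-max).
data["income"] = (data["income"] - data["income"].min()) / (
    data["income"].max() - data["income"].min())

# Steps 5-7 (mining, pattern evaluation, knowledge presentation) would follow,
# e.g., clustering this prepared table and reporting the clusters to the user.
```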
Database, data warehouse, World Wide Web, or other information repository: Databases,
the World Wide Web (WWW), and data warehouses are the data sources. The data in these
sources may be in the form of plain text, spreadsheets, or other forms of media like photos
or videos. The WWW is one of the biggest sources of data.
Database or data warehouse server: The database server contains the actual data ready to
be processed. It performs the task of handling data retrieval as per the request of the user.
Knowledge base: Knowledge Base is an important part of the data mining engine that is
quite beneficial in guiding the search for the result patterns. Data mining engines may also
sometimes get inputs from the knowledge base. This knowledge base may contain data from
user experiences. The objective of the knowledge base is to make the result more accurate
and reliable.
Data mining engine: It is one of the core components of the data mining architecture that
performs all kinds of data mining techniques like association, classification, clustering,
prediction etc.
Pattern evaluation module: This module is responsible for finding interesting patterns in the data
and sometimes also interacts with the database server to produce the results of the
user's requests.
User interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task.
KDD is the whole process of taking raw data, cleaning it, and finding useful information from
it.
Data mining is a part of KDD, where we use special methods to find patterns or trends in the
data.
So, think of KDD as the big picture and Data Mining as a key step in finding hidden insights
2. Clustering:
This is a technique in data mining that involves grouping similar data points together
into clusters or groups.
The aim is to identify patterns and similarities in the data, without prior knowledge
of the structure or classification of the data points.
Clustering can be used in a wide range of applications, for example market
segmentation. There are various clustering algorithms available, but the most
common ones include:
i. K-means
ii. Hierarchical clustering
iii. Density-based clustering
The quality of a clustering result depends on several factors, including the choice of
algorithm, the similarity measure used, and the number of clusters chosen.
Example: A retailer can use clustering to group customers based on their purchasing
behaviour to create marketing strategies
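A minimal K-means sketch, assuming scikit-learn and a small made-up customer dataset (annual spend and number of visits):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers described by [annual spend, number of visits].
X = np.array([
    [200, 4], [220, 5], [250, 6],      # occasional, low-spend customers
    [900, 20], [950, 22], [980, 25],   # frequent, high-spend customers
])

# Group the customers into 2 clusters without any prior labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)            # cluster assigned to each customer
print(kmeans.cluster_centers_)   # centre (mean point) of each cluster
```

Each cluster can then be targeted with its own marketing strategy.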
3. Regression:
Regression can be defined as a statistical modelling method in which previously
obtained data is used to predict a continuous quantity for new observations.
Two types of Regression: 1. Linear regression, 2. Multiple regression
Regression is used in demand forecasting and price optimization.
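A minimal linear regression sketch, assuming scikit-learn and made-up advertising-spend and demand figures:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Previously obtained data: advertising spend (thousands) vs. units sold.
spend = np.array([[10], [20], [30], [40], [50]])
demand = np.array([120, 210, 310, 395, 510])

# Fit a line to the historical data.
model = LinearRegression().fit(spend, demand)

# Predict a continuous quantity (demand) for a new observation.
print(model.predict([[60]]))            # forecast demand for a spend of 60
print(model.coef_, model.intercept_)    # slope and intercept of the fitted line
```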
5. Text mining:
This DM technique involves analyzing & extracting useful information from
unstructured textual data such as emails, customer reviews & news articles.
This technique is commonly used in topic modelling and content classification (determining
the true meaning of words).
Example: A hotel chain can use text mining to analyze customer reviews and identify
areas for improvement in its services.
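A minimal text mining sketch, assuming scikit-learn's TF-IDF vectorizer and a few made-up hotel reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical unstructured customer reviews for a hotel chain.
reviews = [
    "The room was clean but the check-in was slow",
    "Friendly staff and very clean rooms",
    "Slow service at the restaurant, the food was cold",
]

# Convert the free text into a numeric term matrix (TF-IDF weights).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

# Recurring high-weight terms (e.g., "clean", "slow") hint at themes
# the hotel could act on.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```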
6. Neural Networks:
This technique mimics the behavior of the human brain in processing information.
A neural network consists of interconnected nodes or "neurons" that process
information.
These neurons are organized into layers, with each layer responsible for a specific aspect
of the computation.
The input layer receives the input data, and the output layer produces the output of
the network.
The layers between the input and output layers are called "hidden" layers and are
responsible for the intermediate computations.
Neural networks are used in several applications: image recognition, speech recognition, etc.
Example: A self-driving car can use neural networks to identify and respond to different traffic
conditions.
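A minimal neural network sketch, assuming scikit-learn's MLPClassifier and made-up sensor readings as a toy stand-in for the self-driving example:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical readings: [distance to obstacle (m), closing speed (m/s)].
X = np.array([[50, 0], [40, 2], [10, 5], [5, 8], [60, 1], [8, 6]])
y = np.array([0, 0, 1, 1, 0, 1])   # 0 = keep speed, 1 = brake

# Input layer (2 features) -> one hidden layer of 8 neurons -> output layer.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X, y)

print(net.predict([[12, 7]]))      # decision for a new traffic situation
```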
Data Quality: Data mining relies on the quality of the input data. Inaccurate, incomplete or
noisy data can lead to misleading results and make it difficult to discover meaningful patterns.
Data Complexity: Complex datasets with different structures, including unstructured data
like text, images are significant challenges in terms of processing, integration and analysis.
Data Privacy and Security: Data privacy and security is another significant challenge in data
mining. As more data is collected, stored and analyzed, the risk of cyber-attacks increases.
The data may contain personal, sensitive or confidential information that must be protected.
Scalability: Data mining algorithms must be scalable to handle large datasets efficiently. As
the size of the dataset increases, the time and computational resources required to perform
data mining operations also increase.
Interpretability: Data mining algorithms can produce complex models that are difficult to
interpret. This is because algorithms use a combination of statistical and mathematical
techniques to identify patterns and relationships in the data.
2. Performance Issues:
There can be performance-related issues such as follows –
Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amount of data in databases, data mining algorithms
must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − The factors such as huge
size of databases, wide distribution of data, and complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel. The
results from the partitions are then merged.
Financial/Banking Sector: A credit card company can leverage its vast warehouse of
customer transaction data to identify customers most likely to be interested in a new credit
product.
Media: Media channels like radio, television and over-the-top (OTT) platforms keep track of
their audience to understand consumption patterns. Using this information, media providers
make content recommendations, change program schedules.
Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions.
OR
Data Warehouse is a specialized system or database used to store and manage large amounts of
historical data from multiple sources. It is designed to help in the efficient retrieval and analysis of
data for reporting, querying, and decision-making.
A Data Warehouse is separate from DBMS, it stores a huge amount of data, which is typically
collected from multiple heterogeneous sources like files, DBMS, etc.
Data Mining is about digging into data to find insights, patterns, or predictions.
Data Warehouse is a centralized system for storing large amounts of data so it can be
easily accessed and analyzed.
Essentially, Data Mining helps you analyze the data, while a Data Warehouse helps you store and
organize it.
An ordinary Database can store MBs to GBs of data, and that too for a specific purpose. For storing
data of TB size, the storage shifts to a Data Warehouse.
Subject-oriented: A data warehouse can be used to analyse a particular subject area. For
example, “sales” can be a particular subject
Integrated: A data warehouse integrates data from multiple data sources. For example,
source A and source B may have different ways of identifying a product, but in a data
warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This
contrasts with a transaction system, where often only the most recent data is kept. For
example, a transaction system may hold the most recent address of a customer, where a
data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change, so, historical data in a
data warehouse should never be altered.
Business User: To view historical data that has been summarized, business users need access
to the data warehouse.
Archive historical data: Historical, time-variant data must be stored in the data warehouse.
Make strategic choices: Depending on the information in the warehouse, certain strategies may
be implemented. Thus, the data warehouse aids in the process of making strategic choices.
For data quality and consistency: By combining data from several sources into one location,
the user can work efficiently to improve the quality and consistency of the data.
Relational Database System (RDBMS): The bottom tier typically consists of a relational
database system where the data is stored. This is where all the raw data from various
sources is accumulated and managed.
Back-End Tools and Utilities: This layer includes tools used for extracting, cleaning,
transforming, and loading data into the warehouse. These tools ensure that data from
operational databases or external sources (like customer profiles from consultants) are
processed and integrated into the warehouse.
Data Extraction: Data is extracted from different sources using specific tools.
Data Cleaning: Irrelevant or noisy data is removed to maintain quality
Data Transformation: Data from different sources are converted into a unified format.
Data Loading and Refreshing: Periodic updates are made to ensure the warehouse holds the
most current data.
Gateways: Data is extracted using application program interfaces (APIs) called gateways.
These gateways allow client programs to send SQL queries to be executed by the database
server.
Examples of Gateways:
ODBC (Open Database Connectivity): Standard for database access.
OLEDB (Object Linking and Embedding for Databases): A Microsoft-specific API for
database access.
JDBC (Java Database Connectivity): Java-specific API for database access.
Metadata Repository: This component stores detailed information about the data
warehouse structure, such as the data sources, schema, and data transformation rules.
(Metadata is information about the data, like its name, type (e.g., numbers, text), or what
values it can have.)
OLAP Server: The middle tier consists of an Online Analytical Processing (OLAP) server, which
helps with the analysis and multidimensional queries of the data.
Relational OLAP (ROLAP): In this model, an extended relational database
management system (DBMS) maps multidimensional operations (like aggregation)
to standard relational database operations (like SQL queries).
Multidimensional OLAP (MOLAP): This model uses a special-purpose server that
directly implements multidimensional data and operations, optimized for fast data
analysis and aggregation.
OLAP Operations: The OLAP server handles operations like:
Roll-up: Aggregating data along a hierarchy (e.g., summing sales by region).
Drill-down: Breaking data down into finer details (e.g., viewing sales at the store
level).
Slice and Dice: Viewing data from different perspectives or dimensions.
Client Tools: The top tier is where end-users interact with the data warehouse. This layer
contains various client tools for querying and reporting the data.
Query and Reporting Tools: Tools like SQL query builders and reporting applications
allow users to extract information from the data warehouse in a structured format.
Analysis Tools: These tools help in deeper data analysis, such as identifying trends,
patterns, or performing statistical analysis.
Data Mining Tools: These tools apply algorithms to the data to predict future trends
or uncover hidden patterns, such as predictive modeling and trend analysis (e.g.,
identifying which products are likely to be popular in the future).
Bottom Tier: Data storage (relational database) with data extraction, cleaning,
transformation, and loading tools, plus a metadata repository.
Middle Tier: OLAP servers (ROLAP or MOLAP) for multidimensional data analysis.
Top Tier: Client tools for querying, reporting, analysis, and data mining.
From the architecture point of view, there are three data warehouse models or Types of Data
Warehouses Models
1. Enterprise Warehouse
An Enterprise warehouse brings together data from the various functional areas of an
organization in a unified manner.
An enterprise data warehouse structures and stores all of a company's business data for
analytical querying and reporting.
It collects all of the information about subjects spanning the entire organization. The
goal of the Enterprise data warehouse is to provide a complete overview of any
particular object in the data model.
It contains detailed as well as summarized information and can range from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond.
2. Data Mart
It is a data store designed for a particular department of an organization or
company.
Data Mart is a subset of the data warehouse usually oriented to a specific task.
Data that we use for a particular department or purpose is called data mart.
Reasons for creating a data mart:
Easy access to frequently used data
It improves end-user response time
A data mart can be created easily
Lower cost of building a data mart
3. Virtual warehouse
A virtual data warehouse gives you a quick overview of your data. It has metadata
(data which provides information about other data) in it.
It connects to several data sources with the use of middleware
A virtual warehouse is easy to set up, but it requires more database server capacity.
The Multidimensional Data Model is built by following a set of pre-decided stages.
The following stages should be followed by every project for building a Multidimensional Data
Model:
Stage 1: Assembling data from the client: In the first stage, a Multi-Dimensional Data Model collects
the correct data from the client. Mostly, software professionals make clear to the client the range of
data that can be obtained with the selected technology and collect the complete data in detail.
Stage 2: Grouping different segments of the system: In the second stage, the Multi-Dimensional
Data Model recognizes and classifies all the data into the respective sections they belong to, which
makes the model problem-free to apply step by step.
Stage 3: Noticing the different proportions: The third stage forms the basis on which the design
of the system rests. In this stage, the main factors are recognized according to the user's point
of view. These factors are also known as "Dimensions".
Stage 4: Preparing the actual-time factors and their respective qualities: In the fourth stage, the
factors which are recognized in the previous step are used further for identifying the related
qualities. These qualities are also known as “attributes” in the database.
Stage 5: Finding the actuality of factors which are listed previously and their qualities: In the fifth
stage, the Multi-Dimensional Data Model separates the facts (the actuality) from the factors
collected earlier. These facts play a significant role in the arrangement of a Multi-
Dimensional Data Model.
Stage 6: Building the Schema to place the data, with respect to the information collected from
the steps above: In the sixth stage, on the basis of the data which was collected previously, a
Schema is built.
Let us take the example of the data of a factory which sells products per quarter in Bangalore. The
data is represented in the table given below:
In this presentation, the factory's sales for Bangalore are shown with respect to the time dimension,
which is organized into quarters, and the item dimension, which is organized according to the kind of
item sold. The facts (measures) are the sales amounts in rupees (in thousands).
Now, suppose we wish to view the sales data with a third dimension, location (such as Kolkata,
Delhi, Mumbai), in addition to item and time. The 3-D data can first be laid out as a series of
2-D tables, one per location, as in the table below:
This data can be represented in the form of three dimensions conceptually, which is shown in the
image below:
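Since the original tables and the cube diagram are not reproduced here, the following pandas sketch shows the same idea with made-up figures: a flat sales table is first viewed as a 2-D table (item x quarter for Bangalore) and then as a conceptual 3-D cube by adding the location dimension:

```python
import pandas as pd

# Hypothetical flat sales records; "sales" is the fact, in thousands of rupees.
sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["TV", "Phone", "TV", "Phone", "TV", "Phone"],
    "location": ["Bangalore", "Bangalore", "Bangalore", "Bangalore", "Delhi", "Delhi"],
    "sales":    [605, 825, 680, 952, 590, 870],
})

# 2-D view: item vs. quarter for a single location (Bangalore).
view_2d = sales[sales["location"] == "Bangalore"].pivot_table(
    index="item", columns="quarter", values="sales", aggfunc="sum")
print(view_2d)

# 3-D view: adding location gives one 2-D table per city; together these
# tables form the conceptual data cube (item x quarter x location).
cube = sales.pivot_table(index=["location", "item"], columns="quarter",
                         values="sales", aggfunc="sum")
print(cube)
```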
OLAP is a technology used in data warehousing to enable users to analyze and view data
from multiple perspectives.
It helps in making complex analytical queries fast and efficient.
OLAP systems are designed to perform complex queries on large volumes of data in a multi-
dimensional way, meaning that data can be analyzed across different dimensions (e.g., time,
geography, products, etc.).
OLAP allows data to be represented in multiple dimensions, such as time, location, or product
categories.
Data Summarization: OLAP can summarize large amounts of detailed data into aggregated
reports.
To facilitate this kind of analysis, data is collected from multiple sources and stored in data
warehouses, then cleansed and organized into data cubes.
Each OLAP cube contains data categorized by dimensions (such as customers, geographic
sales region and time period) derived from the dimension tables in the data warehouse.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down (Roll down): In the drill-down operation, less detailed data is converted into more
detailed data. It can be done by:
stepping down a concept hierarchy for a dimension, or
adding additional dimensions to the hypercube.
2. Roll up: It is the opposite of the drill-down operation and is also known as a drill-up or
aggregation operation. It is a dimension-reduction technique that performs aggregation on a
data cube. It makes the data less detailed and can be performed by climbing up a concept
hierarchy for a dimension (e.g., City -> Country) or by removing a dimension.
3. Dice: Dice operation is used to generate a new sub-cube from the existing hypercube. It
selects two or more dimensions from the hypercube to generate a new sub-cube for the
given data.
4. Slice: Slice operation is used to select a single dimension of the given cube, fixed at one value,
to generate a new sub-cube. It represents the information from another point of view.
5. Pivot: It is also known as rotation operation as it rotates the current view to get a new view
of the representation. In the sub-cube obtained after the slice operation, performing pivot
operation gives a new view of it.
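A minimal sketch of the five operations using pandas on a small made-up cube (the column names and figures are illustrative only):

```python
import pandas as pd

# Hypothetical cube data: sales by country, city, item and quarter.
df = pd.DataFrame({
    "country": ["India", "India", "India", "India"],
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "item":    ["TV", "Phone", "TV", "Phone"],
    "quarter": ["Q1", "Q1", "Q1", "Q1"],
    "sales":   [590, 870, 640, 910],
})

# Roll up: aggregate from the city level up to the country level.
rollup = df.groupby(["country", "item"])["sales"].sum()

# Drill down: go the other way, from country back down to city detail.
drill = df.groupby(["country", "city", "item"])["sales"].sum()

# Slice: fix one dimension at a single value (item = "TV").
slice_tv = df[df["item"] == "TV"]

# Dice: select on two or more dimensions to get a sub-cube.
dice = df[(df["item"] == "TV") & (df["city"] == "Delhi")]

# Pivot: rotate the view, e.g., items as rows and cities as columns.
pivot = df.pivot_table(index="item", columns="city", values="sales", aggfunc="sum")

print(rollup, drill, slice_tv, dice, pivot, sep="\n\n")
```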
When working with datasets like sales and customer data, missing values for certain attributes (e.g.,
customer income) are common. Here are several ways to handle missing values:
1. Ignore the Tuple: You can simply ignore the record (tuple) that has missing data, especially if
it's missing a class label (e.g., for classification tasks). However, this method isn't effective if
many values are missing or if the missing values vary across attributes.
2. Fill in the Missing Value Manually: Manually enter the missing value; this can be
time-consuming and impractical for large datasets with many missing entries.
3. Use a Global Constant to Fill in the Missing Value: You can replace missing values with a
constant (e.g., “Unknown” or −∞). This is easy but may mislead the mining algorithm, as it
might treat “Unknown” as a meaningful value.
4. Use the Attribute Mean to Fill in the Missing Value: Replace the missing value with the
average value of that attribute across the dataset. For example, if the average income is
$56,000, missing income values are replaced with $56,000.
5. Use the Attribute Mean for the Same Class: Instead of using the global mean, replace the
missing value with the mean for that attribute, but only for the same class. For instance, if
you are classifying customers by credit risk, use the average income of customers within the
same credit risk category to fill in the missing income value.
6. Use the Most Probable Value to Fill in the Missing Value: Predict the missing value based on
other attributes using techniques like regression, decision trees, or Bayesian inference. For
example, a decision tree could predict the missing income value based on other customer
characteristics. This method uses the most data and often provides more accurate results.
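A minimal sketch of options 1, 4 and 5 using pandas, with a made-up customer table:

```python
import numpy as np
import pandas as pd

# Hypothetical customers with some missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high"],
    "income":      [56000, np.nan, 31000, np.nan],
})

# Option 1: ignore (drop) tuples that have missing values.
dropped = df.dropna()

# Option 4: fill with the global attribute mean.
global_mean = df["income"].fillna(df["income"].mean())

# Option 5: fill with the mean of the same class (credit-risk group).
class_mean = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(dropped, global_mean, class_mean, sep="\n\n")
```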
1. Binning: This strategy is fairly easy to comprehend. The sorted data is first split into several
equal-sized parts (bins), and the values in each bin are then smoothed using the values near
them, for example by replacing them with the bin mean or the bin boundaries (a sketch
follows this list).
2. Regression: The data is smoothed by fitting it to a regression function. Regression
may be linear or multiple. Multiple regression has more independent variables than
linear regression, which has only one.
3. Clustering: Groups similar data points into clusters. Values that fall outside these clusters
may be considered outliers, so outliers can be detected by identifying data points that do not
belong to any cluster.
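A minimal sketch of smoothing by bin means (technique 1 above), assuming pandas and example price values:

```python
import pandas as pd

# Example sorted price values to be smoothed.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Split the sorted data into 3 equal-frequency bins.
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: each value is replaced by the mean of its bin.
smoothed = prices.groupby(bins).transform("mean")

print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))
# Bin means here are 9, 22 and 29, so e.g. 4, 8 and 15 all become 9.
```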
Data Integration
Data Integration is the process of combining data from different sources (e.g., databases,
spreadsheets, or files) into a single, unified dataset.
This is often done to make the data easier to analyze, so businesses or organizations can get
a complete picture of their data in one place.
Example: Imagine you have customer data in one database, product data in another, and
sales data in a third. Data integration helps to bring all these together so you can see the
relationship between customers, products, and sales.
While performing data integration, you must deal with data redundancy, inconsistency,
duplication, etc.
1. Schema Integration:
Different data sources may use different names or structures for the same thing
Example: One database might call a customer’s ID "customer_id," and another
database might call it "cust_number." Schema integration helps us figure out that
both refer to the same thing.
2. Redundancy Detection:
Sometimes, we find duplicate data in different places. For example, one database
may store a customer’s address in two places. This can lead to confusion or errors.
We need to find and remove these duplicates to avoid redundancy.
Example: If two records in a database say the same thing about a customer (e.g.,
same name, same address), that's redundant data.
3. Data value conflicts:
When we combine data from different sources, sometimes the values for the same
attribute don’t match. These are called value conflicts.
Example: One source might list the price of a product in USD, and another might list
it in a different currency. We need to convert the prices into the same currency for
consistency.
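A minimal sketch of schema integration, value-conflict resolution and redundancy removal using pandas; the source tables and the conversion rate are made up:

```python
import pandas as pd

# Two hypothetical sources with different schemas for the same entity.
source_a = pd.DataFrame({"customer_id": [1, 2], "price_usd": [10.0, 12.0]})
source_b = pd.DataFrame({"cust_number": [2, 3], "price_inr": [996.0, 1162.0]})

# Schema integration: map different attribute names onto one common schema.
source_b = source_b.rename(columns={"cust_number": "customer_id"})

# Data value conflicts: convert prices to a single currency before combining.
USD_PER_INR = 0.012                       # assumed conversion rate
source_b["price_usd"] = source_b["price_inr"] * USD_PER_INR
source_b = source_b.drop(columns=["price_inr"])

# Redundancy detection: combine the sources and drop duplicate customers.
combined = pd.concat([source_a, source_b]).drop_duplicates(subset="customer_id")
print(combined)
```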
Data Transformation
Data transformation is the process of changing the format, structure, or values of data to
make it suitable for analysis or mining.
It helps in making raw data more useful for extracting patterns and knowledge.
1. Data Smoothing:
Smoothing means removing noise (unwanted or irrelevant data) from the dataset to
make it easier to analyze.
Techniques like binning, regression, and clustering are used to clean and smooth
data by removing inconsistencies and errors.
Example:
In daily sales data, some sales might be recorded incorrectly. Smoothing helps
eliminate these errors, making the data more reliable for analysis.
2. Aggregation
Aggregation involves combining multiple data points into a summary form, such as
calculating totals or averages. This simplifies the data and helps in analyzing broader
trends.
Example:
Daily sales data for the year could be aggregated to calculate monthly or annual
sales totals, which makes it easier to analyze sales trends over time.
3. Generalization
It replaces specific, detailed data with broader categories or concepts to simplify the
dataset. This helps in recognizing patterns at a higher level.
Example:
Street names can be generalized to a city or country level, and specific ages can be
generalized into age groups like youth, middle-aged, and senior.
4. Normalization
Normalization is the process of adjusting data so that it fits within a specific range,
making it consistent across different datasets. This is often needed for machine
learning and analysis.
Example:
Normalizing income data ranging from $12,000 to $98,000 so that it fits within a
scale of 0 to 1 helps ensure that other attributes, like age, don't dominate the
analysis because of their larger range.
Types of Normalization:
Min-max normalization
Scales values to a new range, such as 0 to 1.
Example: With incomes ranging from $12,000 to $98,000, a value of $73,600 is
transformed to (73,600 - 12,000) / (98,000 - 12,000) ≈ 0.716 on the 0.0 to 1.0 scale.
Z-score normalization
Adjusts data based on the mean and standard deviation of the
dataset.
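A minimal sketch of both normalization types using NumPy; apart from the $12,000-$98,000 range and the $73,600 value from the example, the income figures are made up:

```python
import numpy as np

income = np.array([12000.0, 37000.0, 56000.0, 73600.0, 98000.0])

# Min-max normalization: (v - min) / (max - min), mapped onto [0, 1].
min_max = (income - income.min()) / (income.max() - income.min())
print(min_max)    # 73,600 maps to about 0.716

# Z-score normalization: (v - mean) / standard deviation.
z_score = (income - income.mean()) / income.std()
print(z_score)
```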
Data Reduction
When you work with large datasets, the data can become so big and complex that analyzing
it takes too long or even becomes impossible.
Data reduction helps by making the dataset smaller but still keeps the important
information intact.
3. Dimensionality Reduction
This involves reducing the number of features or attributes used to represent the
data without losing important information. Methods like Principal Component
Analysis (PCA) are often used here.
Example:
If you have 100 different attributes (e.g., customer age, income, spending habits),
dimensionality reduction might reduce them to 5 or 10 key components, helping
make the data easier to process (see the sketch at the end of this topic).
4. Numerosity Reduction
This technique involves replacing the actual data with simpler models or
estimations that require less space to store.
Parametric Models: These represent the data using only the essential parameters
(instead of storing all raw data).
Clustering or Sampling: Using representative data points (instead of the entire
dataset) to analyze.
Histograms: Representing data as a summary of its frequency distribution.
Example:
Rather than storing every single sales transaction, you could represent sales data
using a parametric model that describes the overall sales pattern using just a few
numbers (like the average and variance).
Data reduction is about making the data smaller while keeping the important information. It helps
speed up analysis and improves efficiency, making it easier to work with large datasets.
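A minimal sketch of dimensionality reduction (PCA) and numerosity reduction (random sampling), assuming scikit-learn and NumPy with made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical customer table: 200 customers described by 10 attributes.
X = rng.normal(size=(200, 10))

# Dimensionality reduction: keep only 3 principal components per customer.
X_reduced = PCA(n_components=3).fit_transform(X)
print(X_reduced.shape)    # (200, 3) -- fewer attributes, same customers

# Numerosity reduction by sampling: keep a 10% random sample of the rows.
sample = X[rng.choice(len(X), size=20, replace=False)]
print(sample.shape)       # (20, 10) -- fewer records, same attributes
```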
Data Discretization
Data discretization is the process of reducing a continuous attribute (like age, price, or
temperature) into intervals or ranges.
This simplifies the data by replacing many specific values with few general labels (e.g.,
converting age values into age groups like youth, middle-aged, and senior).
Example:
Instead of recording ages like 22, 33, 45, etc., you might group them into categories like 18-
25, 26-40, and 41-60.
The range of the attribute is divided into intervals, and then each value is replaced by its
interval label.
o Before discretization: Age = 33
o After discretization: Age = "26-40"
1. Top-Down (Splitting):
o This method starts by splitting the entire range of data into some initial
intervals (or "cut points"), and then keeps dividing the intervals further.
o Example: Starting with age as 0-100 years, then splitting into smaller
intervals like 0-30, 31-60, 61-100.
2. Bottom-Up (Merging):
o This method starts with all individual values and then merges similar values
together into larger intervals.
o Example: If the data has many age values like 22, 23, 24, it may merge them
into a single group, like "20-25."
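A minimal top-down discretization sketch using pandas, with made-up ages and the age groups from the example above:

```python
import pandas as pd

ages = pd.Series([22, 33, 45, 19, 58, 27, 41])

# Top-down splitting: cut the continuous age range into fixed intervals.
groups = pd.cut(ages, bins=[17, 25, 40, 60], labels=["18-25", "26-40", "41-60"])

print(pd.DataFrame({"age": ages, "age_group": groups}))
# e.g., the value 33 is replaced by the interval label "26-40"
```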
Assignment Questions
2 marks
2. Define KDD
3. Explain clustering
5 marks
10 marks