
Fundamentals of Data Science UNIT 1 - DATA MINING AND DATA WAREHOUSE

UNIT-1
DATA MINING AND DATA WAREHOUSE
Chapter 1: Data Mining
Data science is the process of using data to understand patterns, make decisions, and solve
problems. It combines skills from math, statistics, computer science, and domain knowledge to
analyze large sets of data.

Data scientists clean and organize data, find trends, and use algorithms or models to make
predictions or recommendations.

For example, a food delivery app analyzes your past orders to suggest meals you might like. Data
science helps businesses make smarter decisions by understanding and using data effectively.

Introduction:
We live in a world where vast amounts of data are collected daily. Analysing such data is an
important need.

Moving towards the Information Age:

“We are living in the information age” is a popular saying; however, we are actually living in the data
age. Terabytes (1TB = 1,024 GB) or petabytes (1 PB = 1,024 TB) of data pour into our computer
networks, the World Wide Web (WWW), and various data storage devices every day from business,
society, science and engineering, medicine, and almost every other aspect of daily life.

There is a huge amount of data available in the Information Industry. This data is of no use until it is
converted into useful information. It is necessary to analyze this huge amount of data and extract
useful information from it.

What Is Data Mining?

Data mining is one of the most useful techniques that help entrepreneurs, researchers, and
individuals to extract valuable information from huge sets of data.

Figure 1.2 We are data rich, but information poor.
Figure 1.3 Data mining—searching for knowledge (interesting patterns) in your data.


Data mining is the process of extracting or “mining” knowledge (Information) from large amounts of
data or datasets using techniques such as machine learning and statistical analysis. The data can be
structured, semi-structured, or unstructured, and can be stored in various forms such as databases
and data warehouses.

The goal of data mining is to extract useful information from large datasets and use it to make
predictions or inform decision-making. This involves exploring the data using various techniques such
as clustering, classification, regression analysis, association rule mining, and anomaly detection.

Data mining has a wide range of applications across various industries, including marketing,
finance, healthcare, and telecommunications. Data mining is also called Knowledge Discovery from
Data (KDD). The knowledge discovery process includes Data cleaning, Data integration, Data
selection, Data transformation, Data mining, Pattern evaluation, and Knowledge presentation

The Knowledge Discovery in Databases (KDD) process, shown in Figure 1.4, is an iterative
sequence of the following steps.

KDD (Knowledge Discovery in Databases) is the process of discovering valid, novel, and useful
patterns in large datasets. It involves multiple steps like data selection, cleaning, transformation,
mining, evaluation, and interpretation to extract valuable insights that can guide decision-making.
Data mining is one stage in the KDD process.


1. Data cleaning: Removal of noisy (erroneous) and irrelevant data from the collection.
2. Data integration: Heterogeneous data from multiple sources is combined into a common store
   (a data warehouse). Integration may use data migration tools (transferring data from one
   system to another) and data synchronization tools (the ongoing process of keeping data
   consistent between two or more systems).
3. Data selection: The data relevant to the analysis is decided on and retrieved from the data
   collection. Methods such as decision trees, clustering, and regression (predicting numerical
   values) can support this step.
4. Data transformation: The data is transformed into the appropriate form required by the
   mining procedure.
5. Data mining: The essential step where intelligent methods are applied to the task-relevant
   data in order to extract potentially useful patterns.
6. Pattern evaluation: The quality of the discovered patterns is evaluated, to identify the truly
   interesting patterns representing knowledge based on interestingness measures.
7. Knowledge presentation: Visualization and knowledge-representation techniques are used to
   present the mined knowledge to the user.

Steps 1 to 4 are different forms of data pre-processing, where the data are prepared for mining. The
data mining step may interact with the user or a knowledge base. The interesting patterns are
presented to the user.
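As a rough illustration of the pre-processing steps 1 to 4, the following sketch uses the pandas library. It is not a full KDD implementation; the file names and column names (orders.csv, customers.csv, customer_id, amount, region, order_date) are only assumptions for the example.

```python
import pandas as pd

# 1. Data cleaning: drop duplicates and obviously invalid records
orders = pd.read_csv("orders.csv")             # hypothetical source file
orders = orders.drop_duplicates()
orders = orders[orders["amount"] > 0]          # discard noisy/invalid amounts

# 2. Data integration: combine with a second heterogeneous source
customers = pd.read_csv("customers.csv")       # hypothetical source file
data = orders.merge(customers, on="customer_id", how="inner")

# 3. Data selection: keep only the attributes relevant to the analysis
data = data[["customer_id", "region", "amount", "order_date"]]

# 4. Data transformation: put the data in the form the mining step expects
data["order_month"] = pd.to_datetime(data["order_date"]).dt.to_period("M")
monthly = data.groupby(["region", "order_month"])["amount"].sum().reset_index()

print(monthly.head())   # this summary is what the data mining step (step 5) would work on
```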

Figure 1.5 Architecture of a typical data mining system.


 Database, data warehouse, World Wide Web, or other information repository: These are the
data sources. The data in these sources may be in the form of plain text, spreadsheets, or other
media such as photos or videos. The WWW is one of the biggest sources of data.
 Database or data warehouse server: The database server contains the actual data ready to
be processed. It handles data retrieval as per the request of the user.
 Knowledge base: The knowledge base is an important part of the data mining architecture that
guides the search for the result patterns. Data mining engines may also get inputs from the
knowledge base, which may contain data drawn from user experience. The objective of the
knowledge base is to make the results more accurate and reliable.
 Data mining engine: This is one of the core components of the data mining architecture and
performs the data mining techniques such as association, classification, clustering, and
prediction.
 Pattern evaluation module: This module is responsible for finding interesting patterns in the
data, and it sometimes interacts with the database server to produce the results of user requests.
 User interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task.

Difference between KDD and Data Mining

Summary in Simple Terms:

 KDD is the whole process of taking raw data, cleaning it, and finding useful information from
it.
 Data mining is a part of KDD, where we use special methods to find patterns or trends in the
data.

So, think of KDD as the big picture and data mining as a key step in finding hidden insights.


DBMS vs. Data Mining

| Aspect | DBMS | Data Mining |
| --- | --- | --- |
| Definition | A software system used to manage, store, and retrieve data in databases. | The process of discovering patterns, trends, and useful information from large sets of data. |
| Primary Focus | Data storage, retrieval, and manipulation. | Extracting hidden patterns and knowledge from data. |
| Functionality | Manages databases, queries, and transactions, and ensures data integrity. | Uses statistical algorithms and machine learning to uncover insights in data. |
| Tools | Relational DBMS, NoSQL DBMS, SQL-based systems. | Data mining tools, such as clustering, association, and classification algorithms. |
| Data Type | Structured data, stored in tables (e.g., relational databases). | Both structured and unstructured data. |
| Objective | Efficient storage and management of data. | Identify useful patterns, relationships, or trends in data. |
| Process | Transaction-based (CRUD operations). | Knowledge-discovery-based (pattern recognition, classification). |
| Time Complexity | Generally optimized for fast data retrieval and updates. | Can be computationally intensive depending on the algorithm used. |
| Output | Data in the form of tables, views, or reports. | Patterns, rules, clusters, or predictions extracted from data. |
| Scope | Mainly focused on day-to-day data management. | Focuses on long-term analysis and decision-making based on data trends. |

Data Mining (DM) Techniques


1. Classification:
 This is one of the most widely used techniques in data mining. It involves identifying
patterns in data and labelling data into predefined classes or categories.
 In simple terms, classification is the process of assigning a given data point to a
category or class based on a set of features or attributes. Classification algorithms
are used to build predictive models that can classify new data based on its features.
 Used in fraud detection and customer segmentation.
 Example: A bank can use classification to identify fraudulent transactions based on a
set of predefined attributes such as transaction amount, location, and time.
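A minimal sketch of classification with scikit-learn follows, using a decision tree on a tiny, made-up transaction table (the amounts, hours, and fraud labels are invented for illustration only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labelled transactions: amount, hour of day, 1 = fraud, 0 = genuine
df = pd.DataFrame({
    "amount": [25, 9000, 40, 7500, 60, 8800, 30, 55],
    "hour":   [14, 3,    10, 2,    18, 4,    12, 9],
    "fraud":  [0,  1,    0,  1,    0,  1,    0,  0],
})
X = df[["amount", "hour"]].to_numpy()
y = df["fraud"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)                      # learn the class boundaries

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("new transaction class:", model.predict([[8200, 3]]))   # classify unseen data
```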


2. Clustering:
 This is a technique in data mining that involves grouping similar data points together
into clusters or groups.
 The aim is to identify patterns and similarities in the data, without prior knowledge
of the structure or classification of the data points.
 Clustering can be used in a wide range of applications, for example marketing
segmentation. There are various clustering algorithms available, but the most
common ones include:
i. K-means
ii. Hierarchical clustering
iii. Density-based clustering
 The quality of a clustering result depends on several factors, including the choice of
algorithm, the similarity measure used, and the number of clusters chosen.
 Example: A retailer can use clustering to group customers based on their purchasing
behaviour to create marketing strategies.
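As a small sketch of the retailer example, the K-means algorithm from scikit-learn can group customers by spending behaviour; the numbers below are invented purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers described by (annual spend, number of visits)
customers = np.array([
    [200, 4], [220, 5], [180, 3],        # low spenders
    [950, 30], [1000, 28], [900, 35],    # frequent high spenders
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)   # assign each customer to a cluster

print("cluster labels:", labels)
print("cluster centres:", kmeans.cluster_centers_)
```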


3. Regression:
 Regression is a statistical modelling method in which previously obtained data is
used to predict a continuous quantity for new observations.
 Two types of regression: 1. Linear regression, 2. Multiple regression.
 Regression is used in demand forecasting and price optimization.
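A minimal sketch of linear regression for the demand-forecasting case follows; the spend and sales figures are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (x, in thousands) vs. units sold (y)
X = np.array([[10], [20], [30], [40], [50]])
y = np.array([120, 190, 260, 340, 410])

model = LinearRegression()
model.fit(X, y)                                  # fit y ≈ intercept + slope * x

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted units for a spend of 35:", model.predict([[35]])[0])
```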


4. Association Rule Mining:


 This data mining technique is used to identify patterns or associations among
variables in a large dataset.
 The goal of association rule mining is to discover interesting and meaningful
relationships between variables that can be used to make decisions. It is used in
market-basket analysis (e.g., the rule bread → milk).
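A tiny worked example of the support and confidence of the rule bread → milk, computed directly in Python over made-up transactions:

```python
# Hypothetical market-basket data: one set of items per transaction
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

n = len(transactions)
both  = sum(1 for t in transactions if {"bread", "milk"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support    = both / n       # fraction of all transactions containing bread AND milk
confidence = both / bread   # of the transactions with bread, how many also have milk

print(f"support(bread -> milk)    = {support:.2f}")    # 3/5 = 0.60
print(f"confidence(bread -> milk) = {confidence:.2f}") # 3/4 = 0.75
```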

5. Text Mining:
 This DM technique involves analyzing and extracting useful information from
unstructured textual data such as emails, customer reviews, and news articles.
 This technique is commonly used in topic modelling and content classification
(determining the true meaning of words).
 Example: A hotel chain can use text mining to analyze customer reviews and identify
areas for improvement in its services.
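A very simple sketch of the hotel-review example: counting the most frequent non-trivial words hints at recurring themes. The reviews and stop-word list are invented; real text mining would use a proper NLP library.

```python
from collections import Counter
import re

# Hypothetical customer reviews for a hotel chain
reviews = [
    "The room was clean but the wifi was slow",
    "Slow check-in and slow wifi, staff were friendly",
    "Friendly staff, clean room, great breakfast",
]

stop_words = {"the", "was", "but", "and", "were", "a"}
words = []
for review in reviews:
    tokens = re.findall(r"[a-z\-]+", review.lower())   # crude tokenisation
    words += [w for w in tokens if w not in stop_words]

# The most frequent terms (e.g. "slow", "wifi", "clean") point at areas to improve
print(Counter(words).most_common(5))
```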

6. Neural Networks:
 This technique mimics the behaviour of the human brain in processing information.
 A neural network consists of interconnected nodes, or “neurons”, that process
information.
 These neurons are organized into layers, with each layer responsible for a specific
aspect of the computation.
 The input layer receives the input data, and the output layer produces the output of
the network.
 The layers between the input and output layers are called “hidden” layers and are
responsible for the intermediate computations.


 Neural networks are well suited to tasks such as image recognition and speech recognition.
 Example: A self-driving car can use neural networks to identify and respond to different traffic
conditions.
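The sketch below trains a small feed-forward network (one hidden layer) with scikit-learn on synthetic data. It only illustrates the input/hidden/output layer idea; real systems such as self-driving cars use far larger deep networks and real sensor data.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for sensor readings and a condition label
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# One hidden layer of 8 neurons between the input and output layers
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=1)
net.fit(X_train, y_train)

print("test accuracy:", net.score(X_test, y_test))
```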

Problems in data mining:


1. Poor data quality such as noisy data and insufficient data size.
2. Integrating conflicting data from different sources and forms: multimedia files (audio, video,
images), text, numeric, etc.
3. Security and privacy concerns by individuals, organizations and governments.
4. Unavailability of data or difficult to access.
5. Dealing with huge datasets that require distributed approaches.
6. Dealing with non-static data.
7. Mining information from heterogeneous databases.
8. High cost of buying and maintaining powerful software, servers, and storage hardware that
handle large amounts of data.
9. Processing of large, complex and unstructured data into a structured format.


Challenges of Data mining:


In this section, we will explore various challenges of data mining, as mentioned below:

 Data Quality: Data mining relies on the quality of the input data. Inaccurate, incomplete, or
noisy data can lead to misleading results and make it difficult to discover meaningful patterns.
 Data Complexity: Complex datasets with different structures, including unstructured data
such as text and images, pose significant challenges in terms of processing, integration, and analysis.
 Data Privacy and Security: Data privacy and security is another significant challenge in data
mining. As more data is collected, stored and analyzed, the risk of cyber-attacks increases.
The data may contain personal, sensitive or confidential information that must be protected.
 Scalability: Data mining algorithms must be scalable to handle large datasets efficiently. As
the size of the dataset increases, the time and computational resources required to perform
data mining operations also increases.
 Interpretability: Data mining algorithms can produce complex models that are difficult to
interpret. This is because the algorithms use a combination of statistical and mathematical
techniques to identify patterns and relationships in the data.

Data Mining – Issues


Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available in one place. It needs to be integrated from various heterogeneous data sources. These
factors also create some issues.

 Mining Methodology and User Interaction


 Performance Issues
 Diverse Data Types Issues

1. Mining Methodology and User Interaction Issues:


It refers to the following kinds of issues –
 Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore, it is necessary for data mining
to cover a broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The data mining
process should be interactive, because it is difficult to know in advance exactly what
can be discovered within a database, and interaction allows users to focus the search
for patterns.
 Incorporation of background knowledge − To guide discovery process and to express
the discovered patterns, the background knowledge can be used.
 Presentation and visualization of data mining results − Once the patterns are
discovered it needs to be expressed in high level languages and visual
representations (Graphs, Pie charts, Bar charts, etc)
 Handling noisy or incomplete data − The data cleaning methods are required to
handle the noise and incomplete objects while mining the data. If the data cleaning
methods are not there then the accuracy of the discovered patterns will be poor.
 Pattern evaluation − The patterns discovered should be interesting; patterns that merely
represent common knowledge or lack novelty (expected data or values) are not useful.


2. Performance Issues:
There can be performance-related issues such as follows –
 Efficiency and scalability of data mining algorithms − In order to effectively extract
the information from huge amount of data in databases, data mining algorithm
must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − The factors such as huge
size of databases, wide distribution of data, and complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel. The results
from the partitions are then merged.

3. Diverse Data Types Issues:


 Handling of relational and complex types of data − The database may contain
complex data objects, videos, text, images, audio, etc. It is not possible for one
system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information
systems − The data is available at different data sources on a LAN or WAN. These data
sources may be structured, semi-structured, or unstructured. Therefore, mining
knowledge from them adds challenges to data mining.

Applications of Data Mining:


 Scientific Analysis: Scientific simulations are generating bulks of data every day. This includes
data collected from nuclear laboratories, data about human psychology, etc. Data mining
techniques are capable of the analysis of these data. Now we can capture and store more
new data faster than we can analyse the old data already accumulated.
 Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital
network. Network intrusions often involve stealing valuable network resources. Data mining
technique plays a vital role in searching intrusion detection, network attacks, and anomalies.
These techniques help in selecting and refining useful and relevant information from large
data sets.
 Market Basket Analysis: Market basket analysis is the careful study of the purchases made by
a customer in a supermarket. It identifies the patterns of items that customers frequently
purchase together. This analysis helps companies plan deals, offers, and sales, and data mining
techniques help to achieve this analysis task.
 Education: For analyzing the education sector, data mining uses Educational Data Mining
(EDM) method. This method generates patterns that can be used both by learners and
educators. By using data mining EDM we can perform some educational task:
 Predicting students' admission to higher education
 Profiling students
 Predicting student performance
 Curriculum development
 Predicting student placement opportunities
 Healthcare and Insurance: A pharmaceutical company can examine its recent sales-force
activity and its outcomes to improve the targeting of high-value physicians and determine
which marketing activities will have the greatest effect in the coming months. In the
insurance sector, data mining can help to predict which customers will buy new policies,
identify behaviour patterns of risky customers, and identify fraudulent behaviour of customers.


 Financial/Banking Sector: A credit card company can leverage its vast warehouse of
customer transaction data to identify customers most likely to be interested in a new credit
product.
 Media: Media channels like radio, television and over-the-top (OTT) platforms keep track of
their audience to understand consumption patterns. Using this information, media providers
make content recommendations and adjust program schedules.


Chapter 2: Data Warehouse

Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions.

OR

Data Warehouse is a specialized system or database used to store and manage large amounts of
historical data from multiple sources. It is designed to help in the efficient retrieval and analysis of
data for reporting, querying, and decision-making.

A Data Warehouse is separate from DBMS, it stores a huge amount of data, which is typically
collected from multiple heterogeneous sources like files, DBMS, etc.

Data Mining vs. Data Warehouse

 Data Mining is about digging into data to find insights, patterns, or predictions.
 Data Warehouse is a centralized system for storing large amounts of data so it can be
easily accessed and analyzed.

Essentially, Data Mining helps you analyze the data, while a Data Warehouse helps you store and
organize it.


Need for Data Warehouse

An ordinary database can store MBs to GBs of data, and that too for a specific operational purpose.
For storing data of TB size and beyond, storage shifts to the data warehouse.

Characteristics/key Features of Data Warehouse

 Subject-oriented: A data warehouse can be used to analyse a particular subject area. For
example, “sales” can be a particular subject
 Integrated: A data warehouse integrates data from multiple data sources. For example,
source A and source B may have different ways of identifying a product, but in a data
warehouse, there will be only a single way of identifying a product.
 Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This
contrasts with a transaction system, where often only the most recent data is kept. For
example, a transaction system may hold only the most recent address of a customer, whereas a
data warehouse can hold all addresses associated with a customer.
 Non-volatile: Once data is in the data warehouse, it will not change; historical data in a
data warehouse should never be altered.

Data Warehouse Requirements:

 Business users: To view summarized historical data, business users need access
to the data warehouse.
 Archiving historical data: Historical, time-variant data must be stored in the data warehouse.
 Making strategic choices: Strategies can be formed based on the information in the warehouse;
thus, data warehouses aid the process of making strategic choices.
 Data quality and consistency: By combining data from several sources into one location,
the user can work efficiently to improve the quality of the data.

Advantages of Data Warehouse

 Data warehouses facilitate end users' access to a variety of data.
 They assist in the operation of decision-support applications, such as finding the products
that have sold the most in a specific area over the past two years.
 They are especially helpful for the medium and long term.
 They allow for easier corporate decision-making.
 The productivity of businesses rises.
 Data warehousing is an effective way to handle the need for large amounts of information
from numerous consumers.


A Three-Tier Data Warehouse Architecture

1. Bottom Tier (Warehouse Database Server)

 Relational Database System (RDBMS): The bottom tier typically consists of a relational
database system where the data is stored. This is where all the raw data from various
sources is accumulated and managed.
 Back-End Tools and Utilities: This layer includes tools used for extracting, cleaning,
transforming, and loading data into the warehouse. These tools ensure that data from
operational databases or external sources (like customer profiles from consultants) are
processed and integrated into the warehouse.
 Data Extraction: Data is extracted from different sources using specific tools.
 Data Cleaning: Irrelevant or noisy data is removed to maintain quality
 Data Transformation: Data from different sources are converted into a unified format.
 Data Loading and Refreshing: Periodic updates are made to ensure the warehouse holds the
most current data.
 Gateways: Data is extracted using application program interfaces (APIs) called gateways.
These gateways allow client programs to send SQL queries to be executed by the database
server.
 Examples of Gateways:
 ODBC (Open Database Connectivity): Standard for database access.


 OLEDB (Object Linking and Embedding for Databases): A Microsoft-specific API for
database access.
 JDBC (Java Database Connectivity): Java-specific API for database access.
 Metadata Repository: This component stores detailed information about the data
warehouse structure, such as the data sources, schema, and data transformation rules.
(Metadata is information about the data, like its name, type (e.g., numbers, text), or what
values it can have.)
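The back-end extract–transform–load flow can be sketched very roughly in Python; here SQLite stands in for the warehouse RDBMS, and the file name, table name, and columns (operational_orders.csv, fact_orders, order_id, order_date) are assumptions for the example only:

```python
import sqlite3
import pandas as pd

# Extract: read from a hypothetical operational source
orders = pd.read_csv("operational_orders.csv")

# Transform: clean and convert to the warehouse's unified format
orders = orders.dropna(subset=["order_id"])
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Load/refresh: write into the warehouse database
warehouse = sqlite3.connect("warehouse.db")
orders.to_sql("fact_orders", warehouse, if_exists="append", index=False)
warehouse.close()
```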

2. Middle Tier (OLAP Server)

 OLAP Server: The middle tier consists of an Online Analytical Processing (OLAP) server, which
helps with the analysis and multidimensional queries of the data.
 Relational OLAP (ROLAP): In this model, an extended relational database
management system (DBMS) maps multidimensional operations (like aggregation)
to standard relational database operations (like SQL queries).
 Multidimensional OLAP (MOLAP): This model uses a special-purpose server that
directly implements multidimensional data and operations, optimized for fast data
analysis and aggregation.
 OLAP Operations: The OLAP server handles operations like:
 Roll-up: Aggregating data along a hierarchy (e.g., summing sales by region).
 Drill-down: Breaking data down into finer details (e.g., viewing sales at the store
level).
 Slice and Dice: Viewing data from different perspectives or dimensions.

3. Top Tier (Front-End Client Layer)

 Client Tools: The top tier is where end-users interact with the data warehouse. This layer
contains various client tools for querying and reporting the data.
 Query and Reporting Tools: Tools like SQL query builders and reporting applications
allow users to extract information from the data warehouse in a structured format.
 Analysis Tools: These tools help in deeper data analysis, such as identifying trends,
patterns, or performing statistical analysis.
 Data Mining Tools: These tools apply algorithms to the data to predict future trends
or uncover hidden patterns, such as predictive modeling and trend analysis (e.g.,
identifying which products are likely to be popular in the future).

Summary of Three-Tier Architecture:

 Bottom Tier: Data storage (relational database) with data extraction, cleaning,
transformation, and loading tools, plus a metadata repository.
 Middle Tier: OLAP servers (ROLAP or MOLAP) for multidimensional data analysis.
 Top Tier: Client tools for querying, reporting, analysis, and data mining.


From the architecture point of view, there are three data warehouse models or Types of Data
Warehouses Models

1. Enterprise Warehouse
 An enterprise warehouse brings together the various functional areas of an organization
in a unified manner.
 An enterprise data warehouse structures and stores all of a company's business data for
analytical querying and reporting.
 It collects all of the information about subjects spanning the entire organization. The
goal of the enterprise data warehouse is to provide a complete overview of any
particular object in the data model.
 It contains detailed as well as summarized information and can range from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond.

2. Data Mart
 It is a data store designed for a particular department of an organization or
company.
 Data Mart is a subset of the data warehouse usually oriented to a specific task.
 Data that we use for a particular department or purpose is called data mart.
Reasons for creating a data mart:
 Easy access to frequently used data
 It improves end-user response time
 A data mart can be created easily
 Lower cost of building a data mart

3. Virtual warehouse
 A virtual data warehouse gives you a quick overview of your data. It has metadata
(data which provides information about other data) in it.
 It connects to several data sources with the use of middleware
 A virtual warehouse is easy to set up, but it requires more database server capacity.

Multidimensional Data Model


o The Multidimensional Data Model is a method used for organizing data in the database,
with a good arrangement and assembly of the contents of the database.
o The Multidimensional Data Model allows customers to ask analytical questions
associated with market or business trends, unlike relational databases, which allow
customers to access data only in the form of queries.
o It allows users to rapidly receive answers to their requests by creating and examining
the data comparatively fast.
o OLAP (online analytical processing) and data warehousing use multidimensional
databases. The model is used to show multiple dimensions of the data to users.
o It represents data in the form of data cubes. Data cubes allow us to model and view the data
from many dimensions and perspectives. A cube is defined by dimensions and facts and is
represented by a fact table.
o Facts are numerical measures, and fact tables contain the measures of the related dimension
tables or the names of the facts.


Working on a Multidimensional Data Model

The Multidimensional Data Model works on the basis of pre-decided steps.
The following stages should be followed by every project for building a Multidimensional Data
Model:
Stage 1: Assembling data from the client: In the first stage, a Multidimensional Data Model collects
the correct data from the client. Mostly, software professionals make clear to the client the range of
data that can be gained with the selected technology, and collect the complete data in detail.
Stage 2: Grouping different segments of the system: In the second stage, the Multidimensional
Data Model recognizes and classifies all the data into the respective sections they belong to, which
also makes it problem-free to apply step by step.

Stage 3: Noticing the different dimensions: The third stage is the basis on which the design
of the system rests. In this stage, the main factors are recognized according to the user's point
of view. These factors are known as "dimensions".

Stage 4: Preparing the factors and their respective qualities: In the fourth stage, the
factors recognized in the previous step are used for identifying their related
qualities. These qualities are known as "attributes" in the database.

Stage 5: Finding the facts among the factors listed previously and their qualities: In the fifth
stage, a Multidimensional Data Model separates and identifies the facts from the factors and
attributes collected earlier. These facts play a significant role in the arrangement of a
Multidimensional Data Model.

Stage 6: Building the schema to place the data, with respect to the information collected from
the steps above: In the sixth stage, on the basis of the data collected previously, a
schema is built.


Let us take the example of the data of a factory which sells products per quarter in Bangalore. The
data is represented in the table given below:

In this presentation, the factory's sales for Bangalore are shown with respect to the time dimension,
which is organized into quarters, and the item dimension, which is organized according to the kind of
item sold. The facts are represented in rupees (in thousands).

Now, if we wish to view the sales data with a third dimension, location (for example, Kolkata, Delhi,
Mumbai), the same two-dimensional table is repeated once for each location. Conceptually, this data
can then be represented in the form of a three-dimensional data cube (shown in the figure in the
original notes), and a small sketch of how such a cube-style summary can be built follows.
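The sketch below builds a small cube-like summary with pandas; since the original tables were figures, the sales figures, items, and cities here are made-up placeholders:

```python
import pandas as pd

# Hypothetical fact records: sales (in thousands of rupees) by location, quarter, item
sales = pd.DataFrame({
    "location": ["Bangalore"] * 4 + ["Delhi"] * 4,
    "quarter":  ["Q1", "Q2", "Q3", "Q4"] * 2,
    "item":     ["entertainment"] * 8,
    "amount":   [605, 680, 812, 927, 560, 590, 682, 870],
})

# A 2-D view: items vs. quarters (one location at a time)
cube2d = sales[sales["location"] == "Bangalore"].pivot_table(
    index="item", columns="quarter", values="amount", aggfunc="sum")
print(cube2d)

# Adding the location dimension gives the conceptual 3-D cube described above
cube3d = sales.pivot_table(index=["location", "item"], columns="quarter",
                           values="amount", aggfunc="sum")
print(cube3d)
```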


Features of multidimensional data models:


 Measures: Measures are numerical data that can be analyzed and compared, such as sales
or revenue. They are typically stored in fact tables in a multidimensional data model.
 Dimensions: Dimensions are attributes that describe the measures, such as time, location, or
product. They are typically stored in dimension tables in a multidimensional data model.
 Cubes: Cubes are structures that represent the multidimensional relationships between
measures and dimensions in a data model. They provide a fast and efficient way to retrieve
and analyze data.
 Aggregation: Aggregation is the process of summarizing data across dimensions and levels
of detail. This is a key feature of multidimensional data models, as it enables users to quickly
analyze data at different levels of granularity
 Drill-down and roll-up: Drill-down is the process of moving from a higher-level summary of
data to a lower level of detail, while roll-up is the opposite process of moving from a lower-
level detail to a higher level summary. These features enable users to explore data in greater
detail and gain insights into the underlying patterns.
 Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For
example, a time dimension might be organized into years, quarters, months, and days.
Hierarchies provide a way to navigate the data and perform drill-down and roll-up
operations.

Advantages of Multi-Dimensional Data Model


 A multi-dimensional data model is easy to handle.
 It is easy to maintain.
 Its performance is better than that of normal databases (e.g. relational databases).
 The representation of data is better than traditional databases. That is because the multi-
dimensional databases are multi-viewed and carry different types of factors.

Disadvantages of Multi-Dimensional Data Model

 The multi-dimensional data model is somewhat complicated in nature, and it requires
professionals to recognize and examine the data in the database.
 When the system crashes or its caches are lost, the working of the system is greatly
affected.
 Because of its complicated nature, the databases are generally dynamic in design.
 As the multi-dimensional data model involves complicated systems spanning a large
number of databases, the system is very insecure when there is a security
breach.


OLAP (Online Analytical Processing)

 OLAP is a technology used in data warehousing to enable users to analyze and view data
from multiple perspectives.
 It helps in making complex analytical queries fast and efficient.
 OLAP systems are designed to perform complex queries on large volumes of data in a multi-
dimensional way, meaning that data can be analyzed across different dimensions (e.g., time,
geography, products, etc.).
 OLAP allows data to be represented in multiple dimensions, such as time, location, or product
categories.
 Data Summarization: OLAP can summarize large amounts of detailed data into aggregated
reports.

How OLAP systems work

 To facilitate this kind of analysis, data is collected from multiple sources and stored in data
warehouses, then cleansed and organized into data cubes.
 Each OLAP cube contains data categorized by dimensions (such as customers, geographic
sales region, and time period) derived from the dimension tables in the data warehouse.

OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:

1. Drill down (roll down): In the drill-down operation, less detailed data is converted into
more detailed data. It can be implemented by:
 Stepping down a concept hierarchy for a dimension, or
 Adding additional dimensions to the hypercube.

(Quarter -> Month)


2. Roll up: It is the opposite of the drill-down operation and is also known as a drill-up or
aggregation operation. It is a dimension-reduction technique that performs aggregation on a
data cube. It makes the data less detailed and it can be performed by combining similar
dimensions across any axis. (City -> Country).

3. Dice: Dice operation is used to generate a new sub-cube from the existing hypercube. It
selects two or more dimensions from the hypercube to generate a new sub-cube for the
given data.


4. Slice: Slice operation is used to select a single dimension from the given cube to generate a
new sub-cube. It represents the information from another point of view.

5. Pivot: It is also known as rotation operation as it rotates the current view to get a new view
of the representation. In the sub-cube obtained after the slice operation, performing pivot
operation gives a new view of it.
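The five operations can be approximated on a flat fact table with pandas; the cube data below (countries, cities, quarters, items, amounts) is invented for illustration:

```python
import pandas as pd

# Hypothetical sales cube stored as a flat fact table
sales = pd.DataFrame({
    "country": ["India", "India", "India", "India", "USA", "USA"],
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai", "NYC", "NYC"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "item":    ["phone", "phone", "laptop", "phone", "laptop", "phone"],
    "amount":  [100, 120, 300, 150, 400, 200],
})

# Roll-up: aggregate from city level up to country level
rollup = sales.groupby(["country", "quarter"])["amount"].sum()

# Drill-down: go back down to the finer city level
drilldown = sales.groupby(["country", "city", "quarter"])["amount"].sum()

# Slice: fix a single dimension value (quarter = Q1)
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once
dice = sales[sales["quarter"].isin(["Q1", "Q2"]) & (sales["item"] == "phone")]

# Pivot: rotate the view, e.g. items as rows and quarters as columns
pivot = sales.pivot_table(index="item", columns="quarter", values="amount", aggfunc="sum")

print(rollup, drilldown, slice_q1, dice, pivot, sep="\n\n")
```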

Types of OLAP Systems


1. Multidimensional OLAP (MOLAP):
 Storage Model: MOLAP systems store data in a multidimensional cube format.
 Performance: MOLAP provides fast query performance, as it pre-aggregates data
and stores it in a structured cube.
 Advantages: Quick query response times, efficient for complex calculations, and
well-suited for scenarios where data does not frequently change

2. Relational OLAP (ROLAP):


 Storage Model: ROLAP systems store data in relational databases, typically using
tables and joins.
 Performance: ROLAP offers flexibility but may have slower query response times
compared to MOLAP, as it calculates aggregations on the fly.
 Advantages: Well-suited for large datasets and scenarios where data is subject to
frequent updates.

3. Hybrid OLAP (HOLAP):


 Combination: HOLAP systems combine elements of both MOLAP and ROLAP
approaches.
 Storage Model: HOLAP may store summarized data in a multidimensional cube
(MOLAP) for faster query performance, while detailed data is stored in relational
databases (ROLAP).
 Advantages: Seeks to balance the strengths of MOLAP and ROLAP, providing both
fast query performance and flexibility.


Data Cleaning (or Data Cleansing)


 Data cleaning is a process that helps improve the quality of real-world data. Real-world data
is often messy, meaning it can be incomplete, noisy, and inconsistent.
 Data cleaning helps make the data more accurate and useful for analysis.
 Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly
formatted, duplicated, or insufficient data from a dataset.
 There are numerous ways for data to be duplicated or incorrectly labelled when merging
multiple data sources.
 Data cleaning generally reduces errors and improves the quality of the data.
 Fixing data errors and eliminating false information may be a hard and time-consuming
process, but it is necessary.
 Many data mining techniques can be applied to the task of data purification.

Handling Missing Values in Data Cleaning

When working with datasets like sales and customer data, missing values for certain attributes (e.g.,
customer income) are common. Here are several ways to handle missing values:

1. Ignore the Tuple: You can simply ignore the record (tuple) that has missing data, especially if
it's missing a class label (e.g., for classification tasks). However, this method isn't effective if
many values are missing or if the missing values vary across attributes.
2. Fill in the Missing Value Manually: Manually entering the missing value, which could be
time-consuming and impractical for large datasets with many missing entries.
3. Use a Global Constant to Fill in the Missing Value: You can replace missing values with a
constant (e.g., “Unknown” or −∞). This is easy but may mislead the mining algorithm, as it
might treat “Unknown” as a meaningful value.
4. Use the Attribute Mean to Fill in the Missing Value: Replace the missing value with the
average value of that attribute across the dataset. For example, if the average income is
$56,000, missing income values are replaced with $56,000.
5. Use the Attribute Mean for the Same Class: Instead of using the global mean, replace the
missing value with the mean for that attribute, but only for the same class. For instance, if
you are classifying customers by credit risk, use the average income of customers within the
same credit risk category to fill in the missing income value.
6. Use the Most Probable Value to Fill in the Missing Value: Predict the missing value based on
other attributes using techniques like regression, decision trees, or Bayesian inference. For
example, a decision tree could predict the missing income value based on other customer
characteristics. This method uses the most data and often provides more accurate results.
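Options 3 to 5 can be sketched with pandas as follows; the customer table (credit_risk, income) and its values are invented for illustration:

```python
import pandas as pd

# Hypothetical customer table with some incomes missing
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [56000, None, 30000, None, 62000],
})

# Option 3: fill with a global constant
const_filled = df["income"].fillna(-1)

# Option 4: fill with the overall attribute mean
mean_filled = df["income"].fillna(df["income"].mean())

# Option 5: fill with the mean of the same class (credit_risk group)
class_filled = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(mean_filled.tolist())
print(class_filled.tolist())
```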


Data smoothing techniques (using some technical terms):

1. Binning: This strategy is fairly easy to comprehend. The sorted data is first split into several
equal-sized parts (bins); the values in each bin are then smoothed using the neighbouring
values in that bin, for example by the bin mean, bin median, or bin boundaries.
2. Regression: The data is smoothed out by fitting it to a regression function. Regression
may be linear or multiple. Multiple regression has more independent variables than
linear regression, which has only one.
3. Clustering: Groups similar data points into clusters. Values that fall outside these clusters
may be considered outliers, so clustering can detect outliers by identifying data points that
do not belong to any cluster.
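A small sketch of smoothing by bin means, assuming equal-frequency bins (the price list is the usual textbook-style example of nine sorted values):

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])   # already sorted

# Split the values into 3 equal-frequency bins
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace every value with the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```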

Data cleaning process:


1. Discrepancy Detection:
 Finding errors or inconsistencies in data.
 Causes: Mistakes during entry, outdated info, inconsistent formats.
 How to find: Use knowledge about the data (metadata), check for outliers, missing
values, and unusual patterns.
 Tools: Data scrubbing and auditing tools help detect and fix common errors.
2. Data Transformation:
 Correcting errors by changing the data (e.g., renaming, reformatting).
 Tools: Data migration and ETL tools allow easy transformation; custom scripts can
be used for complex fixes.
 Challenges: Fixing one error may cause others, so it requires multiple rounds.
3. Iteration and Interactivity:
 Continuously refining and improving the data cleaning process.
 Tools: Tools like Potter’s Wheel allow step-by-step error fixing and immediate
feedback.
4. Updating Metadata:
 Keep track of what you learn about the data.
 Goal: Makes future data cleaning faster and easier.


Data Integration
 Data Integration is the process of combining data from different sources (e.g., databases,
spreadsheets, or files) into a single, unified dataset.
 This is often done to make the data easier to analyze, so businesses or organizations can get
a complete picture of their data in one place.
 Example: Imagine you have customer data in one database, product data in another, and
sales data in a third. Data integration helps to bring all these together so you can see the
relationship between customers, products, and sales.

 While performing data integration, you must deal with data redundancy, inconsistency,
duplication, etc.

There are three issues to consider during data integration:

1. Schema Integration:
 Different data sources may use different names or structures for the same thing
 Example: One database might call a customer’s ID "customer_id," and another
database might call it "cust_number." Schema integration helps us figure out that
both refer to the same thing.
2. Redundancy Detection:
 Sometimes, we find duplicate data in different places. For example, one database
may store a customer’s address in two places. This can lead to confusion or errors.
 We need to find and remove these duplicates to avoid redundancy.
 Example: If two records in a database say the same thing about a customer (e.g.,
same name, same address), that's redundant data.
3. Data value conflicts:
 When we combine data from different sources, sometimes the values for the same
attribute don’t match. These are called value conflicts.
 Example: One source might list the price of a product in USD, and another might list
it in a different currency. We need to convert the prices into the same currency for
consistency.
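The three issues can be sketched with a small pandas merge; the column names, ids, prices, and the exchange rate below are all assumptions for illustration:

```python
import pandas as pd

# Source A uses "customer_id"; source B uses "cust_number" (schema integration)
a = pd.DataFrame({"customer_id": [1, 2], "price_usd": [10.0, 25.0]})
b = pd.DataFrame({"cust_number": [1, 2], "price_inr": [830.0, 2075.0]})

b = b.rename(columns={"cust_number": "customer_id"})   # unify the schema
merged = a.merge(b, on="customer_id", how="outer")

# Data value conflict: convert INR to USD (assumed rate) so the units agree
ASSUMED_INR_PER_USD = 83.0
merged["price_inr_as_usd"] = merged["price_inr"] / ASSUMED_INR_PER_USD

# Redundancy detection: drop exact duplicate rows
merged = merged.drop_duplicates()
print(merged)
```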


Data Transformation
 Data transformation is the process of changing the format, structure, or values of data to
make it suitable for analysis or mining.
 It helps in making raw data more useful for extracting patterns and knowledge.

Types of Data Transformation

1. Data Smoothing:
 Smoothing means removing noise (unwanted or irrelevant data) from the dataset to
make it easier to analyze.
 Techniques like binning, regression, and clustering are used to clean and smooth
data by removing inconsistencies and errors.
 Example:
In daily sales data, some sales might be recorded incorrectly. Smoothing helps
eliminate these errors, making the data more reliable for analysis.

2. Aggregation
 Aggregation involves combining multiple data points into a summary form, such as
calculating totals or averages. This simplifies the data and helps in analyzing broader
trends.
 Example:
Daily sales data for the year could be aggregated to calculate monthly or annual
sales totals, which makes it easier to analyze sales trends over time.

3. Generalization
 It replaces specific, detailed data with broader categories or concepts to simplify the
dataset. This helps in recognizing patterns at a higher level.
 Example:
Street names can be generalized to a city or country level, and specific ages can be
generalized into age groups like youth, middle-aged, and senior.

4. Normalization
 Normalization is the process of adjusting data so that it fits within a specific range,
making it consistent across different datasets. This is often needed for machine
learning and analysis.
 Example:
Normalizing income data ranging from $12,000 to $98,000 so that it fits within a
scale of 0 to 1 helps ensure that other attributes, like age, don't dominate the
analysis because of their larger range.
 Types of Normalization:
 Min-max normalization
 Scales values to a new range, such as 0 to 1.
 Example: A value of $73,600 for income would be transformed to fit
within a range of 0.0 to 1.0.
 Z-score normalization
 Adjusts data based on the mean and standard deviation of the
dataset.


Example: A value of $73,600 with a mean income of $54,000 and a standard
deviation of $16,000 would be normalized to a z-score of 1.225.
 Decimal scaling
 Normalizes by moving the decimal point based on the maximum
absolute value.
 Example: Values ranging from -986 to 917 would be divided by 1,000
to normalize to the range of -0.986 to 0.917.
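The three normalization methods can be sketched in a few lines of Python; the income array is a small made-up sample (note that the notes' z-score example of 1.225 assumes the quoted mean of $54,000 and std of $16,000, whereas here we compute the statistics from the sample itself):

```python
import numpy as np

income = np.array([12000, 54000, 73600, 98000], dtype=float)

# Min-max normalization to [0, 1]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization (subtract the mean, divide by the standard deviation)
z_score = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j, where j makes the largest |value| fall below 1
j = int(np.ceil(np.log10(np.abs(income).max())))
decimal_scaled = income / (10 ** j)

print(min_max.round(3))        # 73,600 -> (73600-12000)/(98000-12000) ≈ 0.716
print(z_score.round(3))
print(decimal_scaled)
```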

5. Attribute Construction (or Feature Construction)


 It creates new attributes from existing ones to improve the quality of analysis and
reveal hidden patterns.
 Example:
If you have height and width, a new attribute called "area" can be created by
multiplying the two. This new attribute can provide more meaningful insights in the
analysis.

In Summary:

 Smoothing helps clean the data by removing noise and inconsistencies.


 Aggregation simplifies data by summarizing it into totals or averages.
 Generalization simplifies detailed data by grouping it into broader categories.
 Normalization adjusts data to a common range, making it comparable across attributes.
 Attribute Construction creates new features from existing data to improve analysis.

Data Reduction
 When you work with large datasets, the data can become so big and complex that analyzing
it takes too long or even becomes impossible.
 Data reduction helps by making the dataset smaller but still keeps the important
information intact.

Here are the key strategies for data reduction:

1. Data Cube Aggregation


This involves grouping data into summary values by applying aggregation
operations (like sum, average, etc.).
Example:
If you have daily sales data for the entire year, you can aggregate it by month or by
quarter. Instead of looking at 365 days of data, you can work with only 12 months,
making the analysis simpler and faster.

2. Attribute Subset Selection


This technique helps you remove unnecessary or irrelevant information from the
data, focusing only on what really matters for the analysis.
Example:
If you are analyzing customer data to predict purchasing behavior, information like
favorite color or favorite TV show might be irrelevant. You can remove these
attributes to reduce the size of your data and improve efficiency.


3. Dimensionality Reduction
This involves reducing the number of features or attributes used to represent the
data without losing important information. Methods like Principal Component
Analysis (PCA) are often used here.
Example:
If you have 100 different attributes (e.g., customer age, income, spending habits),
dimensionality reduction might reduce them to 5 or 10 key components, helping
make the data easier to process.
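A minimal PCA sketch with scikit-learn follows; the customer attribute values are invented, and in practice the features would usually be standardized before applying PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical customer attributes: age, income, spend, visits (4 features)
X = np.array([
    [25, 30000, 1200, 10],
    [32, 45000, 2100, 14],
    [47, 80000, 5200, 30],
    [51, 82000, 5000, 28],
    [23, 28000, 1100, 9],
])

# Reduce 4 attributes to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (5, 2)
print(pca.explained_variance_ratio_)   # how much information each component keeps
```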

4. Numerosity Reduction
This technique involves replacing the actual data with simpler models or
estimations that require less space to store.
Parametric Models: These represent the data using only the essential parameters
(instead of storing all raw data).
Clustering or Sampling: Using representative data points (instead of the entire
dataset) to analyze.
Histograms: Representing data as a summary of its frequency distribution.
Example:
Rather than storing every single sales transaction, you could represent sales data
using a parametric model that describes the overall sales pattern using just a few
numbers (like the average and variance).

5. Discretization and Concept Hierarchy Generation


This technique involves converting raw data into ranges or higher-level categories
to reduce complexity.
Discretization groups continuous data (like income) into ranges (e.g., low, medium,
high income).
Concept Hierarchy Generation creates levels of abstraction (e.g., instead of a street
address, you use the city or country).
Example:
If you're analyzing income, rather than looking at every exact value, you can group it
into ranges like under $30k, $30k-$60k, and over $60k, making the data easier to
work with.

Data reduction is about making the data smaller while keeping the important information. It helps
speed up analysis and improves efficiency, making it easier to work with large datasets.


Data Discretization
Data discretization is the process of reducing a continuous attribute (like age, price, or
temperature) into intervals or ranges.
This simplifies the data by replacing many specific values with few general labels (e.g.,
converting age values into age groups like youth, middle-aged, and senior).
Example:
Instead of recording ages like 22, 33, 45, etc., you might group them into categories like 18-
25, 26-40, and 41-60.
The range of the attribute is divided into intervals, and then each value is replaced by its
interval label.
o Before discretization: Age = 33
o After discretization: Age = "26-40"
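The age example maps directly onto pandas' cut function, which replaces values with interval labels; the cut points and labels below are chosen only for illustration:

```python
import pandas as pd

ages = pd.Series([22, 33, 45, 61, 18, 70, 39])

# Replace exact ages with interval labels (top-down splitting with fixed cut points)
groups = pd.cut(ages, bins=[0, 25, 40, 60, 100],
                labels=["18-25", "26-40", "41-60", "61+"])
print(list(zip(ages, groups)))   # e.g. 33 -> "26-40"
```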

Top-Down vs. Bottom-Up Discretization:

1. Top-Down (Splitting):
o This method starts by splitting the entire range of data into some initial
intervals (or "cut points"), and then keeps dividing the intervals further.
o Example: Starting with age as 0-100 years, then splitting into smaller
intervals like 0-30, 31-60, 61-100.

2. Bottom-Up (Merging):
o This method starts with all individual values and then merges similar values
together into larger intervals.
o Example: If the data has many age values like 22, 23, 24, it may merge them
into a single group, like "20-25."


Assignment Questions

2 marks

1. What is data mining?

2. Define KDD

3. Explain clustering

4. Explain Association rule mining

5. Define data warehouse

6. List out the Application of Data mining

7. List out OLAP operations

5 marks

1. Explain the process of knowledge Discovery in data mining

2. Differentiate between DBMS and Data mining

3. Differentiate between Data Warehouse and Data mining

4. Explain problems in data mining

5. Explain OLAP operations

6. Explain Data Transformation

10 marks

1. Explain in detail about Data mining techniques

2. Explain with a neat diagram 3 tier architecture of data warehouse

3. Explain Multidimensional Data model

4. Explain Data Reduction

