Report On Principles of Fragmentation in Computer Science

The document discusses various concepts related to data mining and data warehousing, including KDD, data mining techniques, web mining, text mining, and the architecture of data warehouses. It outlines the steps involved in designing a data warehouse, the Apriori algorithm for finding frequent itemsets, and various data mining functionalities such as classification, regression, clustering, and anomaly detection. The content emphasizes the importance of structured data management and analysis for decision-making processes.

Uploaded by manishbej2017

23/0

Manish
2060 IT 6th
DWD PEC-IT602B-N

Q1. i) KDD stands for Knowledge Discovery
in Databases. It is the overall process of
discovering useful knowledge from data.
KDD involves several steps, including
data selection, preprocessing,
transformation, data mining, and
interpretation/evaluation of the
discovered patterns.
ii) Data mining is a core step in the KDD
process. It is the application of specific
algorithms for extracting patterns from
data. Data mining techniques include
classification, clustering, association rule
mining, regression, and others. The goal
of data mining is to uncover hidden
patterns, relationships, and insights from
large datasets.
iii) Web mining is the application of data
mining techniques to discover patterns
from web data, including web content,
structure, and usage data. It involves
three main categories:
- Web content mining: Extracting useful
information from the content of web
pages, such as text, images, and
multimedia.
- Web structure mining: Analysing the
structure of web pages and the
hyperlinks between them to uncover
patterns and relationships.
- Web usage mining: Analysing web
server logs and user browsing behaviour
to understand user preferences and
navigation patterns.
iv) Text mining, also known as text data
mining or text analytics, refers to the
process of extracting meaningful
insights and patterns from unstructured
text data. It involves techniques such as
natural language processing,
information retrieval, and machine
learning to analyse and interpret large
collections of textual data, such as
documents, emails, social media posts,
and more.
v) A data warehouse is a centralized
repository designed to store and
manage large volumes of historical data
from various sources within an
organization. It integrates, consolidates,
and organizes data from different
operational systems, making it easier to
analyse and report on for decision-
making purposes. Data warehouses
typically follow a subject-oriented,
integrated, time-variant, and non-
volatile design.
vi) OLAP stands for Online Analytical
Processing. It is a category of software
technologies and systems that enable
users to analyse multidimensional data
interactively from multiple perspectives.
OLAP provides mechanisms for complex
analytical queries, data summarization,
and visualization, supporting decision-
making processes.
OLTP stands for Online Transaction
Processing. It refers to the systems and
technologies that facilitate and manage
transaction-oriented applications,
typically used for data entry and
retrieval operations in operational
systems, such as order processing,
inventory management, and financial
transactions. OLTP systems are designed
to handle a high volume of concurrent
transactions with high reliability,
availability, and performance.
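The contrast in item vi) can be sketched
with Python's built-in sqlite3 module.
This is only an illustration: the orders
table and the two helper functions are
hypothetical, and a real OLAP system
would run against a separate warehouse
rather than the operational table.

```python
import sqlite3

# Hypothetical in-memory table; every name here is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, region TEXT, product TEXT, amount REAL)"
)


def place_order(region, product, amount):
    """OLTP-style work: one short transaction that writes a single row."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO orders (region, product, amount) VALUES (?, ?, ?)",
            (region, product, amount),
        )
    return cur.lastrowid


def sales_by_region():
    """OLAP-style work: a read-only query that summarizes many rows."""
    return conn.execute(
        "SELECT region, COUNT(*), SUM(amount) FROM orders "
        "GROUP BY region ORDER BY region"
    ).fetchall()


place_order("East", "pen", 2.5)
place_order("East", "book", 10.0)
place_order("West", "pen", 2.5)
summary = sales_by_region()
print(summary)
```

The write path touches one row per call,
while the summary query reads every row
and groups it, which is the workload
difference the two definitions describe.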
Q2. Explain about the Three-tier data
warehouse architecture with a neat
diagram.
Ans..) The Three-tier data warehouse
architecture is a widely used design
pattern for building scalable and efficient
data warehouses. In the layered view used
here, it separates the components of a
data warehouse into three distinct layers:
the bottom tier (data staging), the middle
tier (the core data warehouse), and the
top tier (data marts). Many textbooks
instead name the three tiers as the
warehouse database server, the OLAP
server, and the front-end client tools.
Separating the layers lets each one be
managed, scaled, and tuned on its own.
The diagram below shows the layers,
followed by a detailed explanation of
each:
[Top tier]     Data Marts
               (Subject-Oriented, Summarized Data)
                        ^
                        |
[Middle tier]  Core Data Warehouse
               (Integrated, Subject-Independent Data)
                        ^
                        |
[Bottom tier]  Data Staging Area
               (Cleansing, Transformation, Temporary Storage)
                        ^
                        |
               Operational Data Sources
               (Databases, Files, External Data)

Bottom Tier - Data Staging Area:
1. Data Extraction: This layer is
responsible for extracting data from
various operational data sources, such
as transactional databases, legacy
systems, flat files, and external data
providers. Extraction processes are
typically scheduled to run at regular
intervals (e.g., nightly, weekly) or
triggered by events (e.g., new data
arrival).
2. Data Cleansing and Transformation:
The staging area performs essential
data quality tasks, including data
cleansing, deduplication, error
handling, and data transformation.
Common transformations include:
- Removing or fixing invalid or
missing data
- Standardizing data formats (e.g.,
date, currency, addresses)
- Handling slowly changing
dimensions
- Applying business rules and data
validation checks
3. Temporary Storage: The staging
area acts as a temporary storage area
for the extracted and transformed data
before loading it into the core data
warehouse. This allows for better
management of data flows and enables
parallel processing of different data
streams.
4. Metadata Management: Metadata,
which is data about data, is captured
and managed in the staging area. This
includes information about data
sources, transformations, data quality
rules, and other metadata that
supports data lineage and auditing.
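The cleansing and transformation work
of the staging area can be sketched as
follows. This is a minimal illustration,
not a full ETL tool: the sample rows,
the two accepted date formats, and the
function names are all assumptions made
for the example.

```python
from datetime import datetime

# Hypothetical raw rows as they might arrive from two source systems.
RAW_ROWS = [
    {"customer_id": "101", "signup": "2024-01-15", "country": " in "},
    {"customer_id": "101", "signup": "2024-01-15", "country": "IN"},  # duplicate
    {"customer_id": "102", "signup": "15/02/2024", "country": "us"},  # other format
    {"customer_id": "", "signup": "2024-03-01", "country": "UK"},     # missing key
    {"customer_id": "103", "signup": "not a date", "country": "DE"},  # invalid date
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed source formats


def parse_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def cleanse(rows):
    """Validate, standardize, and deduplicate rows; also return the rejects."""
    clean, rejected, seen = [], [], set()
    for row in rows:
        key = row["customer_id"].strip()
        signup = parse_date(row["signup"].strip())
        if not key or signup is None:
            rejected.append(row)  # kept aside for error handling
            continue
        if key in seen:
            continue  # drop duplicates, first occurrence wins
        seen.add(key)
        clean.append({
            "customer_id": int(key),
            "signup": signup,
            "country": row["country"].strip().upper(),
        })
    return clean, rejected


clean_rows, rejected_rows = cleanse(RAW_ROWS)
print(clean_rows)
print(len(rejected_rows), "rows rejected")
```

Keeping the rejected rows instead of
silently dropping them mirrors the error
handling and auditing role described
above.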
Middle Tier - Core Data Warehouse:
1. Data Integration: The core data
warehouse integrates data from
multiple sources, resolving any data
inconsistencies, redundancies, or
conflicts. This process ensures a
consistent and unified view of data
across the organization.
2. Subject-Independent Data
Structure: The core data warehouse
typically employs a subject-
independent data model, such as a
normalized or denormalized schema, to
store atomic-level data. This structure
allows for maximum flexibility in data
analysis and reporting.
3. History and Audit Tracking: The core
data warehouse maintains a historical
record of data changes over time,
enabling time-based analysis and data
auditing capabilities.
4. Data Partitioning and Indexing: To
optimize query performance, the core
data warehouse employs partitioning
and indexing strategies based on
common access patterns and workload
characteristics.
5. Backup and Recovery: Robust
backup and recovery mechanisms are
implemented to ensure data integrity
and business continuity in case of
system failures or disasters.
Top Tier - Data Marts:
1. Subject-Oriented Data Structure:
Data marts are organized around
specific subjects or business areas,
such as sales, finance, or marketing.
The data structure is optimized for the
specific analytical requirements of each
subject area, often using a dimensional
or denormalized schema.
2. Data Summarization and
Aggregation: Data marts typically
contain summarized and aggregated
data derived from the core data
warehouse, tailored for specific
reporting and analysis needs. This
helps improve query performance for
common analytical workloads.
3. Query Optimization: Data marts are
designed and optimized for specific
query patterns and workloads,
employing techniques such as
materialized views, indexing, and
caching to enhance query
performance.
4. User Access Controls: Data marts
often have more granular user access
controls and security measures in place
to ensure data privacy and compliance
with relevant regulations and policies.
5. Departmental or Functional Focus:
Data marts are typically focused on
serving the analytical needs of specific
departments or functional areas within
an organization, such as marketing,
finance, or sales.
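The summarization step that feeds a data
mart can be sketched as a simple
roll-up. The fact rows and the choice of
month and region as the summary level
are hypothetical; a real mart would be
built with SQL or an ETL tool rather
than in-memory Python.

```python
from collections import defaultdict

# Hypothetical atomic sales facts as held in the core warehouse.
CORE_SALES = [
    {"date": "2024-01-05", "region": "East", "product": "pen", "amount": 20},
    {"date": "2024-01-20", "region": "East", "product": "book", "amount": 50},
    {"date": "2024-01-11", "region": "West", "product": "pen", "amount": 30},
    {"date": "2024-02-02", "region": "East", "product": "pen", "amount": 10},
]


def build_sales_mart(facts):
    """Summarize atomic facts to one total per (month, region)."""
    totals = defaultdict(int)
    for fact in facts:
        month = fact["date"][:7]  # "YYYY-MM" prefix of the ISO date
        totals[(month, fact["region"])] += fact["amount"]
    return dict(totals)


sales_mart = build_sales_mart(CORE_SALES)
print(sales_mart)
```

Queries against the summary touch one
row per month and region instead of one
row per sale, which is where the query
performance gain comes from.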
It's important to note that the three-tier
architecture is a logical separation of
concerns, and in practice, the physical
implementation may vary based on
organizational needs, data volumes, and
performance requirements. Some
organizations may combine the staging
area and core data warehouse into a
single physical layer, while others may
have multiple core data warehouses or
data marts for different business units or
use cases.
The three-tier data warehouse
architecture promotes data quality,
scalability, and performance optimization
while providing a structured approach to
managing and analyzing large volumes of
data from diverse sources.
Q3. What are the steps in designing a
data warehouse?
Ans..) Designing a data warehouse
involves several key steps to ensure it
meets the organization's analytical and
reporting requirements. Here are the
typical steps involved in designing a
data warehouse:
1. Define Business Requirements and
Goals : Understand the organization's
business objectives, key performance
indicators (KPIs), and the types of
analyses and reports required. This step
helps determine the scope and
requirements of the data warehouse.
2. Identify and Analyse Data Sources :
Identify the various operational data
sources (e.g., transactional systems,
databases, flat files) that will feed data
into the data warehouse. Analyse the
data structures, data quality, and
consistency of these sources.
3. Design the Dimensional Model :
Choose the appropriate dimensional
modelling technique (e.g., star schema,
snowflake schema) to organize the data
in a way that supports efficient querying
and analysis. This involves identifying
the facts, dimensions, and hierarchies
based on the business requirements.
4. Design the Extract, Transform, and
Load (ETL) Process : Plan the ETL
processes that will extract data from
source systems, transform and clean the
data as per the dimensional model, and
load it into the data warehouse. This
includes designing data mappings,
defining transformation rules, and
scheduling ETL jobs.
5. Design the Data Staging Area :
Determine the requirements for the data
staging area, where data will be
temporarily stored, cleansed, and
transformed before loading into the data
warehouse.
6. Design the Data Warehouse
Architecture : Decide on the appropriate
architecture (e.g., three-tier, hub-and-
spoke) based on the organization's
scalability, performance, and
maintenance needs. This includes
determining the hardware infrastructure,
database management system, and
storage requirements.
7. Design the Data Marts : Identify the
specific subject areas or departmental
requirements and design the
corresponding data marts. Data marts
are subsets of the data warehouse
optimized for specific analytical
workloads.
8. Design the Metadata Repository :
Plan for a metadata repository to store
and manage metadata about the data
warehouse objects, data sources,
transformations, and business rules.
Metadata is essential for data lineage,
documentation, and impact analysis.
9. Design the Security and Access
Controls : Define the security and access
control measures to ensure data privacy,
confidentiality, and compliance with
relevant regulations and policies.
10. Design the Backup and Recovery
Strategies : Develop strategies for
backup, recovery, and disaster recovery
to maintain data integrity and business
continuity.
11. Design the Monitoring and
Management Processes : Establish
processes for monitoring the data
warehouse performance, usage, and
data quality, as well as processes for
managing changes and enhancements.
12. Develop and Test the Data
Warehouse : Implement the designed
components, including the ETL
processes, dimensional models, and
data marts. Conduct thorough testing to
validate the data quality, performance,
and accuracy of the data warehouse.
13. Deploy and Maintain the Data
Warehouse : Roll out the data
warehouse into production, provide user
training, and establish ongoing
maintenance and enhancement
processes based on evolving business
needs.
Designing a data warehouse is an
iterative process that may involve
revisiting and refining the design
decisions based on feedback,
performance evaluations, and changing
business requirements.
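The dimensional model from step 3 can be
sketched as a small star schema using
Python's built-in sqlite3 module. The
table names, columns, and sample rows
are hypothetical and chosen only to show
one fact table joined to two dimension
tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- The fact table holds measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    revenue REAL);
INSERT INTO dim_date VALUES (20240115, 2024, 1), (20240210, 2024, 2);
INSERT INTO dim_product VALUES
    (1, 'pen', 'stationery'), (2, 'novel', 'books');
INSERT INTO fact_sales VALUES
    (20240115, 1, 10, 25.0),
    (20240115, 2, 1, 12.0),
    (20240210, 1, 4, 10.0);
""")

# A typical analytical query: join facts to dimensions, then aggregate.
revenue_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(revenue_by_category)
```

In a snowflake schema the category would
move out of dim_product into its own
table, at the cost of one more join.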

Q4. Explain about the Apriori algorithm
for finding frequent itemsets with an
example.
Ans..) The Apriori algorithm is a popular
algorithm used in data mining for finding
frequent itemsets from a given dataset.
A frequent itemset is a set of items that
frequently appear together in the
transactions of a dataset. The Apriori
algorithm is designed to operate on
databases containing transactions,
where each transaction is a set of items.
The algorithm works in two steps:
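1. Generate frequent itemsets : The
algorithm works level by level. It first
counts the support of every single item
(i.e., the number or fraction of
transactions in which it appears), then
repeatedly builds candidate k-itemsets
from the frequent (k-1)-itemsets and
counts their support. Because every
subset of a frequent itemset must itself
be frequent (the Apriori property),
candidates with an infrequent subset are
discarded without being counted.
Itemsets whose support meets a
user-specified minimum support
threshold are considered frequent.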
2. Generate association rules : From the
frequent itemsets, the algorithm
generates association rules that satisfy a
minimum confidence threshold. An
association rule is an expression of the
form X => Y, where X and Y are disjoint
itemsets. The confidence of the rule is
the conditional probability of Y occurring
given that X has occurred.
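The two measures can be written out as
formulas. The symbol D for the set of all
transactions and the names supp and conf
are notation introduced here for
illustration:

```latex
\[
\operatorname{supp}(X)
  = \frac{\lvert \{\, t \in D : X \subseteq t \,\} \rvert}{\lvert D \rvert},
\qquad
\operatorname{conf}(X \Rightarrow Y)
  = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)}
\]
```

Since both supports share the
denominator |D|, the confidence can
equally be computed from raw occurrence
counts.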
Here's an example to illustrate the
Apriori algorithm:
Consider the following dataset of
transactions, where each transaction is
represented by a set of items (A, B, C, D,
E):
T1: {A, B, C, D}
T2: {B, C, E}
T3: {A, B, C, E}
T4: {A, B, D}
T5: {A, C}
T6: {B, C, D}
T7: {A, C, D}
T8: {A, B, C}
T9: {A, B, D}
Let's assume a minimum support
threshold of 3 (i.e., an itemset must
appear in at least 3 transactions to be
considered frequent).
Step 1: Generate frequent itemsets.
- The algorithm starts by counting the
occurrences of each individual item (1-
itemsets) in the dataset:
- A: 7 occurrences
- B: 7 occurrences
- C: 7 occurrences
- D: 5 occurrences
- E: 2 occurrences
- Since the minimum support threshold
is 3, the frequent 1-itemsets are {A},
{B}, {C}, {D}. {E} is dropped, so no
larger itemset containing E is
considered.
- Next, the algorithm generates
candidate 2-itemsets by combining the
frequent 1-itemsets and counts their
occurrences:
- {A, B}: 5 occurrences
- {A, C}: 5 occurrences
- {A, D}: 4 occurrences
- {B, C}: 5 occurrences
- {B, D}: 4 occurrences
- {C, D}: 3 occurrences
- All six candidates meet the threshold,
so the frequent 2-itemsets are {A, B},
{A, C}, {A, D}, {B, C}, {B, D}, {C, D}.
- The algorithm then generates
candidate 3-itemsets and counts their
occurrences:
- {A, B, C}: 3 occurrences
- {A, B, D}: 3 occurrences
- {A, C, D}: 2 occurrences
- {B, C, D}: 2 occurrences
- The frequent 3-itemsets are {A, B, C}
and {A, B, D}.
- The only candidate 4-itemset,
{A, B, C, D}, is pruned without counting
because its subsets {A, C, D} and
{B, C, D} are not frequent, so the
search stops.
Step 2: Generate association rules.
- From the frequent itemsets, the
algorithm generates association rules
that satisfy a minimum confidence
threshold (e.g., 60% confidence).
- For example, from the frequent itemset
{A, B, C} (3 occurrences), the algorithm
can generate rules like:
- A, B => C (confidence = 3/5 = 60%)
- A, C => B (confidence = 3/5 = 60%)
- B, C => A (confidence = 3/5 = 60%)
- Each of these just meets the 60%
threshold. From {A, B, D}, the rule
A, D => B is stronger (confidence =
3/4 = 75%).
The Apriori algorithm is an iterative
process that generates candidate
itemsets of increasing length and prunes
them based on the support threshold. It
is computationally expensive for large
datasets because it rescans the
transactions at every level and may
generate many candidates. Alternatives
such as FP-growth avoid explicit
candidate generation to improve
efficiency.
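The level-wise search and the rule
generation can be sketched in Python on
the nine transactions of the example.
This is a compact teaching version under
simple assumptions (support given as a
count, brute-force counting), not an
optimized implementation; the function
names are illustrative.

```python
from itertools import combinations

TRANSACTIONS = [
    {"A", "B", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
    {"A", "B", "D"}, {"A", "C"}, {"B", "C", "D"},
    {"A", "C", "D"}, {"A", "B", "C"}, {"A", "B", "D"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    items = sorted({item for t in transactions for item in t})
    current = {}
    for item in items:
        candidate = frozenset([item])
        n = support_count(candidate, transactions)
        if n >= min_count:
            current[candidate] = n
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            n = support_count(c, transactions)
            if n >= min_count:
                current[c] = n
        frequent.update(current)
        k += 1
    return frequent


def rules(frequent, min_conf):
    """Yield (antecedent, consequent, confidence) for rules meeting min_conf."""
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = count / frequent[lhs]
                if conf >= min_conf:
                    yield lhs, itemset - lhs, conf


frequent_itemsets = apriori(TRANSACTIONS, min_count=3)
strong_rules = list(rules(frequent_itemsets, min_conf=0.6))
for itemset, count in sorted(frequent_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what distinguishes
Apriori from counting every possible
itemset: a candidate is only counted if
all of its smaller subsets already
passed the threshold.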
Q5. Explain in detail about Data mining
functionalities?
Ans..) Data mining involves several
functionalities or tasks that aim to
uncover patterns, relationships, and
insights from large datasets. The
primary data mining functionalities are:
1. Classification : This involves
assigning items in a dataset to
predefined classes or categories.
Classification algorithms learn from a
training dataset containing instances
with known class labels and build a
model that can classify new, unseen
instances into the appropriate classes.
Common classification techniques
include decision trees, logistic
regression, naive Bayes, support vector
machines, and neural networks.
2. Regression : Regression is used to
predict or estimate a continuous
numerical value based on one or more
input variables. It finds the relationship
between the dependent variable (the
value to be predicted) and the
independent variables (the predictors).
Linear regression, polynomial
regression, and support vector
regression are examples of regression
techniques.
3. Clustering : Clustering is an
unsupervised learning technique that
groups similar instances or data points
together based on their characteristics
or features. The goal is to identify
natural clusters or groups within the
data without any prior knowledge of
their membership. K-means, hierarchical
clustering, and density-based clustering
(e.g., DBSCAN) are popular clustering
algorithms.
4. Association Rule Mining : This
functionality aims to discover interesting
relationships or associations between
items in a dataset. Association rule
mining is commonly used in market
basket analysis to understand customer
purchasing patterns and identify
frequently co-occurring items. The
Apriori algorithm and FP-growth are
widely used for association rule mining.
5. Anomaly Detection : Anomaly
detection, also known as outlier
detection, involves identifying rare or
unusual instances that deviate
significantly from the normal pattern or
behavior in a dataset. These anomalies
can be indicative of potential issues,
fraud, or new opportunities. Techniques
like statistical methods, distance-based
methods, and density-based methods
are used for anomaly detection.
6. Sequence Analysis : Sequence
analysis involves identifying patterns or
trends in sequential data, such as time-
series data, biological sequences, or
customer journeys. It can be used for
tasks like identifying recurring patterns,
predicting future events, or
understanding customer behavior over
time. Techniques like hidden Markov
models, frequent pattern mining, and
time-series analysis are employed for
sequence analysis.
7. Text Mining : Text mining focuses on
extracting meaningful insights and
knowledge from unstructured text data,
such as documents, emails, social media
posts, and web pages. It involves tasks
like text categorization, sentiment
analysis, topic modeling, and named
entity recognition. Natural language
processing (NLP) techniques and
machine learning algorithms are
commonly used for text mining.
8. Web Mining : Web mining involves
analyzing and extracting valuable
information from web data, including
web content, web structure (hyperlinks),
and web usage data (user browsing
patterns). It can be used for tasks like
information retrieval, website
optimization, personalization, and user
behavior analysis.
9. Visualization : Data visualization is an
essential functionality in data mining
that helps in representing and
communicating the patterns,
relationships, and insights discovered
from the data in a visual and
interpretable manner. Techniques like
scatter plots, heatmaps, network
diagrams, and interactive dashboards
are commonly used for data
visualization.
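As an illustration of the clustering
functionality (item 3), the k-means idea
can be sketched on one-dimensional data.
The spend values, the starting centroids,
and the function name are hypothetical;
real use would normally rely on a
library and on multi-dimensional
features.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on numbers: assign to nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]  # empty cluster keeps its centroid
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customer spend values with two visible groups.
spend = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = kmeans_1d(spend, centroids=[1.0, 3.0])
print(centers, groups)
```

No class labels are supplied anywhere,
which is what makes this unsupervised:
the groups emerge only from the
distances between the values.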
These data mining functionalities are
often combined and used together to
address various business problems, such
as customer segmentation, fraud
detection, recommender systems, risk
analysis, and predictive maintenance,
among others.
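For fraud detection, one of the simplest
statistical methods of anomaly detection
(item 5) is a z-score test. The sketch
below assumes roughly bell-shaped data
and a cutoff of 2 standard deviations;
both the cutoff and the sample amounts
are illustrative choices.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than threshold std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical daily transaction amounts with one suspicious spike.
amounts = [20, 22, 19, 21, 20, 23, 18, 21, 20, 95]
suspicious = zscore_outliers(amounts)
print(suspicious)
```

A single extreme value also inflates the
mean and standard deviation it is judged
against, which is one reason
distance-based and density-based methods
are used as well.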
