Data Warehousing and Data Mining Lab
VISION
“To be centre of excellence in education, research and technology transfer in the field
of computer engineering and promote entrepreneurship and ethical values.”
MISSION
“To foster an open, multidisciplinary and highly collaborative research environment to
produce world-class engineers capable of providing innovative solutions to real life
problems and fulfill societal needs.”
Department of Computer Science and Engineering
Rubrics for Lab Assessment
Each rubric is scored 0 (Missing), 1 (Inadequate), 2 (Needs Improvement) or 3 (Adequate).

R1: Is able to identify the problem to be solved and define the objectives of the experiment.
0 (Missing): No mention is made of the problem to be solved.
1 (Inadequate): An attempt is made to identify the problem to be solved but it is described in a confusing manner, objectives are not relevant, objectives contain technical/conceptual errors or objectives are not measurable.
2 (Needs Improvement): The problem to be solved is described but there are minor omissions or vague details. Objectives are conceptually correct and measurable but may be incomplete in scope or have linguistic errors.
3 (Adequate): The problem to be solved is clearly stated. Objectives are complete, specific, concise, and measurable. They are written using correct technical terminology and are free from linguistic errors.

R2: Is able to design a reliable experiment that solves the problem.
0 (Missing): The experiment does not solve the problem.
1 (Inadequate): The experiment attempts to solve the problem but due to the nature of the design the data will not lead to a reliable solution.
2 (Needs Improvement): The experiment attempts to solve the problem but due to the nature of the design there is a moderate chance the data will not lead to a reliable solution.
3 (Adequate): The experiment solves the problem and has a high likelihood of producing data that will lead to a reliable solution.

R3: Is able to communicate the details of an experimental procedure clearly and completely.
0 (Missing): Diagrams are missing and/or experimental procedure is missing or extremely vague.
1 (Inadequate): Diagrams are present but unclear and/or experimental procedure is present but important details are missing.
2 (Needs Improvement): Diagrams and/or experimental procedure are present but with minor omissions or vague details.
3 (Adequate): Diagrams and/or experimental procedure are clear and complete.

R4: Is able to record and represent data in a meaningful way.
0 (Missing): Data are either absent or incomprehensible.
1 (Inadequate): Some important data are absent or incomprehensible.
2 (Needs Improvement): All important data are present, but recorded in a way that requires some effort to comprehend.
3 (Adequate): All important data are present, organized and recorded clearly.

R5: Is able to make a judgment about the results of the experiment.
0 (Missing): No discussion is presented about the results of the experiment.
1 (Inadequate): A judgment is made about the results, but it is not reasonable or coherent.
2 (Needs Improvement): An acceptable judgment is made about the result, but the reasoning is flawed or incomplete.
3 (Adequate): An acceptable judgment is made about the result, with clear reasoning. The effects of assumptions and experimental uncertainties are considered.
PRACTICAL RECORD
Branch: CSE-2
5. Implementation of Classification technique on ARFF files using WEKA.
6. Implementation of Clustering technique on ARFF files using WEKA.
Theory:
The ETL (Extract, Transform, Load) process is vital for data warehousing and integration, facilitating the
collection of data from multiple sources, transforming it into a structured format, and loading it into a central
repository for analysis, reporting, and decision-making. Here's a breakdown of each stage in the ETL process:
1. Extract
o Purpose: This step involves extracting data from various sources such as transactional
databases, spreadsheets, legacy systems, or cloud services.
o Process: Extraction methods vary depending on the data source. For databases, it might involve
SQL queries; for web sources, APIs are used; and for unstructured data, techniques like web
scraping are employed.
o Challenges: Dealing with multiple data sources often involves different formats and structures,
requiring careful handling to maintain accuracy and completeness.
2. Transform
o Purpose: Here, the extracted data is processed and converted into a usable format that meets
the requirements of the target data warehouse.
o Process: This involves data cleaning (removing duplicates, handling missing values), data
mapping, aggregation, and standardizing formats (like dates or currency).
o Techniques: Common transformations include filtering data, merging datasets, calculating new
metrics, applying business rules, and reformatting data types.
o Challenges: Ensuring data integrity and consistency requires a clear understanding of business
rules and data relationships.
3. Load
o Purpose: The final step is to load the transformed data into a target system, typically a data
warehouse or data mart, where it can be accessed for analysis and reporting.
o Process: Data loading can be done in two ways:
Full Load: Loading all data at once, typically during initial setup or major data refreshes.
Incremental Load: Loading only new or updated data to keep the dataset current
without redundancy.
o Challenges: Ensuring efficient, reliable data loads that meet performance requirements,
particularly in large-scale environments with high data volumes.
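To make the three stages above concrete, here is a minimal, illustrative ETL sketch in Python with pandas; the source file sales.csv, its columns (order_date, amount), and the SQLite target are hypothetical stand-ins for a real warehouse pipeline:

import pandas as pd
import sqlite3

# Extract: read raw data from a hypothetical CSV source
raw = pd.read_csv("sales.csv")

# Transform: remove duplicates, fill missing amounts, standardize the date format
raw = raw.drop_duplicates()
raw["amount"] = raw["amount"].fillna(0)
raw["order_date"] = pd.to_datetime(raw["order_date"]).dt.strftime("%Y-%m-%d")

# Load: write the cleaned data into a target table (SQLite used as a stand-in warehouse)
conn = sqlite3.connect("warehouse.db")
raw.to_sql("sales_fact", conn, if_exists="replace", index=False)
conn.close()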
Common ETL Tools
Several ETL tools are widely used in the industry, each offering unique features and capabilities:
1. Informatica PowerCenter: Known for its robust data integration capabilities, data quality assurance,
and support for various data formats.
2. Apache NiFi: An open-source tool focused on automating data flows, with an emphasis on security and
scalability.
3. Microsoft SQL Server Integration Services (SSIS): A popular ETL tool provided by Microsoft, designed
for data migration, integration, and transformation tasks within the SQL Server environment.
4. Talend: An open-source ETL tool offering extensive data integration features suitable for data cleansing,
transformation, and loading in both cloud and on-premises environments.
5. Pentaho Data Integration (PDI): Part of the Pentaho suite, offering user-friendly data integration and
transformation features, supporting big data and analytics.
Importance of ETL in Data Warehousing
The ETL process is crucial for building reliable and efficient data warehouses. It ensures data from diverse sources is unified, cleaned, and standardized, providing a single source of truth for analytics and reporting. This consistency is vital for businesses to make accurate, data-driven decisions, ensuring all departments rely on the same high-quality data.
Additionally, the ETL process supports data governance and compliance, as transformations can enforce
data quality standards and regulatory requirements, which is essential for industries with strict
compliance regulations.
Modern ETL tools often support real-time data integration, enabling organizations to access and analyze
up-to-date information. This capability is particularly valuable for businesses that rely on timely data for
critical operations, such as financial services, healthcare, and e-commerce.
ETL processes play a crucial role in improving data quality by incorporating data cleansing and validation
steps. These processes identify and rectify inconsistencies, inaccuracies, and duplicates, ensuring that
the data used for analysis and reporting is accurate and reliable. This leads to more credible insights
and better-informed decision-making.
ETL processes also facilitate the management of historical data by archiving and maintaining data from
various time periods. This historical data can be invaluable for trend analysis, forecasting, and long-term
strategic planning, providing a comprehensive view of the organization's performance over time.
In summary, the ETL process is not just a technical necessity but a strategic asset for organizations
looking to harness the full potential of their data. It ensures that data is accessible, accurate, and
actionable, forming the backbone of effective data-driven decision-making. With the right ETL tools and
processes in place, businesses can achieve higher efficiency, improved data quality, and greater
flexibility, positioning themselves for success in an increasingly data-centric world.
Experiment – 2
Aim: Program of Data warehouse cleansing to input names from users (inconsistent) and format them.
Theory:
Data cleansing, also known as data cleaning or data scrubbing, is a crucial step in preparing data for analysis
and storage in a data warehouse. It involves identifying, correcting, or removing inaccuracies, inconsistencies,
and errors from datasets to ensure high data quality. Clean data is essential in a data warehouse because it
directly impacts the accuracy and reliability of insights derived from the data. Errors in data can lead to
misleading conclusions, faulty analysis, and suboptimal decision-making.
Importance of Data Cleansing
1. Enhances Data Quality: Clean data is accurate, consistent, and complete. Quality data provides reliable
insights and reduces the likelihood of errors in analytical outcomes.
2. Improves Decision Making: Accurate and consistent data lead to more effective decisions, driving
better business outcomes.
3. Boosts Efficiency: Data cleansing streamlines data processing by reducing redundancy and
standardizing data formats, making the data easier to analyze and interpret.
4. Maintains Consistency: Standardizing data formats across sources ensures uniformity, enabling
seamless integration in a data warehouse.
5. Ensures Compliance: Many industries require data accuracy for regulatory compliance. Data cleansing
ensures that data meets industry standards.
Common Issues in Data Cleansing
Inconsistent data can arise from various sources, including human error, different data formats, or legacy
systems. Some typical issues include:
1. Inconsistent Formatting: Variations in capitalization, spacing, or punctuation (e.g., "john doe" vs. "John
Doe").
2. Duplicate Entries: Repeated records for the same entity, leading to skewed analysis.
3. Incomplete Data: Missing information in one or more fields.
4. Incorrect Data: Values that do not match expected patterns or contain obvious errors (e.g., incorrect
phone numbers or email formats).
Data Cleansing Process
The data cleansing process typically includes these steps:
1. Data Profiling: Analyzing the data to understand its structure, content, and patterns. This helps identify
specific areas that require cleansing.
2. Data Standardization: Applying uniform formats to data, such as consistent capitalization, removing
special characters, or using standardized date formats.
3. Data Validation: Checking data against predefined rules or patterns to identify outliers or inaccuracies.
4. Data Enrichment: Filling missing information or correcting data using external reference data.
5. Data Deduplication: Identifying and removing duplicate records to avoid redundancy.
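As a small illustration of the standardization and validation steps listed above, a pandas sketch might look like the following; the records and the simple e-mail pattern are made up for demonstration:

import pandas as pd
import re

# Illustrative records with formatting and validity issues
df = pd.DataFrame({
    'name':  ['  john DOE ', 'Jane  Smith', 'ALICE brown'],
    'email': ['john@example.com', 'jane@example', 'alice@example.com'],
})

# Standardization: trim, collapse internal spaces, and title-case the names
df['name'] = df['name'].str.strip().str.split().str.join(' ').str.title()

# Validation: flag e-mail addresses that do not match a simple pattern
pattern = re.compile(r'^[\w.+-]+@[\w-]+\.[\w.]+$')
df['email_valid'] = df['email'].apply(lambda e: bool(pattern.match(e)))

print(df)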
Tools for Data Cleansing
Many data integration and ETL tools offer data cleansing functionalities. Here are some widely used tools:
Informatica Data Quality: Provides data profiling, cleansing, and standardization features, popular for
enterprise-level data cleansing.
Trifacta: Known for its user-friendly interface, offering data profiling, transformation, and visualization
for cleansing workflows.
OpenRefine: An open-source tool that allows users to clean and transform data in bulk, with features
for clustering similar values and removing duplicates.
Python: Libraries like pandas offer robust data manipulation functions, enabling custom data cleansing
scripts tailored to specific needs.
Code:
def cleanse_name(name):
    # Remove leading and trailing spaces
    cleaned_name = name.strip()
    # Replace multiple spaces with a single space
    cleaned_name = " ".join(cleaned_name.split())
    # Capitalize each word in the name
    cleaned_name = cleaned_name.title()
    return cleaned_name
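Example usage, assuming the function above is run interactively as the aim describes (the sample input shown in the comments is illustrative):

# Read an inconsistently formatted name from the user and cleanse it
raw_name = input("Enter a name: ")                 # e.g. "   jOhN    doE "
print("Cleansed name:", cleanse_name(raw_name))    # -> "John Doe"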
Redundancy in data warehousing occurs when duplicate records or entries are present, leading to inaccurate
analysis, increased storage costs, and performance inefficiencies. Redundant data can emerge from
integrating data from multiple sources, manual data entry errors, or other inconsistencies. Removing
redundancy is essential for maintaining data quality, improving efficiency in data processing, and ensuring
accurate analysis.
Common Techniques for Removing Redundancy
1. Deduplication: Identifying and removing duplicate records based on specific columns or combinations
of columns.
2. Primary Key Constraints: Ensuring unique identifiers (such as IDs) in a database to prevent duplicate
entries.
3. Data Merging and Consolidation: Aggregating or merging data from multiple sources and applying
deduplication rules.
4. Standardization: Normalizing data fields (such as name or address) to consistent formats, which helps
in identifying duplicates.
Benefits of Removing Redundancy
Improved Data Accuracy: By eliminating duplicate entries, the data becomes more accurate, leading to
more reliable insights and decisions.
Cost Efficiency: Reducing redundant data decreases storage requirements and associated costs.
Enhanced Performance: Streamlined data sets improve database performance and reduce the time
needed for data processing tasks.
Better Data Quality: Ensuring that only unique and accurate data is stored enhances the overall quality
of the data, making it more useful for analysis.
Challenges in Removing Redundancy
Complexity in Identification: Identifying duplicate records, especially in large datasets with complex
structures, can be challenging.
Maintaining Data Integrity: Ensuring that deduplication processes do not inadvertently remove valid
records or introduce errors.
Consistency Across Systems: Achieving consistency in data formats and standards across different
sources and systems requires careful planning and execution.
Tools for Removing Redundancy
Several tools are widely used for data deduplication and cleansing, including:
Informatica Data Quality: Offers comprehensive deduplication and data cleansing features, suitable for
large-scale enterprise environments.
Trifacta: Provides user-friendly interfaces for data profiling and transformation, aiding in the
identification and removal of duplicates.
OpenRefine: An open-source tool that supports bulk data cleaning, deduplication, and transformation
with powerful clustering algorithms.
Code:
import pandas as pd

# Sample dataset with duplicate entries
data = {
    'CustomerID': [101, 102, 103, 104, 101, 102],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Bob'],
    'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com',
              'david@example.com', 'alice@example.com', 'bob@example.com'],
    'PurchaseAmount': [250, 150, 300, 200, 250, 150]
}

# Load data into a DataFrame
df = pd.DataFrame(data)
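The frame loaded above still contains the repeated rows for CustomerID 101 and 102; one possible deduplication step, a minimal sketch using pandas' built-in drop_duplicates, follows:

# Remove exact duplicate rows, keeping the first occurrence
deduplicated_df = df.drop_duplicates()
# Alternatively, treat any rows sharing a CustomerID as duplicates
deduplicated_by_id = df.drop_duplicates(subset=['CustomerID'], keep='first')
print(deduplicated_df)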
Classification is a supervised machine learning technique used to categorize data points into predefined
classes or labels. By training on a labelled dataset, a classification algorithm learns patterns in the data to
predict the class labels of new, unseen instances. This technique is crucial in applications such as spam
detection, medical diagnosis, sentiment analysis, and image recognition.
1. Supervised Learning
o In supervised learning, models are trained on a dataset with known labels. Each instance in the
training set includes features (input variables) and a target class label (output variable), allowing
the model to learn from examples.
2. Types of Classification
o Binary Classification: Involves two classes, such as "spam" vs. "not spam."
o Multiclass Classification: Involves more than two classes, such as classifying images as "cat," "dog," or "bird."
3. Common Classification Algorithms
o Decision Trees: Utilize a tree-like model to make decisions based on attribute values, with nodes
representing features and branches indicating decision outcomes. Examples include CART and
C4.5 algorithms.
o Naive Bayes: A probabilistic classifier based on Bayes' theorem, assuming feature independence
within each class. It’s fast and effective for text classification tasks.
o k-Nearest Neighbors (k-NN): A non-parametric method that classifies a point based on the
majority label of its closest k neighbors in the feature space.
o Support Vector Machines (SVM): Finds a hyperplane that maximally separates classes in a high-
dimensional space, often effective for complex and high-dimensional data.
o Neural Networks: Complex models inspired by the human brain, effective for large datasets and
capable of learning intricate patterns through multiple hidden layers.
4. Training and Testing Data
o Training Set: The portion of data used to train the classification model.
o Testing Set: A separate portion of data used to evaluate model performance. Typically, data is
split into 70-80% for training and 20-30% for testing.
5. Evaluation Metrics
o Precision: The fraction of true positive predictions out of all positive predictions made (useful
when false positives are costly).
o Recall (Sensitivity): The fraction of true positive predictions out of all actual positives (useful
when false negatives are costly).
o F1 Score: The harmonic mean of precision and recall, balancing both metrics.
o Confusion Matrix: A matrix summarizing correct and incorrect predictions for each class,
providing insights into specific errors.
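For example, with illustrative numbers: if a spam classifier produces 40 true positives, 10 false positives and 20 false negatives, then precision = 40/50 = 0.80, recall = 40/60 ≈ 0.67, and F1 = 2 × (0.80 × 0.67) / (0.80 + 0.67) ≈ 0.73.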
3. Algorithm Selection: Choose a classification algorithm based on your data and analysis needs.
5. Model Testing: Test the model using the testing set and evaluate its performance with metrics like
accuracy, precision, recall, and F1 score.
4. Click on “Classify” and choose the "IBk" algorithm from the "lazy" section of the classifiers.
5. Configure the model by clicking on the “IBk” classifier.
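WEKA's IBk is a k-nearest-neighbours classifier. For comparison outside the GUI, a minimal scikit-learn sketch of the same kind of experiment is shown below; the built-in Iris data stands in for the ARFF file, and the 70/30 split and k = 3 are illustrative choices, not part of the lab procedure:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Stand-in dataset (in the lab, the ARFF file is loaded through WEKA instead)
X, y = load_iris(return_X_y=True)

# 70/30 train-test split, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# k-nearest neighbours, the same family of classifier as WEKA's IBk
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Evaluate with accuracy, precision, recall and F1
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))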
Clustering is an unsupervised machine learning technique used to group similar data points into clusters.
The goal is to organize a dataset into distinct groups where data points in the same group (or cluster) are
more similar to each other than to those in other clusters. Unlike classification, clustering does not rely on
labelled data; instead, it explores inherent structures within the data itself.
1. Unsupervised Learning
2. Common Clustering Algorithms
o k-Means Clustering:
Partitions data into k clusters by minimizing the variance within each cluster.
Starts by selecting k initial centroids (center points), assigns each data point to the
nearest centroid, and iteratively updates centroids based on cluster members.
o Hierarchical Clustering:
Agglomerative (bottom-up): Each data point starts as its own cluster, which
gradually merges into larger clusters.
Divisive (top-down): Starts with all data points in a single cluster and splits
them into smaller clusters.
Often visualized through dendrograms, making it easy to see how clusters split or
merge at each level.
o Density-Based Clustering (e.g., DBSCAN):
Groups points that are close to each other (dense regions) while marking points in low-density regions as outliers.
Effective for identifying arbitrarily shaped clusters and handling noise, unlike k-means, which assumes spherical clusters.
Flexible and allows for overlapping clusters, making it useful for complex
distributions.
3. Distance Measures
Cosine Similarity: For text or sparse data, measuring the angle between vectors
rather than direct distance.
3. Algorithm Selection: Choose a clustering algorithm based on your data and analysis needs.
5. Model Evaluation: Evaluate the model's performance using metrics like cluster purity, silhouette
score, or other relevant measures.
6. Visualization: Use WEKA’s visualization tools to explore and interpret clustering results.
Steps:
1. Launch the Weka GUI and click on the "Explorer" button to open the Weka Explorer.
4. Click on “Cluster” and choose the "SimpleKMeans" algorithm from the list of clusterers.
5. Configure the model by clicking on “SimpleKMeans”.
6. Click on “Start” and WEKA will provide a K-Means summary.
7. We can visualize the clustering results in the visualization tab.
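For comparison outside the WEKA GUI, a minimal k-means sketch with scikit-learn is shown below; the built-in Iris data stands in for the ARFF file, and k = 3 is an illustrative choice:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in dataset (in the lab, the ARFF file is loaded through WEKA instead)
X, _ = load_iris(return_X_y=True)

# k-means with 3 clusters, comparable in spirit to WEKA's SimpleKMeans
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("Silhouette score:", silhouette_score(X, labels))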
Experiment – 7
Aim: Implementation of Association Rule technique on ARFF files using WEKA.
Theory:
Association rule mining is a data mining technique used to discover interesting relationships, patterns, or
associations among items in large datasets. It is widely used in fields like market basket
analysis, where the goal is to understand how items co-occur in transactions. For instance, an association
rule might identify that "Customers who buy bread also tend to buy butter." Such insights can support
product placements, recommendations, and targeted marketing.
Key Concepts in Association Rule Mining
1. Association Rules:
o An association rule is typically in the form of X→Y, which means "If itemset X is present in
a transaction, then itemset Y is likely to be present as well."
o Each association rule is evaluated based on three main metrics:
Support: The frequency with which an itemset appears in the dataset. Support for X→Y is calculated as:
Support(X→Y) = (Number of transactions containing both X and Y) / (Total number of transactions)
Higher support indicates that the rule is relevant for a larger portion of the dataset.
Confidence: The likelihood of seeing Y in a transaction that already contains X. Confidence for X→Y is calculated as:
Confidence(X→Y) = Support(X ∪ Y) / Support(X)
Steps:
1. Launch the Weka GUI and click on the "Explorer" button to open the Weka Explorer.
2. Click on the "Open file" button to load your dataset.
Data visualization is the graphical representation of information and data. By employing visual elements
like charts, graphs, and maps, data visualization tools provide an accessible way to comprehend and
interpret trends, outliers, and patterns within data. This technique is crucial in data analysis, as it
transforms complex datasets into a more understandable and actionable format. Effective data
visualization not only enhances comprehension but also facilitates communication, exploration, and
decision-making processes.
1. Enhances Understanding
o Visualization simplifies complex data by converting it into a visual format that is easier to
interpret. This is particularly important for large datasets with numerous variables. By
visualizing data, users can quickly grasp the underlying patterns and relationships that might
be difficult to detect in raw data form.
2. Reveals Insights
o Visualizations can uncover hidden insights, correlations, and trends that may not be
apparent in raw data. For instance, scatter plots can reveal relationships between two
variables, while heatmaps can display patterns across multiple dimensions. This ability to
surface insights makes data visualization a powerful tool in data exploration and analysis.
3. Facilitates Communication
4. Supports Exploration
o Interactive visualizations allow users to explore data from different angles, facilitating
discovery and hypothesis generation. Users can drill down into specific data segments, adjust
parameters to observe changes, and interact with the visualization to uncover new insights.
This exploratory capability is invaluable for in-depth data analysis.
1. Bar Charts
o Bar charts display categorical data with rectangular bars representing the values. They are
effective for comparing different categories or showing changes over time. Bar charts can
be oriented vertically or horizontally, depending on the nature of the data and the
intended message.
2. Line Graphs
o Line graphs are used to display continuous data points over a period. They are ideal for
showing trends and fluctuations in data, such as stock prices or temperature changes over
time. Line graphs can help identify patterns, trends, and cycles in time-series data.
3. Histograms
o Histograms are similar to bar charts but are used for continuous data. They show the
frequency distribution of numerical data by dividing the range into intervals (bins) and
counting the number of observations in each bin. Histograms are useful for
understanding the distribution and variability of data.
4. Scatter Plots
o Scatter plots display the relationship between two continuous variables, with points
plotted on an x and y axis. They are useful for identifying correlations, trends, and
outliers. Scatter plots can highlight clusters, gaps, and potential anomalies in the data.
5. Box Plots
o Box plots summarize the distribution of a dataset by showing its median, quartiles, and
potential outliers. They are effective for comparing distributions across categories. Box
plots provide a clear summary of data dispersion and central tendency.
6. Heatmaps
o Heatmaps display data in matrix form, where individual values are represented by colors.
They are useful for visualizing correlations between variables or representing data
density in geographical maps. Heatmaps can quickly highlight areas of high and low
intensity within the data.
7. Pie Charts
o Pie charts represent data as slices of a circle, showing the proportion of each category
relative to the whole. They are best for displaying percentage shares but can be less
effective for comparing multiple values. Pie charts are useful for illustrating part-to-
whole relationships in a dataset.
8. Area Charts
o Area charts display quantitative data visually over time, similar to line graphs, but fill the
area beneath the line. They are useful for emphasizing the magnitude of values over
time. Area charts can show cumulative data trends and highlight the contribution of
individual segments to the whole.
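Outside WEKA, the same chart types can also be produced programmatically; a small illustrative matplotlib sketch (with made-up data) for a bar chart and a histogram:

import matplotlib.pyplot as plt
import numpy as np

# Illustrative data only
categories = ['A', 'B', 'C', 'D']
values = [23, 17, 35, 29]
samples = np.random.normal(loc=50, scale=10, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: comparing categories
ax1.bar(categories, values)
ax1.set_title('Bar chart')

# Histogram: distribution of a continuous variable
ax2.hist(samples, bins=20)
ax2.set_title('Histogram')

plt.tight_layout()
plt.show()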
1. Loading Data
o Import ARFF files into WEKA, which can be done through the Explorer interface. This step
involves selecting the dataset and loading it into the WEKA environment.
2. Data Pre-processing
o Clean and prepare the data using WEKA’s pre-processing tools. This may involve handling
missing values, normalizing data, and selecting relevant attributes for visualization.
3. Selecting Visualization Techniques
o Select the appropriate visualization techniques based on the data and analysis goals. WEKA provides various visualization options, including histograms, scatter plots, and bar charts.
4. Generating Visualizations
o Use WEKA’s visualization tools to generate the chosen visualizations. This involves
configuring the visualization parameters, such as selecting the attributes to be visualized
and setting axis scales.
5. Interpreting Visualizations
o Analyze the visualizations to identify patterns, trends, and insights. Interpret the results
to support decision-making and communicate findings effectively.
6. Exporting Visualizations
o Export the generated visualizations for use in reports, presentations, or further analysis.
WEKA allows users to save visualizations in various formats, making it easy to share and
integrate them into other documents.
Steps:
1. Launch the Weka GUI and click on the "Explorer" button to open the Weka Explorer.
Data similarity measures quantify how alike two data points or vectors are. These measures are fundamental
in various fields, including machine learning, data mining, and pattern recognition. Among the most
commonly used similarity measures are Euclidean distance and Manhattan distance, which are often used to
assess the closeness of points in a multidimensional space.
1. Euclidean Distance
Definition: Euclidean distance is the straight-line distance between two points in Euclidean space. It is
calculated using the Pythagorean theorem and is one of the most commonly used distance measures in
mathematics and machine learning.
Formula: For two points P and Q in an n-dimensional space, where P=(p1,p2,…,pn) and Q=(q1,q2,…,qn), the Euclidean distance d is given by:
d(P,Q) = sqrt((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)
Properties:
Non-negativity: d(P,Q)≥0
Identity: d(P,Q)=0 if and only if P=Q
Symmetry: d(P,Q)=d(Q,P)
Triangle Inequality: d(P,R)≤d(P,Q)+d(Q,R) for any points P,Q,R
Applications:
Used in clustering algorithms like K-means to determine the distance between points and centroids.
Commonly applied in image recognition, pattern matching, and other areas requiring
spatial analysis.
2. Manhattan Distance
Definition: Manhattan distance, also known as the "city block" or "taxicab" distance, measures the distance
between two points by summing the absolute differences of their coordinates. It reflects the total grid
distance one would need to travel in a grid-like path.
Formula: For two points P and Q in an n-dimensional space, the Manhattan distance d is calculated as:
d(P,Q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
Properties:
Non-negativity: d(P,Q)≥0
Identity: d(P,Q)=0 if and only if P=Q
Symmetry: d(P,Q)=d(Q,P)
Triangle Inequality: d(P,R)≤d(P,Q)+d(Q,R)
Applications:
Useful in scenarios where movement is restricted to a grid, such as in geographical data analysis.
Often employed in clustering algorithms and machine learning models where linear
relationships are more meaningful than straight-line distances.
Comparison of Euclidean and Manhattan Distance
1. Geometric Interpretation:
o Euclidean distance measures the shortest path between two points, while
Manhattan distance measures the total path required to travel along axes.
2. Sensitivity to Dimensions:
o Euclidean distance can be sensitive to the scale of data and the number of dimensions, as
it tends to emphasize larger values. In contrast, Manhattan distance treats all dimensions
equally, summing absolute differences.
3. Use Cases:
o Euclidean distance is preferred in applications involving continuous data and
geometric spaces, whereas Manhattan distance is favored in discrete settings, such as
grid-based environments.
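A small worked example, using scipy on two made-up 3-dimensional points, shows how the two measures differ on the same pair of points:

from scipy.spatial import distance

# Two illustrative points in 3-dimensional space
P = [1, 2, 3]
Q = [4, 6, 3]

print("Euclidean distance:", distance.euclidean(P, Q))   # sqrt(3^2 + 4^2 + 0^2) = 5.0
print("Manhattan distance:", distance.cityblock(P, Q))   # |3| + |4| + |0| = 7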
Code:
!pip install liac-arff pandas scipy
import arff
import pandas as pd
from scipy.spatial import distance
from google.colab import files

# Upload and parse the ARFF file into a DataFrame
uploaded = files.upload()
file_name = list(uploaded.keys())[0]
with open(file_name, 'r') as f:
    dataset = arff.load(f)
data = pd.DataFrame(dataset['data'], columns=[attr[0] for attr in dataset['attributes']])

print("Dataset preview:")
print(data.head())

# Keep only numeric attributes and compute the pairwise Manhattan (city-block) distance matrix
numeric_data = data.select_dtypes(include='number')
manhattan_dist_matrix = distance.squareform(distance.pdist(numeric_data.values, metric='cityblock'))
manhattan_dist_df = pd.DataFrame(manhattan_dist_matrix)
print("\nManhattan Distance Matrix:")
print(manhattan_dist_df)
Output:
Experiment – 10
Aim: Perform the Apriori algorithm to mine frequent itemsets.
Theory:
The Apriori algorithm is a classic algorithm used in data mining for mining frequent itemsets and learning
association rules. It was proposed by R. Agrawal and R. Srikant in 1994. The algorithm is
particularly effective in market basket analysis, where the goal is to find sets of items that frequently co-
occur in transactions.
Key Concepts
1. Itemsets:
o An itemset is a collection of one or more items. For example, in a grocery store dataset,
an itemset might include items like {milk, bread}.
2. Frequent Itemsets:
o A frequent itemset is an itemset that appears in the dataset with a frequency greater than
or equal to a specified threshold, called support.
o The support of an itemset X is defined as the proportion of transactions in the dataset
that contain X.
3. Association Rules:
o An association rule is an implication of the form X→Y, indicating that the presence of
itemset X in a transaction implies the presence of itemset Y.
o Rules are evaluated based on two main metrics: support and confidence. The confidence of a rule is the proportion of transactions that contain Y among those that contain X:
Confidence(X→Y) = Support(X ∪ Y) / Support(X)
4. Lift:
o Lift measures the effectiveness of a rule over random chance and is defined as:
Lift(X→Y) = Confidence(X→Y) / Support(Y)
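As one possible implementation (not necessarily the approach used in the lab), frequent itemsets can be mined in Python with the mlxtend library; the transaction list below is made up for illustration:

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import pandas as pd

# Made-up market-basket transactions
transactions = [
    ['bread', 'butter', 'milk'],
    ['bread', 'butter'],
    ['milk', 'bread'],
    ['butter', 'jam'],
    ['bread', 'butter', 'jam'],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine itemsets with support >= 0.4 (i.e., appearing in at least 2 of the 5 transactions)
frequent_itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
print(frequent_itemsets)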
Output:
Experiment – 11
Aim: Develop different clustering algorithms such as K-Means, K-Medoids, Partitioning, and Hierarchical clustering.
Theory:
Clustering is a core technique in data mining and machine learning, used to group similar data points into
clusters based on their features. This unsupervised learning method uncovers patterns and structures within
datasets without relying on predefined labels. Below, we explore four commonly used clustering algorithms:
K-Means, K-Medoids, Partitioning Algorithms, and Hierarchical Clustering.
1. K-Means Clustering
Overview: K-Means partitions a dataset into K distinct, non-overlapping clusters by minimizing the variance
within each cluster and maximizing the variance between clusters. It’s widely used due to its simplicity and
efficiency.
Algorithm Steps:
1. Initialization: Randomly select K initial centroids from the dataset.
2. Assignment: Assign each data point to the nearest centroid based on Euclidean distance, forming K
clusters.
3. Update: Compute new centroids as the mean of all points in each cluster.
4. Convergence: Repeat the assignment and update steps until centroids stabilize or a set number of
iterations is reached.
Strengths:
Simple and easy to implement.
Efficient for large datasets.
Scales well with data size.
Weaknesses:
Requires specifying the number of clusters (K) in advance.
Sensitive to initial centroid placement.
Prone to local minima convergence.
2. K-Medoids Clustering
Overview: Similar to K-Means, but uses actual data points (medoids) as cluster centers, making it more robust
to noise and outliers.
Algorithm Steps:
1. Initialization: Randomly select K medoids from the dataset.
2. Assignment: Assign each data point to the nearest medoid based on a chosen distance metric
(commonly Manhattan distance).
3. Update: For each cluster, select the point with the smallest total distance to all other points in the
cluster as the new medoid.
4. Convergence: Repeat the assignment and update steps until medoids no longer change.
Strengths:
More robust to outliers than K-Means.
Uses actual data points, avoiding issues with mean calculations in some contexts.
Weaknesses:
Computationally more expensive than K-Means.
Still requires specifying the number of clusters (K).
3. Partitioning Algorithms
Overview: These algorithms partition the dataset into K clusters without a hierarchical structure, aiming to minimize intra-cluster variance.
Common Approaches:
K-Means: Uses centroid distances.
CLARA (Clustering LARge Applications): Extends K-Medoids with sampling for scalability.
PAM (Partitioning Around Medoids): Chooses medoids to minimize total distance.
Strengths:
Flexible with different distance metrics.
Efficient for diverse datasets.
Weaknesses:
Requires defining the number of clusters beforehand.
May struggle with clusters of varying shapes or sizes.
4. Hierarchical Clustering
Overview: Builds a hierarchy of clusters either bottom-up (agglomerative) or top-down (divisive), without
needing a pre-defined number of clusters.
Types:
Agglomerative: Starts with individual points as clusters, merging the closest pairs until one cluster
remains or a stopping criterion is met.
Divisive: Starts with one cluster and splits it recursively into smaller clusters.
Strengths:
No need to specify the number of clusters upfront.
Produces a dendrogram, visually representing cluster relationships.
Weaknesses:
Computationally intensive (O(n²) complexity).
Sensitive to noise and outliers, potentially distorting the hierarchy.
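The code fragment below covers K-Medoids (via the pyclustering library); for the hierarchical case, a minimal agglomerative sketch using scipy on synthetic data (illustrative only, not the lab's prescribed dataset) could look like this:

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three natural groups (illustrative only)
X_h, _ = make_blobs(n_samples=30, centers=3, random_state=42)

# Agglomerative (bottom-up) clustering with Ward linkage
Z = linkage(X_h, method='ward')

# Cut the dendrogram into three flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)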
Code:
from pyclustering.cluster.kmedoids import kmedoids

# Generate a smaller sample dataset (you can use a subset of your original data)
sampled_data = X[:50]  # Use the first 50 data points from your dataset
# Initialize the medoid indices (choose random initial medoids from the subset)
initial_medoids = [0, 10, 20]  # Choose medoid indices carefully
# Apply K-Medoids to the smaller dataset
kmedoids_instance = kmedoids(sampled_data, initial_medoids, data_type='points')
kmedoids_instance.process()
clusters = kmedoids_instance.get_clusters()
medoids = kmedoids_instance.get_medoids()
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler