
Data Warehousing and Data Mining Lab

Faculty Name: Ms. Jyoti Kaushik
Student Name: Khushi Kamra


Roll No. : 35596402721
Semester: 7th
Group: MLDA 1 B

Maharaja Agrasen Institute of Technology, PSP Area, Sector - 22,


New Delhi – 110085
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY
COMPUTER SCIENCE & ENGINEERING DEPARTMENT

VISION
“To be centre of excellence in education, research and technology transfer in the field
of computer engineering and promote entrepreneurship and ethical values.”

MISSION
“To foster an open, multidisciplinary and highly collaborative research environment to
produce world-class engineers capable of providing innovative solutions to real life
problems and fulfill societal needs.”
Department of Computer Science and Engineering
Rubrics for Lab Assessment

Rating scale: 0 (Missing), 1 (Inadequate), 2 (Needs Improvement), 3 (Adequate)

R1: Is able to identify the problem to be solved and define the objectives of the experiment.
0 (Missing): No mention is made of the problem to be solved.
1 (Inadequate): An attempt is made to identify the problem to be solved but it is described in a confusing manner, objectives are not relevant, objectives contain technical/conceptual errors, or objectives are not measurable.
2 (Needs Improvement): The problem to be solved is described but there are minor omissions or vague details. Objectives are conceptually correct and measurable but may be incomplete in scope or have linguistic errors.
3 (Adequate): The problem to be solved is clearly stated. Objectives are complete, specific, concise, and measurable. They are written using correct technical terminology and are free from linguistic errors.

R2: Is able to design a reliable experiment that solves the problem.
0 (Missing): The experiment does not solve the problem.
1 (Inadequate): The experiment attempts to solve the problem but due to the nature of the design the data will not lead to a reliable solution.
2 (Needs Improvement): The experiment attempts to solve the problem but due to the nature of the design there is a moderate chance the data will not lead to a reliable solution.
3 (Adequate): The experiment solves the problem and has a high likelihood of producing data that will lead to a reliable solution.

R3: Is able to communicate the details of an experimental procedure clearly and completely.
0 (Missing): Diagrams are missing and/or experimental procedure is missing or extremely vague.
1 (Inadequate): Diagrams are present but unclear and/or experimental procedure is present but important details are missing.
2 (Needs Improvement): Diagrams and/or experimental procedure are present but with minor omissions or vague details.
3 (Adequate): Diagrams and/or experimental procedure are clear and complete.

R4: Is able to record and represent data in a meaningful way.
0 (Missing): Data are either absent or incomprehensible.
1 (Inadequate): Some important data are absent or incomprehensible.
2 (Needs Improvement): All important data are present, but recorded in a way that requires some effort to comprehend.
3 (Adequate): All important data are present, organized and recorded clearly.

R5: Is able to make a judgment about the results of the experiment.
0 (Missing): No discussion is presented about the results of the experiment.
1 (Inadequate): A judgment is made about the results, but it is not reasonable or coherent.
2 (Needs Improvement): An acceptable judgment is made about the result, but the reasoning is flawed or incomplete.
3 (Adequate): An acceptable judgment is made about the result, with clear reasoning. The effects of assumptions and experimental uncertainties are considered.
PRACTICAL RECORD

PAPER CODE : CIE-425P

Name of the student : Khushi Kamra

University Roll No. : 35596402721

Branch : CSE-2

Section/ Group : 7 MLDA 1 B

Columns: Exp. No. | Experiment Name | Date of performance | Date of checking | R1 (3) | R2 (3) | R3 (3) | R4 (3) | R5 (3) | Total Marks (15) | Signature

1. Study of ETL process and its tools.
2. Program of Data warehouse cleansing to input names from users (inconsistent) and format them.
3. Program of Data warehouse cleansing to remove redundancy in data.
4. Introduction to WEKA tool.
5. Implementation of Classification technique on ARFF files using WEKA.
6. Implementation of Clustering technique on ARFF files using WEKA.
7. Implementation of Association Rule technique on ARFF files using WEKA.
8. Implementation of Visualization technique on ARFF files using WEKA.
9. Perform Data Similarity Measure (Euclidean, Manhattan Distance).
10. Perform Apriori algorithm to mine frequent item-sets.
11. Develop different clustering algorithms like K-Means, K-Medoids Algorithm, Partitioning Algorithm and Hierarchical.
12. Apply Validity Measures to evaluate the quality of Data.
Experiment – 1
Aim: Study of ETL process and its tools.

Theory:
The ETL (Extract, Transform, Load) process is vital for data warehousing and integration, facilitating the
collection of data from multiple sources, transforming it into a structured format, and loading it into a central
repository for analysis, reporting, and decision-making. Here's a breakdown of each stage in the ETL process:
1. Extract
o Purpose: This step involves extracting data from various sources such as transactional
databases, spreadsheets, legacy systems, or cloud services.
o Process: Extraction methods vary depending on the data source. For databases, it might involve
SQL queries; for web sources, APIs are used; and for unstructured data, techniques like web
scraping are employed.
o Challenges: Dealing with multiple data sources often involves different formats and structures,
requiring careful handling to maintain accuracy and completeness.
2. Transform
o Purpose: Here, the extracted data is processed and converted into a usable format that meets
the requirements of the target data warehouse.
o Process: This involves data cleaning (removing duplicates, handling missing values), data
mapping, aggregation, and standardizing formats (like dates or currency).
o Techniques: Common transformations include filtering data, merging datasets, calculating new
metrics, applying business rules, and reformatting data types.
o Challenges: Ensuring data integrity and consistency requires a clear understanding of business
rules and data relationships.
3. Load
o Purpose: The final step is to load the transformed data into a target system, typically a data
warehouse or data mart, where it can be accessed for analysis and reporting.
o Process: Data loading can be done in two ways:
 Full Load: Loading all data at once, typically during initial setup or major data refreshes.
 Incremental Load: Loading only new or updated data to keep the dataset current
without redundancy.
o Challenges: Ensuring efficient, reliable data loads that meet performance requirements,
particularly in large-scale environments with high data volumes.
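To make the three stages concrete, here is a minimal Python sketch of an ETL step built with pandas and SQLite. The file name sales.csv, the column names (order_date, amount, quantity), and the warehouse.db target are hypothetical, and the sketch is only an outline of the idea rather than a production pipeline.

import sqlite3
import pandas as pd

# --- Extract: read raw records from a source file (hypothetical path) ---
raw = pd.read_csv("sales.csv")

# --- Transform: clean and standardize the extracted data ---
transformed = raw.drop_duplicates().copy()                                 # remove duplicate rows
transformed["order_date"] = pd.to_datetime(transformed["order_date"])     # standardize dates
transformed["amount"] = transformed["amount"].fillna(0.0)                 # handle missing values
transformed["revenue"] = transformed["amount"] * transformed["quantity"]  # derive a new metric

# --- Load: write the transformed data into a target table (full load) ---
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("fact_sales", conn, if_exists="replace", index=False)

An incremental load could be sketched by filtering the extracted rows to those newer than the last loaded order_date and writing with if_exists="append" instead of "replace".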
Common ETL Tools
Several ETL tools are widely used in the industry, each offering unique features and capabilities:
1. Informatica PowerCenter: Known for its robust data integration capabilities, data quality assurance,
and support for various data formats.
2. Apache NiFi: An open-source tool focused on automating data flows, with an emphasis on security and
scalability.
3. Microsoft SQL Server Integration Services (SSIS): A popular ETL tool provided by Microsoft, designed
for data migration, integration, and transformation tasks within the SQL Server environment.
4. Talend: An open-source ETL tool offering extensive data integration features suitable for data cleansing,
transformation, and loading in both cloud and on-premises environments.
5. Pentaho Data Integration (PDI): Part of the Pentaho suite, offering user-friendly data integration and
transformation features, supporting big data and analytics.
Importance of ETL in Data Warehousing
The ETL process is crucial for building reliable and efficient
data warehouses. It ensures data from diverse sources is unified, cleaned, and standardized, providing a
single source of truth for analytics and reporting. This consistency is vital for businesses to make
accurate, data-driven decisions, ensuring all departments rely on the same high-quality data.
Additionally, the ETL process supports data governance and compliance, as transformations can enforce
data quality standards and regulatory requirements, which is essential for industries with strict
compliance regulations.

Modern ETL tools often support real-time data integration, enabling organizations to access and analyze
up-to-date information. This capability is particularly valuable for businesses that rely on timely data for
critical operations, such as financial services, healthcare, and e-commerce.

ETL processes play a crucial role in improving data quality by incorporating data cleansing and validation
steps. These processes identify and rectify inconsistencies, inaccuracies, and duplicates, ensuring that
the data used for analysis and reporting is accurate and reliable. This leads to more credible insights
and better-informed decision-making.

ETL processes also facilitate the management of historical data by archiving and maintaining data from
various time periods. This historical data can be invaluable for trend analysis, forecasting, and long-term
strategic planning, providing a comprehensive view of the organization's performance over time.

In summary, the ETL process is not just a technical necessity but a strategic asset for organizations
looking to harness the full potential of their data. It ensures that data is accessible, accurate, and
actionable, forming the backbone of effective data-driven decision-making. With the right ETL tools and
processes in place, businesses can achieve higher efficiency, improved data quality, and greater
flexibility, positioning themselves for success in an increasingly data-centric world.
Experiment – 2
Aim: Program of Data warehouse cleansing to input names from users (inconsistent) and format them.
Theory:
Data cleansing, also known as data cleaning or data scrubbing, is a crucial step in preparing data for analysis
and storage in a data warehouse. It involves identifying, correcting, or removing inaccuracies, inconsistencies,
and errors from datasets to ensure high data quality. Clean data is essential in a data warehouse because it
directly impacts the accuracy and reliability of insights derived from the data. Errors in data can lead to
misleading conclusions, faulty analysis, and suboptimal decision-making.
Importance of Data Cleansing
1. Enhances Data Quality: Clean data is accurate, consistent, and complete. Quality data provides reliable
insights and reduces the likelihood of errors in analytical outcomes.
2. Improves Decision Making: Accurate and consistent data lead to more effective decisions, driving
better business outcomes.
3. Boosts Efficiency: Data cleansing streamlines data processing by reducing redundancy and
standardizing data formats, making the data easier to analyze and interpret.
4. Maintains Consistency: Standardizing data formats across sources ensures uniformity, enabling
seamless integration in a data warehouse.
5. Ensures Compliance: Many industries require data accuracy for regulatory compliance. Data cleansing
ensures that data meets industry standards.
Common Issues in Data Cleansing
Inconsistent data can arise from various sources, including human error, different data formats, or legacy
systems. Some typical issues include:
1. Inconsistent Formatting: Variations in capitalization, spacing, or punctuation (e.g., "john doe" vs. "John
Doe").
2. Duplicate Entries: Repeated records for the same entity, leading to skewed analysis.
3. Incomplete Data: Missing information in one or more fields.
4. Incorrect Data: Values that do not match expected patterns or contain obvious errors (e.g., incorrect
phone numbers or email formats).
Data Cleansing Process
The data cleansing process typically includes these steps:
1. Data Profiling: Analyzing the data to understand its structure, content, and patterns. This helps identify
specific areas that require cleansing.
2. Data Standardization: Applying uniform formats to data, such as consistent capitalization, removing
special characters, or using standardized date formats.
3. Data Validation: Checking data against predefined rules or patterns to identify outliers or inaccuracies.
4. Data Enrichment: Filling missing information or correcting data using external reference data.
5. Data Deduplication: Identifying and removing duplicate records to avoid redundancy.
Tools for Data Cleansing
Many data integration and ETL tools offer data cleansing functionalities. Here are some widely used tools:
 Informatica Data Quality: Provides data profiling, cleansing, and standardization features, popular for
enterprise-level data cleansing.
 Trifacta: Known for its user-friendly interface, offering data profiling, transformation, and visualization
for cleansing workflows.
 OpenRefine: An open-source tool that allows users to clean and transform data in bulk, with features
for clustering similar values and removing duplicates.
 Python: Libraries like pandas offer robust data manipulation functions, enabling custom data cleansing
scripts tailored to specific needs.
Code:
def cleanse_name(name):
    # Remove leading and trailing spaces
    cleaned_name = name.strip()
    # Replace multiple spaces with a single space
    cleaned_name = " ".join(cleaned_name.split())
    # Capitalize each word in the name
    cleaned_name = cleaned_name.title()
    return cleaned_name

# Input: List of names with inconsistent formatting
names = [
    "  john doe  ",      # Extra spaces and lowercase
    "MARy AnnE",         # Mixed case
    "  alice JOHNSON ",  # Extra spaces
    "pETER o'CONNOR"     # Mixed case and special character
]

# Apply cleansing function to each name
cleaned_names = [cleanse_name(name) for name in names]

# Display the cleaned names
print("Cleaned Names:")
for original, cleaned in zip(names, cleaned_names):
    print(f"Original: '{original}' -> Cleaned: '{cleaned}'")
Output:
Experiment – 3
Aim: Program of Data warehouse cleansing to remove redundancy in data.
Theory:

Redundancy in data warehousing occurs when duplicate records or entries are present, leading to inaccurate
analysis, increased storage costs, and performance inefficiencies. Redundant data can emerge from
integrating data from multiple sources, manual data entry errors, or other inconsistencies. Removing
redundancy is essential for maintaining data quality, improving efficiency in data processing, and ensuring
accurate analysis.
Common Techniques for Removing Redundancy
1. Deduplication: Identifying and removing duplicate records based on specific columns or combinations
of columns.
2. Primary Key Constraints: Ensuring unique identifiers (such as IDs) in a database to prevent duplicate
entries.
3. Data Merging and Consolidation: Aggregating or merging data from multiple sources and applying
deduplication rules.
4. Standardization: Normalizing data fields (such as name or address) to consistent formats, which helps
in identifying duplicates.
Benefits of Removing Redundancy
 Improved Data Accuracy: By eliminating duplicate entries, the data becomes more accurate, leading to
more reliable insights and decisions.
 Cost Efficiency: Reducing redundant data decreases storage requirements and associated costs.
 Enhanced Performance: Streamlined data sets improve database performance and reduce the time
needed for data processing tasks.
 Better Data Quality: Ensuring that only unique and accurate data is stored enhances the overall quality
of the data, making it more useful for analysis.
Challenges in Removing Redundancy
 Complexity in Identification: Identifying duplicate records, especially in large datasets with complex
structures, can be challenging.
 Maintaining Data Integrity: Ensuring that deduplication processes do not inadvertently remove valid
records or introduce errors.
 Consistency Across Systems: Achieving consistency in data formats and standards across different
sources and systems requires careful planning and execution.
Tools for Removing Redundancy
Several tools are widely used for data deduplication and cleansing, including:
 Informatica Data Quality: Offers comprehensive deduplication and data cleansing features, suitable for
large-scale enterprise environments.
 Trifacta: Provides user-friendly interfaces for data profiling and transformation, aiding in the
identification and removal of duplicates.
 OpenRefine: An open-source tool that supports bulk data cleaning, deduplication, and transformation
with powerful clustering algorithms.
Code:

import pandas as pd

# Sample dataset with duplicate entries
data = {
    'CustomerID': [101, 102, 103, 104, 101, 102],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Bob'],
    'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com',
              'david@example.com', 'alice@example.com', 'bob@example.com'],
    'PurchaseAmount': [250, 150, 300, 200, 250, 150]
}

# Load data into a DataFrame
df = pd.DataFrame(data)

# Display the original data with duplicates
print("Original Data:")
print(df)

# Remove duplicates based on all columns
df_cleaned = df.drop_duplicates()

# Alternatively, remove duplicates based on specific columns, e.g., 'CustomerID' and 'Name'
# df_cleaned = df.drop_duplicates(subset=['CustomerID', 'Name'])

# Display the cleaned data
print("\nCleaned Data (Duplicates Removed):")
print(df_cleaned)
Output:
Experiment – 4
Aim: Introduction to WEKA tool.
Theory:
WEKA (Waikato Environment for Knowledge Analysis) is a renowned open-source software suite for machine
learning and data mining, developed by the University of Waikato in New Zealand. Written in Java, it offers a
collection of tools for data pre-processing, classification, regression, clustering, association rules, and
visualization, making it ideal for educational and research purposes in data mining and machine learning.
Key Features of WEKA
1. User-Friendly Interface: WEKA provides a graphical user interface (GUI) that allows users to experiment
with different machine learning algorithms easily, without requiring extensive programming
knowledge. This interface includes panels such as Explorer, Experimenter, Knowledge Flow, and
Simple CLI (Command Line Interface).
2. Extensive Collection of Algorithms: WEKA includes a wide variety of machine learning algorithms,
including decision trees, support vector machines, neural networks, and Naive Bayes. These
algorithms can be applied to tasks such as classification, regression, and clustering, making WEKA a
versatile tool for various types of data analysis.
3. Data Pre-processing Tools: WEKA supports several data pre-processing techniques like normalization,
attribute selection, data discretization, and handling missing values. These capabilities help prepare
raw data for analysis, ensuring reliable and accurate model training.
4. File Format Compatibility: WEKA primarily works with ARFF (Attribute-Relation File Format) files, a text
format developed for WEKA’s datasets. It also supports CSV and other data formats, offering flexibility
for data import.
5. Visualization: WEKA includes data visualization tools that enable users to explore datasets graphically.
Users can view scatter plots, bar charts, and other visualizations, aiding in understanding data
distribution, patterns, and relationships between attributes.
6. Extendibility: As an open-source tool, users can add new algorithms or modify existing ones to meet
specific needs. Its integration with Java also enables users to incorporate WEKA into larger Java-based
applications.
Components of WEKA
1. Explorer: The primary GUI in WEKA, which provides a comprehensive environment for loading datasets,
pre-processing data, and applying machine learning algorithms. Users can evaluate model
performance easily using this component.
2. Experimenter: Allows users to perform controlled experiments to compare different algorithms or
configurations. This component is ideal for analyzing and optimizing model performance across
various datasets.
3. Knowledge Flow: Offers a data flow-oriented interface, similar to visual workflow tools, enabling users
to create complex machine learning pipelines visually. It is beneficial for designing, testing, and
implementing custom workflows.
4. Simple CLI: A command-line interface for advanced users to interact with WEKA’s functionalities using
commands. This component allows users to bypass the GUI for faster, script-based operations.
Common Applications of WEKA
 Educational Use: WEKA is widely used for teaching machine learning concepts due to its easy-to-
understand interface and a rich set of algorithms.
 Research: Researchers utilize WEKA to develop and test new machine learning models or compare the
performance of different algorithms.
 Real-World Data Mining: WEKA’s tools for classification, clustering, and association rule mining make it
applicable in real-world tasks, including medical diagnosis, market basket analysis, text classification,
and bioinformatics.
Advantages of WEKA
 Ease of Use: The GUI and pre-packaged algorithms make it accessible even to beginners in machine
learning and data mining.
 Wide Algorithm Support: A broad range of machine learning techniques are included, suitable for
diverse tasks and applications.
 Open-Source: Freely usable, modifiable, and integratable into other projects.
 Comprehensive Toolset: Includes tools for classification, regression, clustering, association rule mining,
and data visualization.
 Educational Value: Ideal for teaching due to its intuitive interface and rich set of algorithms.
 Community Support: A large user community provides support, tutorials, and additional resources.
 Extendibility: Users can add new algorithms or modify existing ones to suit their needs, thanks to its
open-source nature and Java integration.
 Cross-Platform Compatibility: Being Java-based, WEKA runs on any platform that supports Java,
ensuring wide accessibility.
Limitations of WEKA
 Scalability: Primarily designed for small to medium-sized datasets. Performance issues can arise with
big data applications, where tools like Apache Spark may be more suitable.
 Real-Time Processing: WEKA is geared towards batch processing and isn't ideal for real-time or
streaming data applications.
 Memory Consumption: Can be memory-intensive, which might be a limitation for larger datasets or
limited-resource environments.
 Learning Curve: While it's user-friendly, mastering all its features and understanding how to effectively
apply its wide range of algorithms can take time.
 Limited Data Visualization: Visualization tools, while useful, may not be as advanced or versatile as
specialized data visualization software.
 Single Machine Processing: Most WEKA operations run on a single machine, which limits scalability and
parallel processing capabilities.
 Performance Bottlenecks: For very large datasets, performance bottlenecks may occur, making it less
efficient compared to more scalable big data solutions.
 Compatibility with Modern Data Formats: While flexible, its primary file format, ARFF, may not be as
widely supported or convenient as formats used in newer tools.
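Although WEKA is normally driven through its GUI, an ARFF file can also be inspected programmatically. Below is a minimal sketch assuming the third-party liac-arff package (the same library used in Experiment 9) and a hypothetical file named weather.arff.

import arff          # provided by the liac-arff package
import pandas as pd

# Load a (hypothetical) ARFF file
with open("weather.arff", "r") as f:
    dataset = arff.load(f)

# 'attributes' holds (name, type) pairs; 'data' holds the instances
columns = [attr[0] for attr in dataset["attributes"]]
df = pd.DataFrame(dataset["data"], columns=columns)

print(df.head())     # preview the first few instances
print(df.dtypes)     # inspect how pandas interprets each attribute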
Experiment – 5
Aim: Implementation of Classification technique on ARFF files using WEKA.
Theory:

Classification is a supervised machine learning technique used to categorize data points into predefined
classes or labels. By training on a labelled dataset, a classification algorithm learns patterns in the data to
predict the class labels of new, unseen instances. This technique is crucial in applications such as spam
detection, medical diagnosis, sentiment analysis, and image recognition.

Key Concepts in Classification

1. Supervised Learning

o In supervised learning, models are trained on a dataset with known labels. Each instance in the
training set includes features (input variables) and a target class label (output variable), allowing
the model to learn from examples.

2. Types of Classification

o Binary Classification: Involves two classes, such as "spam" vs. "not spam."

o Multiclass Classification: Involves more than two classes, like classifying images as "cat," "dog,"
or "bird."

o Multilabel Classification: Instances may belong to multiple classes simultaneously, such as a


news article categorized under "politics" and "finance."

3. Common Classification Algorithms

o Decision Trees: Utilize a tree-like model to make decisions based on attribute values, with nodes
representing features and branches indicating decision outcomes. Examples include CART and
C4.5 algorithms.

o Naive Bayes: A probabilistic classifier based on Bayes' theorem, assuming feature independence
within each class. It’s fast and effective for text classification tasks.

o k-Nearest Neighbors (k-NN): A non-parametric method that classifies a point based on the
majority label of its closest k neighbors in the feature space.

o Support Vector Machines (SVM): Finds a hyperplane that maximally separates classes in a high-
dimensional space, often effective for complex and high-dimensional data.

o Neural Networks: Complex models inspired by the human brain, effective for large datasets and
capable of learning intricate patterns through multiple hidden layers.

4. Training and Testing

o Training Set: The portion of data used to train the classification model.

o Testing Set: A separate portion of data used to evaluate model performance. Typically, data is
split into 70-80% for training and 20-30% for testing.

5. Evaluation Metrics

o Accuracy: The proportion of correctly classified instances out of all instances.

o Precision: The fraction of true positive predictions out of all positive predictions made (useful
when false positives are costly).

o Recall (Sensitivity): The fraction of true positive predictions out of all actual positives (useful
when false negatives are costly).

o F1 Score: The harmonic mean of precision and recall, balancing both metrics.

o Confusion Matrix: A matrix summarizing correct and incorrect predictions for each class,
providing insights into specific errors.

Steps in Classification using WEKA

1. Loading Data: Import ARFF files into WEKA.

2. Pre-processing: Clean and prepare data using WEKA’s pre-processing tools.

3. Algorithm Selection: Choose a classification algorithm based on your data and analysis needs.

4. Model Training: Train the model using the training set.

5. Model Testing: Test the model using the testing set and evaluate its performance with metrics like
accuracy, precision, recall, and F1 score.

6. Visualization: Use WEKA’s visualization tools to explore and interpret results.


Steps:
1. Launch the Weka GUI and click on the "Explorer" button to open the Weka Explorer.

2. Click on the "Open file" button to load your dataset.


3. Select the dataset and click open.

4. Click on “Classify” and choose the "IBK" algorithm from the "lazy" section of the classifiers.
5. Configure the model by clicking on the “IBK” classifier.

6. Click on “Start” and WEKA will provide K-NN summary.


7. We can visualize the classifier's results in the Visualize tab.
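For comparison with the WEKA GUI workflow above, a roughly equivalent k-NN classification can be sketched in Python with scikit-learn. This is an assumed stand-in, not WEKA's IBk implementation, and the built-in Iris dataset substitutes for an arbitrary ARFF file.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load a small labelled dataset (stand-in for an ARFF file)
X, y = load_iris(return_X_y=True)

# Split into roughly 70% training and 30% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a k-NN classifier with k = 3 (analogous to IBK with KNN = 3)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class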
Experiment – 6
Aim: Implementation of Clustering technique on ARFF files using WEKA.
Theory:

Clustering is an unsupervised machine learning technique used to group similar data points into clusters.
The goal is to organize a dataset into distinct groups where data points in the same group (or cluster) are
more similar to each other than to those in other clusters. Unlike classification, clustering does not rely on
labelled data; instead, it explores inherent structures within the data itself.

Key Concepts in Clustering

1. Unsupervised Learning

o Clustering is unsupervised, meaning there is no predefined target variable or label. The


algorithm identifies patterns and structures based on feature similarities without external
guidance.

2. Types of Clustering Algorithms

o Clustering methods vary in how they define and identify clusters:

o k-Means Clustering:

 Partitions data into k clusters by minimizing the variance within each cluster.

 Starts by selecting k initial centroids (center points), assigns each data point to the
nearest centroid, and iteratively updates centroids based on cluster members.

o Hierarchical Clustering:

 Builds a tree-like structure of nested clusters (dendrogram) through either:

 Agglomerative (bottom-up): Each data point starts as its own cluster, which
gradually merges into larger clusters.

 Divisive (top-down): Starts with all data points in a single cluster and splits
them into smaller clusters.

 Often visualized through dendrograms, making it easy to see how clusters split or
merge at each level.

o Density-Based Clustering (DBSCAN):

 Groups points that are close to each other (dense regions) while marking points in
low-density regions as outliers.

 Effective for identifying arbitrarily shaped clusters and handling noise, unlike k-
means, which assumes spherical clusters.

o Gaussian Mixture Models (GMM):

 Assumes data is generated from multiple Gaussian distributions and uses


probabilistic methods to assign data points to clusters.

 Flexible and allows for overlapping clusters, making it useful for complex
distributions.
3. Distance Measures

o Clustering relies on measuring similarity between points, often using:

 Euclidean Distance: Common for k-means, measures straight-line distance.

 Manhattan Distance: Sum of absolute differences, useful in high-dimensional data.

 Cosine Similarity: For text or sparse data, measuring the angle between vectors
rather than direct distance.

Steps in Clustering using WEKA

1. Loading Data: Import ARFF files into WEKA.

2. Pre-processing: Clean and prepare data using WEKA’s pre-processing tools.

3. Algorithm Selection: Choose a clustering algorithm based on your data and analysis needs.

4. Model Training: Train the clustering model on your dataset.

5. Model Evaluation: Evaluate the model's performance using metrics like cluster purity, silhouette
score, or other relevant measures.

6. Visualization: Use WEKA’s visualization tools to explore and interpret clustering results.
Steps:
1. Launch the Weka GUI and click on the "Explorer" button to open the Weka Explorer.

2. Click on the "Open file" button to load your dataset.


3. Select the dataset and click open.

4. Click on “Cluster” and choose the "SimpleKMeans" algorithm from the "clusterers".
5. Configure the model by clicking on “SimpleKMeans”.
6. Click on “Start” and WEKA will provide K-Means summary.
7. We can visualize the resulting cluster assignments in the Visualize tab.
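As a rough Python counterpart to the SimpleKMeans run above, the sketch below clusters a small synthetic dataset with scikit-learn's KMeans and reports a silhouette score. The synthetic blobs and the choice of k = 3 are assumptions for illustration only, not WEKA's own implementation.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate synthetic data (stand-in for an ARFF dataset)
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=1)

# Fit k-means with k = 3 clusters
km = KMeans(n_clusters=3, n_init=10, random_state=1)
labels = km.fit_predict(X)

# Inspect centroids and evaluate cluster cohesion/separation
print("Centroids:\n", km.cluster_centers_)
print("Silhouette score:", silhouette_score(X, labels))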
Experiment – 7
Aim: Implementation of Association Rule technique on ARFF files using WEKA.
Theory:

Association rule mining is a data mining technique used to discover interesting relationships, patterns, or
associations among items in large datasets. It is widely used in fields like market basket
analysis, where the goal is to understand how items co-occur in transactions. For instance, an association
rule might identify that "Customers who buy bread also tend to buy butter." Such insights can support
product placements, recommendations, and targeted marketing.
Key Concepts in Association Rule Mining
1. Association Rules:
o An association rule is typically in the form of X→Y, which means "If itemset X is present in
a transaction, then itemset Y is likely to be present as well."
o Each association rule is evaluated based on three main metrics:
 Support: The frequency with which an itemset appears in the dataset. Support for X→Y is calculated as:
Support(X→Y) = (Number of transactions containing both X and Y) / (Total number of transactions)
Higher support indicates that the rule is relevant for a larger portion of the dataset.
 Confidence: The likelihood of seeing Y in a transaction that already contains X. Confidence for X→Y is calculated as:
Confidence(X→Y) = Support(X ∪ Y) / Support(X)
Higher confidence means the rule is more reliable.
 Lift: Indicates the strength of an association by comparing the confidence of a rule with the expected confidence if X and Y were independent. Lift is calculated as:
Lift(X→Y) = Confidence(X→Y) / Support(Y)
A lift value greater than 1 indicates a positive association between X and Y.


2. Applications:
o Market Basket Analysis: Finds products frequently purchased together, guiding
product placement and promotions.
o Recommendation Systems: Suggests items based on co-occurrence with
previously purchased items, useful in e-commerce.
o Medical Diagnosis: Identifies symptom or condition associations that commonly
occur together.
o Fraud Detection: Detects patterns in fraudulent transactions by examining recurring combinations of transaction attributes.
3. Apriori Algorithm:
o The Apriori algorithm is a popular method for generating frequent itemsets and
association rules.
o Frequent Itemset Generation: Apriori starts by identifying single items that meet the
minimum support threshold, then expands to larger itemsets, adding items iteratively
while keeping only those itemsets that meet the support threshold.
o Rule Generation: For each frequent itemset, rules are created by dividing the itemset
into antecedents (left-hand side) and consequents (right-hand side) and calculating the
confidence for each possible rule.
4. Challenges:
o Large Dataset Size: Large datasets can generate a massive number of itemsets,
making computation and rule filtering challenging.
o Choosing Thresholds: Setting appropriate support and confidence thresholds is crucial, as
too high a threshold may yield too few rules, while too low may produce many
insignificant rules.
o Interpretability: Association rules must be interpretable to provide value,
requiring meaningful thresholds and metrics.

Steps:
1. Launch the Weka GUI and click on the "Explorer" button to open the Weka Explorer.
2. Click on the "Open file" button to load your dataset.

3. Select the dataset and click open.


4. Click on “Associate” and choose the "Apriori" algorithm from the "associations" section.

5. Configure the model by clicking on “Apriori”.


6. Click on “Start” and WEKA will provide Apriori summary.
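To make the three metrics concrete, the short sketch below computes support, confidence, and lift for one hypothetical rule {bread} → {butter} over a toy list of transactions; the items and counts are invented purely for illustration.

# Toy transactions (hypothetical market-basket data)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "beer"},
    {"milk", "butter"},
    {"bread", "butter", "milk"},
]
n = len(transactions)

# Rule under evaluation: X -> Y with X = {bread}, Y = {butter}
X, Y = {"bread"}, {"butter"}

support_X  = sum(X <= t for t in transactions) / n          # fraction of transactions containing X
support_Y  = sum(Y <= t for t in transactions) / n          # fraction containing Y
support_XY = sum((X | Y) <= t for t in transactions) / n    # fraction containing both X and Y

confidence = support_XY / support_X
lift = confidence / support_Y

print(f"support={support_XY:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")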
Experiment – 8
Aim: Implementation of Visualization technique on ARFF files using WEKA.
Theory:

Data visualization is the graphical representation of information and data. By employing visual elements
like charts, graphs, and maps, data visualization tools provide an accessible way to comprehend and
interpret trends, outliers, and patterns within data. This technique is crucial in data analysis, as it
transforms complex datasets into a more understandable and actionable format. Effective data
visualization not only enhances comprehension but also facilitates communication, exploration, and
decision-making processes.

Importance of Data Visualization

1. Enhances Understanding

o Visualization simplifies complex data by converting it into a visual format that is easier to
interpret. This is particularly important for large datasets with numerous variables. By
visualizing data, users can quickly grasp the underlying patterns and relationships that might
be difficult to detect in raw data form.

2. Reveals Insights

o Visualizations can uncover hidden insights, correlations, and trends that may not be
apparent in raw data. For instance, scatter plots can reveal relationships between two
variables, while heatmaps can display patterns across multiple dimensions. This ability to
surface insights makes data visualization a powerful tool in data exploration and analysis.

3. Facilitates Communication

o Effective visualizations convey information clearly and efficiently to stakeholders, enabling


better decision-making. Visual aids enhance presentations and reports, making data more
engaging and comprehensible. This is essential for communicating findings to non-technical
audiences or stakeholders who rely on data-driven insights.

4. Supports Exploration

o Interactive visualizations allow users to explore data from different angles, facilitating
discovery and hypothesis generation. Users can drill down into specific data segments, adjust
parameters to observe changes, and interact with the visualization to uncover new insights.
This exploratory capability is invaluable for in-depth data analysis.

Common Visualization Techniques

1. Bar Charts

o Bar charts display categorical data with rectangular bars representing the values. They are
effective for comparing different categories or showing changes over time. Bar charts can
be oriented vertically or horizontally, depending on the nature of the data and the
intended message.

2. Line Graphs

o Line graphs are used to display continuous data points over a period. They are ideal for
showing trends and fluctuations in data, such as stock prices or temperature changes over
time. Line graphs can help identify patterns, trends, and cycles in time-series data.
3. Histograms

o Histograms are similar to bar charts but are used for continuous data. They show the
frequency distribution of numerical data by dividing the range into intervals (bins) and
counting the number of observations in each bin. Histograms are useful for
understanding the distribution and variability of data.

4. Scatter Plots

o Scatter plots display the relationship between two continuous variables, with points
plotted on an x and y axis. They are useful for identifying correlations, trends, and
outliers. Scatter plots can highlight clusters, gaps, and potential anomalies in the data.

5. Box Plots

o Box plots summarize the distribution of a dataset by showing its median, quartiles, and
potential outliers. They are effective for comparing distributions across categories. Box
plots provide a clear summary of data dispersion and central tendency.

6. Heatmaps

o Heatmaps display data in matrix form, where individual values are represented by colors.
They are useful for visualizing correlations between variables or representing data
density in geographical maps. Heatmaps can quickly highlight areas of high and low
intensity within the data.

7. Pie Charts

o Pie charts represent data as slices of a circle, showing the proportion of each category
relative to the whole. They are best for displaying percentage shares but can be less
effective for comparing multiple values. Pie charts are useful for illustrating part-to-
whole relationships in a dataset.

8. Area Charts

o Area charts display quantitative data visually over time, similar to line graphs, but fill the
area beneath the line. They are useful for emphasizing the magnitude of values over
time. Area charts can show cumulative data trends and highlight the contribution of
individual segments to the whole.

Steps for Implementing Visualization Techniques using WEKA

1. Loading Data

o Import ARFF files into WEKA, which can be done through the Explorer interface. This step
involves selecting the dataset and loading it into the WEKA environment.

2. Data Pre-processing

o Clean and prepare the data using WEKA’s pre-processing tools. This may involve handling
missing values, normalizing data, and selecting relevant attributes for visualization.

3. Choosing Visualization Techniques

o Select the appropriate visualization techniques based on the data and analysis goals.
WEKA provides various visualization options, including histograms, scatter plots, and bar
charts.
4. Generating Visualizations

o Use WEKA’s visualization tools to generate the chosen visualizations. This involves
configuring the visualization parameters, such as selecting the attributes to be visualized
and setting axis scales.

5. Interpreting Visualizations

o Analyze the visualizations to identify patterns, trends, and insights. Interpret the results
to support decision-making and communicate findings effectively.

6. Exporting Visualizations

o Export the generated visualizations for use in reports, presentations, or further analysis.
WEKA allows users to save visualizations in various formats, making it easy to share and
integrate them into other documents.

Steps:

1. Launch the Weka GUI and click on the "Explorer" button to open the Weka Explorer.

2. Click on the "Open file" button to load your dataset.


3. Select the dataset and click open.
4. Click on “Visualize all” on the bottom right side for viewing all the attributes present in the dataset.

5. We can view the distribution of each attribute in the Visualize tab.


6. We can also visualize the scatter plot between any 2 attributes from the dataset.
7. We can also select a particular section of the scatter plot to visualize separately by clicking
on “Select Instance”.

8. Select the area for visualization.


9. Click on “Submit”.
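Outside WEKA, the same visualization ideas can be reproduced in Python with matplotlib. The sketch below draws a histogram of one attribute and a scatter plot of two attributes for the Iris dataset, which is assumed here only as a convenient stand-in for an ARFF file.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load a small dataset (stand-in for an ARFF file)
iris = load_iris()
X = iris.data                      # 150 instances, 4 numeric attributes

# Histogram: distribution of the first attribute (sepal length)
plt.figure()
plt.hist(X[:, 0], bins=15)
plt.xlabel("sepal length (cm)")
plt.ylabel("frequency")
plt.title("Histogram of sepal length")

# Scatter plot between two attributes, coloured by class
plt.figure()
plt.scatter(X[:, 0], X[:, 2], c=iris.target)
plt.xlabel("sepal length (cm)")
plt.ylabel("petal length (cm)")
plt.title("Scatter plot of two attributes")

plt.show()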
Experiment – 9
Aim: Perform Data Similarity Measure (Euclidean, Manhattan Distance).
Theory:

Data similarity measures quantify how alike two data points or vectors are. These measures are fundamental
in various fields, including machine learning, data mining, and pattern recognition. Among the most
commonly used similarity measures are Euclidean distance and Manhattan distance, which are often used to
assess the closeness of points in a multidimensional space.
1. Euclidean Distance
Definition: Euclidean distance is the straight-line distance between two points in Euclidean space. It is
calculated using the Pythagorean theorem and is one of the most commonly used distance measures in
mathematics and machine learning.
Formula: For two points P and Q in an n-dimensional space, where P=(p1,p2,…,pn) and Q=(q1,q2,…,qn), the Euclidean distance d is given by:
d(P,Q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + … + (pn - qn)^2)
For example, for P=(1,2) and Q=(4,6), d(P,Q) = sqrt(3^2 + 4^2) = 5.
Properties:
 Non-negativity: d(P,Q)≥0
 Identity: d(P,Q)=0 if and only if P=Q
 Symmetry: d(P,Q)=d(Q,P)
 Triangle Inequality: d(P,R)≤d(P,Q)+d(Q,R) for any points P,Q,R
Applications:
 Used in clustering algorithms like K-means to determine the distance between points and centroids.
 Commonly applied in image recognition, pattern matching, and other areas requiring
spatial analysis.
2. Manhattan Distance
Definition: Manhattan distance, also known as the "city block" or "taxicab" distance, measures the distance
between two points by summing the absolute differences of their coordinates. It reflects the total grid
distance one would need to travel in a grid-like path.
Formula: For two points P and Q in an n-dimensional space, the Manhattan distance d is calculated as:
d(P,Q) = |p1 - q1| + |p2 - q2| + … + |pn - qn|
For the same example points P=(1,2) and Q=(4,6), d(P,Q) = 3 + 4 = 7.
Properties:
 Non-negativity: d(P,Q)≥0
 Identity: d(P,Q)=0 if and only if P=Q
 Symmetry: d(P,Q)=d(Q,P)
 Triangle Inequality: d(P,R)≤d(P,Q)+d(Q,R)
Applications:
 Useful in scenarios where movement is restricted to a grid, such as in geographical data analysis.
 Often employed in clustering algorithms and machine learning models where linear
relationships are more meaningful than straight-line distances.
Comparison of Euclidean and Manhattan Distance
1. Geometric Interpretation:
o Euclidean distance measures the shortest path between two points, while
Manhattan distance measures the total path required to travel along axes.
2. Sensitivity to Dimensions:
o Euclidean distance can be sensitive to the scale of data and the number of dimensions, as
it tends to emphasize larger values. In contrast, Manhattan distance treats all dimensions
equally, summing absolute differences.
3. Use Cases:
o Euclidean distance is preferred in applications involving continuous data and
geometric spaces, whereas Manhattan distance is favored in discrete settings, such as
grid-based environments.
Code:
!pip install liac-arff pandas scipy
import arff
from google.colab import files
import pandas as pd
from scipy.spatial import distance

# Step 1: Upload and Load ARFF File
uploaded = files.upload()
arff_file = 'diabetes.arff'

# Load the ARFF file
with open(arff_file, 'r') as f:
    dataset = arff.load(f)

data = pd.DataFrame(dataset['data'], columns=[attr[0] for attr in dataset['attributes']])

print("Dataset preview:")
print(data.head())

# Step 2: Select only numeric columns (exclude categorical/string columns)
numeric_data = data.select_dtypes(include=[float, int])
print("\nNumeric columns:")
print(numeric_data.head())

# Step 3: Compute Euclidean and Manhattan Distances
point1 = numeric_data.iloc[0].values  # First numeric data point
point2 = numeric_data.iloc[1].values  # Second numeric data point

euclidean_dist = distance.euclidean(point1, point2)
print(f'\nEuclidean Distance between the first two points: {euclidean_dist}')

manhattan_dist = distance.cityblock(point1, point2)
print(f'Manhattan Distance between the first two points: {manhattan_dist}')

# Step 4: Compute Pairwise Distance Matrices
euclidean_dist_matrix = distance.squareform(distance.pdist(numeric_data.values, metric='euclidean'))
euclidean_dist_df = pd.DataFrame(euclidean_dist_matrix)
print("\nEuclidean Distance Matrix:")
print(euclidean_dist_df)

manhattan_dist_matrix = distance.squareform(distance.pdist(numeric_data.values, metric='cityblock'))
manhattan_dist_df = pd.DataFrame(manhattan_dist_matrix)
print("\nManhattan Distance Matrix:")
print(manhattan_dist_df)

Output:
Experiment – 10
Aim: Perform Apriori algorithm to mine frequent item-sets.
Theory:
The Apriori algorithm is a classic algorithm used in data mining for mining frequent itemsets and learning
association rules. It was proposed by R. Agrawal and R. Srikant in 1994. The algorithm is
particularly effective in market basket analysis, where the goal is to find sets of items that frequently co-
occur in transactions.
Key Concepts
1. Itemsets:
o An itemset is a collection of one or more items. For example, in a grocery store dataset,
an itemset might include items like {milk, bread}.
2. Frequent Itemsets:
o A frequent itemset is an itemset that appears in the dataset with a frequency greater than
or equal to a specified threshold, called support.
o The support of an itemset X is defined as the proportion of transactions in the dataset that contain X:
Support(X) = (Number of transactions containing X) / (Total number of transactions)
3. Association Rules:
o An association rule is an implication of the form X→Y, indicating that the presence of
itemset X in a transaction implies the presence of itemset Y.
o Rules are evaluated based on two main metrics: support and confidence. The confidence of a rule is the proportion of transactions that contain Y among those that contain X:
Confidence(X→Y) = Support(X ∪ Y) / Support(X)
4. Lift:
o Lift measures the effectiveness of a rule over random chance and is defined as:
Lift(X→Y) = Confidence(X→Y) / Support(Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
A lift greater than 1 indicates a positive correlation between X and Y.

Apriori Algorithm Steps


The Apriori algorithm operates in two main phases: Frequent Itemset Generation and Association Rule
Generation.
Phase 1: Frequent Itemset Generation
1. Initialization:
o Scan the database to count the frequency of each individual item (1-itemsets) and
generate a list of frequent 1-itemsets based on the minimum support threshold.
2. Iterative Process:
o Generate candidate itemsets of length k (k-itemsets) from the frequent (k-1)-itemsets
found in the previous iteration.
o Prune the candidate itemsets by removing those that contain any infrequent subsets
(based on the Apriori property, which states that all subsets of a frequent itemset must
also be frequent).
3. Count Support:
o Scan the database again to count the support of the candidate itemsets.
o Retain those that meet or exceed the minimum support threshold, forming the set
of frequent k-itemsets.
4. Repeat:
o Repeat steps 2 and 3 for increasing values of k until no more frequent itemsets can be found.
Phase 2: Association Rule Generation
1. Rule Generation:
o For each frequent itemset, generate all possible non-empty subsets to create rules of
the form X→Y.
o Calculate the confidence for each rule and retain those that meet or exceed a
specified confidence threshold.
2. Evaluation:
o Evaluate the rules using metrics such as support, confidence, and lift to determine
their significance and usefulness.
Code:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Sample dataset: Transactions with items bought
data = {'Milk':   [1, 1, 0, 1, 0],
        'Bread':  [1, 1, 1, 0, 1],
        'Butter': [0, 1, 1, 1, 0],
        'Beer':   [0, 1, 0, 0, 1],
        'Cheese': [1, 0, 0, 1, 1]}

# Convert the dataset into a DataFrame
df = pd.DataFrame(data)

# Display the dataset
print("Transaction Dataset:")
print(df)

# Step 1: Apply the Apriori algorithm to find frequent item-sets
# Minimum support = 0.6 (changeable)
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)

# Display frequent item-sets
print("\nFrequent Item-Sets:")
print(frequent_itemsets)

# Step 2: Generate the association rules based on the frequent item-sets
# Minimum confidence = 0.7 (changeable)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

# Display the association rules
print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

Output:
Experiment – 11
Aim: Develop different clustering algorithms like K-Means, K-Medoids Algorithm, Partitioning Algorithm and Hierarchical.
Theory:
Clustering is a core technique in data mining and machine learning, used to group similar data points into
clusters based on their features. This unsupervised learning method uncovers patterns and structures within
datasets without relying on predefined labels. Below, we explore four commonly used clustering algorithms:
K-Means, K-Medoids, Partitioning Algorithms, and Hierarchical Clustering.

1. K-Means Clustering

Overview: K-Means partitions a dataset into K distinct, non-overlapping clusters by minimizing the variance
within each cluster and maximizing the variance between clusters. It’s widely used due to its simplicity and
efficiency.

Algorithm Steps:
1. Initialization: Randomly select K initial centroids from the dataset.
2. Assignment: Assign each data point to the nearest centroid based on Euclidean distance, forming K
clusters.
3. Update: Compute new centroids as the mean of all points in each cluster.
4. Convergence: Repeat the assignment and update steps until centroids stabilize or a set number of
iterations is reached.

Strengths:
 Simple and easy to implement.
 Efficient for large datasets.
 Scales well with data size.

Weaknesses:
 Requires specifying the number of clusters (K) in advance.
 Sensitive to initial centroid placement.
 Prone to local minima convergence.

2. K-Medoids Clustering

Overview: Similar to K-Means, but uses actual data points (medoids) as cluster centers, making it more robust
to noise and outliers.

Algorithm Steps:
1. Initialization: Randomly select K medoids from the dataset.
2. Assignment: Assign each data point to the nearest medoid based on a chosen distance metric
(commonly Manhattan distance).
3. Update: For each cluster, select the point with the smallest total distance to all other points in the
cluster as the new medoid.
4. Convergence: Repeat the assignment and update steps until medoids no longer change.

Strengths:
 More robust to outliers than K-Means.
 Uses actual data points, avoiding issues with mean calculations in some contexts.
Weaknesses:
 Computationally more expensive than K-Means.
 Still requires specifying the number of clusters (K).

3. Partitioning Clustering Algorithms

Overview: These algorithms partition the dataset into K clusters without a hierarchical structure, aiming to
minimize intra-cluster variance.

Common Approaches:
 K-Means: Uses centroid distances.
 CLARA (Clustering LARge Applications): Extends K-Medoids with sampling for scalability.
 PAM (Partitioning Around Medoids): Chooses medoids to minimize total distance.

Strengths:
 Flexible with different distance metrics.
 Efficient for diverse datasets.

Weaknesses:
 Requires defining the number of clusters beforehand.
 May struggle with clusters of varying shapes or sizes.

4. Hierarchical Clustering

Overview: Builds a hierarchy of clusters either bottom-up (agglomerative) or top-down (divisive), without
needing a pre-defined number of clusters.

Types:
 Agglomerative: Starts with individual points as clusters, merging the closest pairs until one cluster
remains or a stopping criterion is met.
 Divisive: Starts with one cluster and splits it recursively into smaller clusters.

Strengths:
 No need to specify the number of clusters upfront.
 Produces a dendrogram, visually representing cluster relationships.

Weaknesses:
 Computationally intensive (O(n²) complexity).
 Sensitive to noise and outliers, potentially distorting the hierarchy.
Code:

!pip install scikit-learn
!pip install pyclustering

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils import calculate_distance_matrix
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate sample data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualize the generated data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Sample Data for Clustering")
plt.show()

# Apply K-Means algorithm
kmeans_model = KMeans(n_clusters=4)
kmeans_model.fit(X)

# Get cluster labels and centroids
labels_kmeans = kmeans_model.labels_
centroids_kmeans = kmeans_model.cluster_centers_

# Visualize the K-Means clustering result
plt.scatter(X[:, 0], X[:, 1], c=labels_kmeans, s=50, cmap='viridis')
plt.scatter(centroids_kmeans[:, 0], centroids_kmeans[:, 1], s=200, c='red', label='Centroids')
plt.title("K-Means Clustering")
plt.legend()
plt.show()

# Apply K-Medoids algorithm
# Generate a smaller sample dataset (a subset of the original data)
sampled_data = X[:50]  # Use the first 50 data points from the dataset

# Initialize the medoid indices (choose initial medoids from the subset)
initial_medoids = [0, 10, 20]  # Choose medoid indices carefully

# Apply K-Medoids to the smaller dataset
kmedoids_instance = kmedoids(sampled_data, initial_medoids, data_type='points')
kmedoids_instance.process()

# Get the resulting clusters and medoids
clusters = kmedoids_instance.get_clusters()
medoids = kmedoids_instance.get_medoids()

# Visualize the clusters and medoids
for cluster in clusters:
    plt.scatter(sampled_data[cluster, 0], sampled_data[cluster, 1])
plt.scatter(sampled_data[medoids, 0], sampled_data[medoids, 1], c='red', s=200, label='Medoids')
plt.title("K-Medoids Clustering (Optimized)")
plt.legend()
plt.show()

# Apply the PAM algorithm
pam_instance = kmedoids(X, initial_medoids, data_type='points')
pam_instance.process()  # Perform clustering

# Get the resulting clusters and medoids
clusters = pam_instance.get_clusters()
medoids = pam_instance.get_medoids()

# Visualize the PAM clustering result
for cluster in clusters:
    plt.scatter(X[cluster, 0], X[cluster, 1], s=50)
plt.scatter(X[medoids, 0], X[medoids, 1], s=200, c='red', marker='x', label='Medoids')
plt.title("PAM (K-Medoids) Clustering")
plt.legend()
plt.show()

# Apply Hierarchical Clustering
linked = linkage(X, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title("Dendrogram for Hierarchical Clustering")
plt.show()

# Perform agglomerative clustering to get cluster labels
hierarchical_model = AgglomerativeClustering(n_clusters=4, metric='euclidean', linkage='ward')
labels_hierarchical = hierarchical_model.fit_predict(X)

# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels_hierarchical, s=50, cmap='viridis')
plt.title("Agglomerative Hierarchical Clustering")
plt.show()
Output:
Experiment – 12
Aim: Apply Validity Measures to evaluate the quality of Data.
Theory:
Evaluating data quality is a critical step in data analysis and machine learning, as it directly impacts the
performance of models and the validity of insights derived from the data. Various validity measures help
assess different aspects of data quality. Below, we outline key validity measures, their importance, and
how they are typically assessed in practice.
1. Missing Values
Definition: Missing values occur when data points for certain features are not recorded. They can
introduce bias and reduce the overall quality of the dataset.
Importance:
 A high proportion of missing values can lead to inaccurate models and biased results.
 Different strategies can be applied to handle missing values, including imputation (filling in missing
values), deletion (removing incomplete records), or using algorithms that can handle missing data.
Measure:
 The number of missing values per column provides insight into the extent of the issue. This can be
calculated using code to identify and quantify missing values in each feature of the dataset.
2. Duplicate Entries
Definition: Duplicate entries refer to identical records in a dataset. These can occur due to errors during
data collection or processing.
Importance:
 Duplicates can skew the results of analyses and lead to overfitting in machine learning models.
 Identifying and removing duplicates is crucial for maintaining data integrity and ensuring accurate
analysis.
Measure:
 The number of duplicate entries is checked and quantified, helping to identify and address this issue
in the dataset.
3. Outlier Detection
Definition: Outliers are data points that differ significantly from other observations in the dataset. They
can arise due to variability in measurement or may indicate a measurement error.
Importance:
 Outliers can disproportionately affect statistical analyses and model training, leading to inaccurate
predictions.
 Identifying outliers allows for further investigation to determine whether to keep, remove, or adjust
them.
Measure:
 Outliers can be detected using methods such as the Isolation Forest algorithm, which reports the
number of anomalies detected, helping to understand their presence in the dataset.
4. Multicollinearity
Definition: Multicollinearity occurs when two or more independent variables in a regression model are
highly correlated, meaning they contain similar information.
Importance:
 High multicollinearity can inflate the variance of coefficient estimates, making the model unstable
and reducing interpretability.
 It complicates the process of determining the importance of predictors.
Measure:
 The correlation matrix reveals the relationships between features, allowing for the identification of
highly correlated features. High correlation coefficients indicate potential multicollinearity.
5. Feature Distribution (Skewness)
Definition: Skewness measures the asymmetry of the probability distribution of a real-valued random
variable. A skewed distribution can indicate that certain transformations may be needed before modeling.
Importance:
 Features that are heavily skewed can violate the assumptions of certain statistical tests and machine
learning algorithms, impacting model performance.
 Understanding skewness helps in selecting appropriate preprocessing methods (e.g., normalization,
logarithmic transformation).
Measure:
 Skewness of each feature is calculated and visualized. Visualizing the feature distributions helps
identify heavily skewed variables, guiding necessary transformations.
6. Class Imbalance
Definition: Class imbalance occurs when the number of instances in each class of a classification problem
is not approximately equal. This is common in binary classification tasks.
Importance:
 Imbalanced classes can lead to biased models that perform well on the majority class but poorly on
the minority class.
 Evaluating the class distribution and applying techniques to handle imbalances, such as resampling
methods (oversampling, undersampling) or algorithm adjustments, is crucial.
Measure:
 The distribution of the target class is checked, helping to identify any potential imbalance that could
affect model training. Techniques such as the Gini coefficient or the use of confusion matrices can
also provide insights into class imbalance.
Code:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Step 1: Generate sample data
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, weights=[0.9, 0.1], random_state=42)
df = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(10)])
df['Target'] = y

# Step 2: Introduce missing values for illustration
df.iloc[10:20, 0] = np.nan  # Create missing values

# Step 3: Data Quality Validity Measures
# 1. Check for Missing Values
missing_values = df.isnull().sum()
print("\nMissing Values per Column:")
print(missing_values)

# 2. Check for Duplicate Entries
duplicates = df.duplicated().sum()
print(f"\nNumber of Duplicate Entries: {duplicates}")

# 3. Detect Outliers using Isolation Forest (anomaly detection), assuming 5% of the data is outliers
iso_forest = IsolationForest(contamination=0.05)
outliers = iso_forest.fit_predict(df.drop('Target', axis=1).fillna(df.mean()))  # Fill missing values
outlier_count = sum(outliers == -1)
print(f"\nNumber of Outliers Detected: {outlier_count}")

# 4. Check for Multicollinearity (Correlation Matrix)
correlation_matrix = df.drop('Target', axis=1).corr()
print("\nCorrelation Matrix (Multicollinearity Check):")
print(correlation_matrix)

# Visualize Correlation Matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Feature Correlation Matrix")
plt.show()

# 5. Check for Feature Distribution (Skewness)
skewness = df.drop('Target', axis=1).skew()
print("\nSkewness of Features:")
print(skewness)

# Visualize Feature Distributions
df.drop('Target', axis=1).hist(figsize=(12, 10))
plt.suptitle("Feature Distribution (Check for Skewness)")
plt.show()

# 6. Check for Class Imbalance
class_distribution = df['Target'].value_counts(normalize=True)
print("\nClass Distribution (Imbalance Check):")
print(class_distribution)
Output:
