Q2.
Perform the following preprocessing tasks on the dirty_iris dataset.
i) Calculate the number and percentage of observations that are complete.
ii) Replace all the special values in the data with NA.
iii) Define these rules in a separate text file and read them (use the editfile function from the R package editrules, or a similar function in Python). Print the resulting constraint object.
– Species should be one of the following values: setosa, versicolor or virginica.
– All measured numerical properties of an iris should be positive.
– The petal length of an iris is at least 2 times its petal width.
– The sepal length of an iris cannot exceed 30 cm.
– The sepals of an iris are longer than its petals.
iv) Determine how often each rule is broken (violatedEdits). Also summarize and plot the result.
v) Find outliers in sepal length using boxplot and boxplot.stats.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
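# Load the iris dataset
# Note: sns.load_dataset('iris') returns the clean iris data, used here as a stand-in;
# the dirty_iris data named in the question would have to be loaded from its own file.
iris = sns.load_dataset('iris')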
# i) Calculate the number and percentage of observations that are complete
complete_observations = iris.dropna()
num_complete_observations = len(complete_observations)
percentage_complete = (num_complete_observations / len(iris)) * 100
print(f"Number of complete observations: {num_complete_observations}")
print(f"Percentage of complete observations: {percentage_complete:.2f}%")
Number of complete observations: 150
Percentage of complete observations: 100.00%
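Because the clean iris data has no gaps, the count above is trivially 100%. The following is a minimal sketch of the same calculation on a small frame that does contain missing values. The four rows are made up for illustration and are not taken from dirty_iris.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample with gaps; not rows from the real dataset.
sample = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3, 5.8],
    "petal_length": [1.4, 1.3, np.nan, 5.1],
    "species": ["setosa", "setosa", None, "virginica"],
})

# A row is complete when none of its columns is missing.
is_complete = sample.notna().all(axis=1)
num_complete = int(is_complete.sum())
pct_complete = 100 * is_complete.mean()

print(f"{num_complete} of {len(sample)} rows complete ({pct_complete:.1f}%)")
```

Taking the mean of the boolean mask gives the fraction of complete rows directly, which avoids building a second DataFrame with dropna().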
# ii) Replace all the special values in data with NA
iris.replace(['?', '!','$','^'], pd.NA, inplace=True)
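In R, "special values" usually means NA, NaN, Inf and -Inf rather than placeholder characters. As a sketch under that assumption, infinite values can be mapped to NaN so that pandas treats them as missing. The sample frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample containing infinite values; not rows from the real data.
sample = pd.DataFrame({
    "sepal_length": [5.1, np.inf, 6.3],
    "petal_width": [0.2, 1.3, -np.inf],
})

# Map +/- infinity to NaN so that isna() and dropna() treat them as missing.
cleaned = sample.replace([np.inf, -np.inf], np.nan)
n_missing = int(cleaned.isna().sum().sum())

print(cleaned)
print(f"Missing values after replacement: {n_missing}")
```

Using np.nan rather than pd.NA keeps the numeric columns as float64, so later comparisons and plots work without a dtype conversion.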
# iii) Define rules in a separate text file and read them
# Save the rules in a text file (e.g., rules.txt)
rules_filename = 'rules.txt'
with open(rules_filename, 'w') as file:
    file.write("""
species: setosa, versicolor, virginica
numerical_properties: positive
petal_length: >= 2 * petal_width
sepal_length: <= 30
sepals_length_greater_than_petals: sepal_length > petal_length""")
# Read rules from the text file
with open(rules_filename, 'r') as file:
    rules = file.read()
# Print the resulting constraint object
print("Rules:", rules)
Rules:
species: setosa, versicolor, virginica
numerical_properties: positive
petal_length: >= 2 * petal_width
sepal_length: <= 30
sepals_length_greater_than_petals: sepal_length > petal_length
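The rules above are read back as plain text, so they cannot be checked against the data directly. One possible approach, sketched here, is to write each rule as a boolean expression that pandas can evaluate with DataFrame.eval. This is an assumption, not an equivalent of editrules' editfile; the helper read_rules, the constant RULES_TEXT and the two-row sample are hypothetical names and data introduced for illustration.

```python
import os
import tempfile

import pandas as pd

RULES_TEXT = """\
# one boolean expression per line, written in pandas eval syntax
species in ['setosa', 'versicolor', 'virginica']
sepal_length > 0 and sepal_width > 0 and petal_length > 0 and petal_width > 0
petal_length >= 2 * petal_width
sepal_length <= 30
sepal_length > petal_length
"""


def read_rules(path):
    """Return the non-empty, non-comment lines of a rules file."""
    with open(path) as fh:
        lines = [line.strip() for line in fh]
    return [line for line in lines if line and not line.startswith("#")]


# Write the rules to a temporary file, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rules.txt")
    with open(path, "w") as fh:
        fh.write(RULES_TEXT)
    rules = read_rules(path)

for number, rule in enumerate(rules, start=1):
    print(f"rule {number}: {rule}")

# Hypothetical two-row sample, only to show that each rule can be evaluated.
sample = pd.DataFrame({
    "sepal_length": [5.1, 49.0], "sepal_width": [3.5, 3.0],
    "petal_length": [1.4, 1.4], "petal_width": [0.2, 0.2],
    "species": ["setosa", "setosa"],
})
# One boolean column per rule: True where the row satisfies the rule.
passes = pd.DataFrame({rule: sample.eval(rule) for rule in rules})
print(passes.all(axis=1))
```

Keeping the rules as evaluable expressions means the text file stays the single source of truth: adding a rule is a one-line change to the file rather than a change to the checking code.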
# iv) Determine how often the rules are broken (total over all rules)
violated_rules_count = 0
# Rule 1: Species should be one of the following values
violated_rules_count += len(iris[~iris['species'].isin(['setosa', 'versicolor', 'virginica'])])
# Rule 2: All measured numerical properties of an iris should be positive
numerical_properties = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
violated_rules_count += len(iris[(iris[numerical_properties] <= 0).any(axis=1)])
# Rule 3: The petal length of an iris is at least 2 times its petal width
violated_rules_count += len(iris[iris['petal_length'] < 2 * iris['petal_width']])
# Rule 4: The sepal length of an iris cannot exceed 30 cm
violated_rules_count += len(iris[iris['sepal_length'] > 30])
# Rule 5: The sepals of an iris are longer than its petals
violated_rules_count += len(iris[iris['sepal_length'] <= iris['petal_length']])
print(f"Total number of rule violations: {violated_rules_count}")
Total number of rule violations: 0
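The question asks how often each rule is broken, with a summary and a plot, whereas the code above reports one combined total. A minimal sketch of a per-rule breakdown follows. The rule labels and the four-row sample with deliberate errors are hypothetical and are not taken from dirty_iris.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample with deliberate errors; not rows from the real data.
sample = pd.DataFrame({
    "sepal_length": [5.1, 49.0, 6.3, 5.0],
    "sepal_width": [3.5, 3.0, -2.9, 3.4],
    "petal_length": [1.4, 1.4, 5.6, 0.3],
    "petal_width": [0.2, 0.2, 1.8, 0.2],
    "species": ["setosa", "setosa", "virginica", "tulip"],
})
measurements = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One boolean column per rule: True where the row violates the rule.
violations = pd.DataFrame({
    "valid species": ~sample["species"].isin(["setosa", "versicolor", "virginica"]),
    "positive values": (sample[measurements] <= 0).any(axis=1),
    "petal length >= 2 * width": sample["petal_length"] < 2 * sample["petal_width"],
    "sepal length <= 30": sample["sepal_length"] > 30,
    "sepal longer than petal": sample["sepal_length"] <= sample["petal_length"],
})

counts = violations.sum()  # how often each rule is broken
rows_with_errors = int(violations.any(axis=1).sum())
print(counts)
print(f"Rows breaking at least one rule: {rows_with_errors}")

# Horizontal bar chart of violations per rule.
counts.plot.barh(title="Violations per rule")
plt.xlabel("Number of violating rows")
plt.tight_layout()
plt.show()
```

Note that comparisons against NaN evaluate to False in pandas, so rows with missing measurements are not counted as violations here; they would need to be reported separately if that distinction matters.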
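# v) Find outliers in sepal length using a boxplot and boxplot statistics
from matplotlib.cbook import boxplot_stats
ax = sns.boxplot(x='sepal_length', data=iris)
plt.title('Boxplot for Sepal Length')
plt.show()
# Get boxplot statistics (matplotlib's counterpart of R's boxplot.stats):
# lower whisker, first quartile, median, third quartile, upper whisker
stats = boxplot_stats(iris['sepal_length'].dropna().to_numpy())[0]
print("Boxplot Statistics:")
print(stats['whislo'], stats['q1'], stats['med'], stats['q3'], stats['whishi'])
Boxplot Statistics:
4.3 5.1 5.8 6.4 7.9
# Outliers are the points beyond the whiskers (more than 1.5 * IQR from the quartiles)
outliers = stats['fliers']
print("Outliers in Sepal Length:")
print(outliers)
Outliers in Sepal Length:
[]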
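The rule a boxplot uses to flag outliers can also be applied by hand: a point is an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below shows this on a hypothetical array of sepal lengths. It uses NumPy percentiles, whereas R's boxplot.stats uses Tukey hinges, so the fences can differ slightly between the two.

```python
import numpy as np

# Hypothetical sepal lengths in cm; 49.0 stands in for a data-entry error.
values = np.array([4.9, 5.0, 5.4, 5.8, 6.1, 6.4, 6.9, 49.0])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the points outside the fences.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print("outliers:", outliers)
```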