KNN-SVM Assignment

The assignment involves implementing KNN and SVM classifiers using two datasets: car-evaluation and Breast Cancer Wisconsin. It includes tasks such as data preparation, model training, feature selection, and performance analysis, focusing on accuracy, training time, and the impact of different parameters. The final conclusions highlight SVM's strengths in high-dimensional spaces and its weaknesses with noisy or overlapping data.

Assignment: Classification (KNN - SVM)

Dataset:
During this assignment you will use:
- The Car Evaluation dataset for Task 1. Training and test splits are provided in CSV file format.
- The Breast Cancer Wisconsin (Diagnostic) dataset for Task 2. Training and test splits are provided in CSV file format.
You will find them on the drive.

Task 1:
Use scikit-learn or other Python packages to implement a KNN classifier (KNeighborsClassifier). In this question, we use the car-evaluation dataset.
(a) This dataset contains 1728 samples in total. First, shuffle the dataset and split it into a training set of 1000 samples, a validation set of 300 samples, and a testing set of 428 samples. Use Python to implement this data preparation step.
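
A minimal sketch of this step with pandas is shown below; the file name car_evaluation.csv and the fixed random seed are assumptions, not part of the assignment.

import pandas as pd
from sklearn.utils import shuffle

df = pd.read_csv("car_evaluation.csv")      # assumed file name
df = shuffle(df, random_state=42).reset_index(drop=True)

train_df = df.iloc[:1000]    # 1000 training samples
val_df = df.iloc[1000:1300]  # 300 validation samples
test_df = df.iloc[1300:]     # remaining 428 testing samples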
(b) Some attributes are represented by string values, so if we choose a distance metric like Euclidean distance, we need to transform the string values into numbers. Use Python to implement this preprocessing step.
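
One way to do this is with scikit-learn's OrdinalEncoder, sketched below; the column names follow the usual car-evaluation schema and are assumptions about the provided CSV headers.

from sklearn.preprocessing import OrdinalEncoder

# Assumed column names for the car-evaluation data.
feature_cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]

encoder = OrdinalEncoder()
encoder.fit(train_df[feature_cols])  # fit on the training split only

X_train = encoder.transform(train_df[feature_cols])
X_val = encoder.transform(val_df[feature_cols])
X_test = encoder.transform(test_df[feature_cols])
y_train, y_val, y_test = train_df["class"], val_df["class"], test_df["class"]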
(c) Use different numbers of training samples to show the impact of training set size. Use 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100% of the training set for 10 separate KNN classifiers and show their performance (accuracy score) on the validation set and testing set. You can use a fixed value of K=2 (number of nearest neighbors) in this question. Note that the X axis is the portion of the training set and the Y axis should be the accuracy score. There should be two lines in total: one for the validation set and another for the testing set.
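
A minimal sketch of this experiment, reusing X_train, y_train, and the other splits from the sketches above:

import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

fractions = [0.1 * i for i in range(1, 11)]
val_acc, test_acc = [], []
for frac in fractions:
    n = int(frac * len(X_train))
    knn = KNeighborsClassifier(n_neighbors=2)
    knn.fit(X_train[:n], y_train.iloc[:n])
    val_acc.append(knn.score(X_val, y_val))
    test_acc.append(knn.score(X_test, y_test))

plt.plot(fractions, val_acc, marker="o", label="validation set")
plt.plot(fractions, test_acc, marker="o", label="testing set")
plt.xlabel("Portion of training set")
plt.ylabel("Accuracy score")
plt.legend()
plt.show()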
(d) Using 100% of the training samples, try to find the best K value, and show the accuracy curve on the validation set as K varies from 1 to 10.
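
A sketch of the K sweep on the full training set, continuing from the sketches above:

import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

k_values = list(range(1, 11))
k_acc = [KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
         .score(X_val, y_val) for k in k_values]

plt.plot(k_values, k_acc, marker="o")
plt.xlabel("K")
plt.ylabel("Validation accuracy")
plt.show()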
(e) Analyze the training time when using different numbers of training samples. Consider the following 4 cases:
• 10% of the whole training set and K = 2
• 100% of the whole training set and K = 2
• 10% of the whole training set and K = 10
• 100% of the whole training set and K = 10
Plot a bar chart to show the prediction time on the testing set.
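
A sketch of the timing experiment using time.perf_counter(); note that for KNN most of the cost appears at prediction time, which is what the bar chart measures here.

import time
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

cases = [(0.1, 2), (1.0, 2), (0.1, 10), (1.0, 10)]
labels, pred_times = [], []
for frac, k in cases:
    n = int(frac * len(X_train))
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train[:n], y_train.iloc[:n])
    start = time.perf_counter()
    knn.predict(X_test)
    pred_times.append(time.perf_counter() - start)
    labels.append(f"{int(frac * 100)}%, K={k}")

plt.bar(labels, pred_times)
plt.ylabel("Prediction time on the testing set (s)")
plt.show()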
(f) Provide your conclusions from the experiments in questions (c), (d) and (e).

Task 2:
1.
• Load the dataset and convert categorical class labels under the target column to
numerical values by using the LabelEncoder.
• Choose two features from the dataset to apply the SVM and Logistic Regression
algorithms for classification. Plot the data, showing the classes separately. Explain
how and why you chose these two features.
• Classify testing data by using SVM and Logistic Regression classifiers. Provide
accuracies.
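
A minimal sketch of this part is below; the file name breast_cancer.csv and the column names (diagnosis, radius_mean, concave points_mean) are assumptions about the provided CSV, and the train/test split shown is illustrative since the assignment provides its own splits.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("breast_cancer.csv")              # assumed file name
y = LabelEncoder().fit_transform(df["diagnosis"])  # B -> 0, M -> 1

# Two illustrative features; justify your own choice (e.g., clear class
# separation in the scatter plot).
X = df[["radius_mean", "concave points_mean"]].to_numpy()

for label, name in [(0, "benign"), (1, "malignant")]:
    mask = y == label
    plt.scatter(X[mask, 0], X[mask, 1], label=name, alpha=0.6)
plt.xlabel("radius_mean")
plt.ylabel("concave points_mean")
plt.legend()
plt.show()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print("SVM accuracy:", SVC().fit(X_tr, y_tr).score(X_te, y_te))
print("LogReg accuracy:",
      LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te))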
2.
➢ Shuffling and Splitting:
o The dataset contains 569 samples.
o Shuffle the entire dataset and split it into:
▪ Training set: 400 samples
▪ Validation set: 100 samples
▪ Testing set: 69 samples
➢ Preprocessing:
o Standardize the numerical features (use StandardScaler from Scikit-learn).
➢ Feature Selection:
o Perform feature selection using correlation or other methods to identify the most important features.
o Visualize the dataset in a 2D plot with the chosen features, showing classes separately.
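
A sketch of the shuffle/split, standardization, and correlation-based feature selection, continuing from the part-1 sketch (reusing df and y); it assumes the CSV holds the diagnosis column plus numeric feature columns only.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
idx = rng.permutation(len(df))                       # 569 samples
train_idx, val_idx, test_idx = idx[:400], idx[400:500], idx[500:]

feature_cols = df.columns.drop("diagnosis")          # assumed layout
scaler = StandardScaler().fit(df.iloc[train_idx][feature_cols])
X_train = scaler.transform(df.iloc[train_idx][feature_cols])
X_val = scaler.transform(df.iloc[val_idx][feature_cols])
X_test = scaler.transform(df.iloc[test_idx][feature_cols])
y_train, y_val, y_test = y[train_idx], y[val_idx], y[test_idx]

# Rank features by absolute correlation with the encoded target.
corr = df[feature_cols].corrwith(pd.Series(y, index=df.index)).abs()
top2 = corr.sort_values(ascending=False).index[:2]
print("Two most correlated features:", list(top2))

for label, name in [(0, "benign"), (1, "malignant")]:
    mask = y == label
    plt.scatter(df.loc[mask, top2[0]], df.loc[mask, top2[1]],
                label=name, alpha=0.6)
plt.xlabel(top2[0])
plt.ylabel(top2[1])
plt.legend()
plt.show()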

3.

➢ Linear Kernel and Decision Boundary:

• Train an SVM classifier with a linear kernel and visualize the decision boundaries.
• Explain the results: Is the dataset linearly separable? How does SVM handle this?

➢ RBF Kernel and Decision Boundary:

• Train an SVM classifier with an RBF kernel and visualize the decision boundaries.
• Compare the results with the linear kernel. Discuss how the RBF kernel uses the kernel trick
to map data into a higher-dimensional space.

➢ Polynomial Kernel and Decision Boundary:

• Train an SVM classifier with a polynomial (poly) kernel and visualize the decision boundaries.
• Compare the results with the linear kernel. Discuss how the polynomial kernel handles this.
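
For the three kernel questions above, a minimal sketch of training and visualizing the decision boundaries on the two selected, standardized features; it assumes X_train/y_train from the earlier sketch (restricted here to two columns) and scikit-learn >= 1.1 for DecisionBoundaryDisplay.

import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.inspection import DecisionBoundaryDisplay

X2 = X_train[:, :2]  # assumes the two chosen feature columns come first

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, kernel in zip(axes, ["linear", "rbf", "poly"]):
    clf = SVC(kernel=kernel).fit(X2, y_train)
    DecisionBoundaryDisplay.from_estimator(clf, X2, ax=ax, alpha=0.4)
    ax.scatter(X2[:, 0], X2[:, 1], c=y_train, edgecolor="k", s=15)
    ax.set_title(f"kernel={kernel}")
plt.show()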

➢ Compare the kernels' accuracy scores on the validation and testing sets using default
hyperparameters.
➢ Use grid search to tune the following hyperparameters:

• C: Test values [0.01, 0.1, 1, 10, 100]. Plot the accuracy score on the validation set as C varies.
• gamma: Test values [0.001, 0.01, 0.1, 1]. Plot the accuracy score on the validation set as gamma varies.
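
Because the training/validation split is fixed here, a plain loop over the grid (rather than GridSearchCV with internal cross-validation) is one reasonable sketch; it reuses X_train, y_train, X_val, y_val from above.

import matplotlib.pyplot as plt
from sklearn.svm import SVC

C_values = [0.01, 0.1, 1, 10, 100]
acc_C = [SVC(kernel="rbf", C=C).fit(X_train, y_train).score(X_val, y_val)
         for C in C_values]

gamma_values = [0.001, 0.01, 0.1, 1]
acc_gamma = [SVC(kernel="rbf", gamma=g).fit(X_train, y_train).score(X_val, y_val)
             for g in gamma_values]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.semilogx(C_values, acc_C, marker="o")
ax1.set_xlabel("C")
ax2.semilogx(gamma_values, acc_gamma, marker="o")
ax2.set_xlabel("gamma")
for ax in (ax1, ax2):
    ax.set_ylabel("Validation accuracy")
plt.show()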

4. Performance Analysis:

➢ Training Time and Prediction Time:

• Measure and analyze the training and prediction times for SVM under the
following scenarios:
o Case 1: 10% of the training set and kernel=linear, C=1.
o Case 2: 100% of the training set and kernel=linear, C=1.
o Case 3: 10% of the training set and kernel=rbf, C=1, gamma=0.01.
o Case 4: 100% of the training set and kernel=rbf, C=1, gamma=0.01.
• Visualization: Plot a bar chart to show the training time and prediction time
for each of these scenarios.
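
A sketch of the four scenarios using time.perf_counter(), reusing the Task 2 splits from above:

import time
import matplotlib.pyplot as plt
from sklearn.svm import SVC

scenarios = [
    ("10% linear", 0.1, dict(kernel="linear", C=1)),
    ("100% linear", 1.0, dict(kernel="linear", C=1)),
    ("10% rbf", 0.1, dict(kernel="rbf", C=1, gamma=0.01)),
    ("100% rbf", 1.0, dict(kernel="rbf", C=1, gamma=0.01)),
]
labels, fit_times, pred_times = [], [], []
for label, frac, params in scenarios:
    n = int(frac * len(X_train))
    clf = SVC(**params)
    t0 = time.perf_counter()
    clf.fit(X_train[:n], y_train[:n])
    t1 = time.perf_counter()
    clf.predict(X_test)
    t2 = time.perf_counter()
    labels.append(label)
    fit_times.append(t1 - t0)
    pred_times.append(t2 - t1)

x = range(len(labels))
plt.bar([i - 0.2 for i in x], fit_times, width=0.4, label="training time")
plt.bar([i + 0.2 for i in x], pred_times, width=0.4, label="prediction time")
plt.xticks(list(x), labels)
plt.ylabel("Time (s)")
plt.legend()
plt.show()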

5. Exploring the Weaknesses of SVM

➢ Overlapping Classes:

• Manipulate the feature values (e.g., reduce the separability between the two
diagnosis classes) to create overlapping classes.
• Train the classifier again and discuss the impact on accuracy. How does the
performance drop when classes overlap?
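
One simple way to reduce separability is to shrink every sample toward the overall training mean, as sketched below; the shrink factor 0.3 is an arbitrary illustrative choice, not something the assignment prescribes.

from sklearn.svm import SVC

mu = X_train.mean(axis=0)

def shrink(X, factor=0.3):
    # Pull samples toward the training mean to compress class separation.
    return mu + factor * (X - mu)

clf = SVC(kernel="rbf").fit(shrink(X_train), y_train)
print("Validation accuracy with overlap:", clf.score(shrink(X_val), y_val))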

➢ Noisy Data:
• Add Gaussian noise to 10% of the samples in the training set.
• Retrain the SVM classifier with the RBF kernel and tuned parameters.
• Report accuracy on the validation and testing sets, and compare with the
results from the clean dataset.
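
A sketch of the noise experiment; best_C and best_gamma are placeholders standing in for the values found in the grid search above.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_noisy = X_train.copy()
# Perturb a random 10% of the training samples with Gaussian noise.
noisy_idx = rng.choice(len(X_noisy), size=len(X_noisy) // 10, replace=False)
X_noisy[noisy_idx] += rng.normal(0.0, 1.0, size=X_noisy[noisy_idx].shape)

clf = SVC(kernel="rbf", C=best_C, gamma=best_gamma)  # tuned values (placeholders)
clf.fit(X_noisy, y_train)
print("Validation accuracy:", clf.score(X_val, y_val))
print("Testing accuracy:", clf.score(X_test, y_test))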

6. Conclusions:
After running your code, you can summarize:

• SVM's strengths: Good for high-dimensional spaces, especially when classes are
separable.
• SVM's weaknesses: Struggles with noisy or overlapping data, leading to
decreased accuracy.
