1. Give 5 properties of Python and explain why Python is suitable for Data Mining.
Python is an easy-to-use, object-oriented, readable, expressive, open-source, and portable programming language, with a large collection of libraries implementing data mining algorithms.
2. Write the output of the following code.
<class 'tuple'>
[2.0, 100, 5]
3. (10 pts) Please make the necessary change in the given code so that it doesn’t give the following
error message and works as commented:
Line 1 must be → import pandas as pd
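For illustration, a minimal sketch of the kind of fix intended (the original listing and error message are not reproduced here; the DataFrame usage below is hypothetical and only assumes the later lines refer to pandas through the pd alias):

# Line 1 fixed: import pandas under the alias pd so that later references
# to pd resolve (otherwise Python raises: NameError: name 'pd' is not defined).
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical continuation
print(df)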
4. What is the output of the following code? (Hint: if the loop test is False, then
execution jumps to the else: row.)
4 320
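As a side note, a tiny illustrative snippet (not the exam's code) showing the while/else behavior the hint describes:

i = 1
total = 1
while i < 4:
    i += 1
    total *= i
else:
    # Runs once the loop test (i < 4) becomes False.
    print(i, total)  # prints: 4 24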
5. What is the output of the following code?
True
False
None
6. We are writing a sublist function which compares two lists and returns True if the first list (lst1) is
a sublist of (is contained inside) the second list (lst2). We created a version of the second list, ls2,
by eliminating all elements of lst2 which are not in the first list, to see whether the final lists are
the same. However, even though the final lists contain the same elements, the comparison evaluates to False.
This output needs to be True, since the elements of the list [2,1] are also elements of the list [1,2,5,3].
What property of lists can we use in the comparison (?==?) so that the function gives the correct result
(True) in the given example above?
Line 4 must be → return sorted(lst1) == sorted(ls2)
Note: Another sublist function given in the Apriori algorithm code runs faster.
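For reference, a minimal sketch of the corrected function (an assumed reconstruction, since the exam's full listing is not shown here; only the fixed line 4 is given verbatim above):

def sublist(lst1, lst2):
    # Keep only the elements of lst2 that also occur in lst1.
    ls2 = [x for x in lst2 if x in lst1]
    # Line 4 fixed: sorting both lists makes the comparison order-insensitive.
    return sorted(lst1) == sorted(ls2)

print(sublist([2, 1], [1, 2, 5, 3]))  # True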
7. What is the output of the following code?
25
81
75
8. a) Describe the difference between unsupervised and supervised
techniques of Data Mining and give an example of each. b) Define overfitting
(o.f.); for which of the above techniques is o.f. a problem?
a) Supervised techniques can be used when a labeled dataset is available for
training and testing, whereas unsupervised techniques do not require a
labeled dataset. Unsupervised techniques are used to detect new patterns
and clusters in relatively unknown or unstructured data; supervised
techniques are used to predict future data when there is enough structured
and analyzed past information. K-means clustering is an unsupervised
technique; decision tree analysis is an example of a supervised technique.
b) Overfitting is a problem of supervised techniques: the model is customized
too closely to the specific training data at hand, so it does not perform
well on the test data and future data.
9. Describe the predictive modeling process. Which techniques are most suitable
for modeling datasets with nominal categories? Give 2 examples of these
techniques.
In predictive modeling, a labeled dataset is split into two parts, a training
dataset and a test dataset. A model is built using the training dataset, and
the test dataset is fed into the model to predict its labels. The actual labels
of the test dataset and the predicted labels are compared to evaluate the
performance of the model.
Classification techniques are most suited to predicting or describing datasets
with binary or nominal categories. Decision Trees and Rule-Based Classifiers
are examples.
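A minimal sketch of this workflow, assuming scikit-learn and its bundled Iris dataset (any classifier and labeled dataset would do):

# Predictive modeling: split a labeled dataset, train a classifier,
# then compare predicted and actual labels on the held-out test set.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # labeled dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # training/test split
model = DecisionTreeClassifier().fit(X_train, y_train)  # build model on training data
y_pred = model.predict(X_test)  # predict labels of the test data
print(accuracy_score(y_test, y_pred))  # evaluate: actual vs. predicted labels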
10. At one stage in K-Means Clustering of the given data set with two
attributes, distances of the points to each centroid are given in the following
table. What will be the centroid coordinates in the next stage?
Id x y distance_from_1 distance_from_2 distance_from_3
0 12 39 26.93 56.08 56.73
1 28 30 14.14 41.76 53.34
2 29 54 38.12 40.80 34.06
3 24 55 39.05 45.88 37.44
4 45 63 50.70 31.14 16.40
5 52 70 59.93 32.25 6.71
6 52 63 53.71 26.40 13.34
7 55 58 51.04 20.62 18.00
8 53 23 27.89 24.21 53.04
9 55 14 29.07 30.87 62.00
10 64 19 38.12 23.35 57.71
11 69 7 43.93 35.01 70.41
To find the updated centroid coordinates, we first assign each point to its
nearest existing centroid (check which point is closest to which centroid;
e.g., points 0, 1 and 9 are closest to C1). We then take the arithmetic mean
(a.m.) of the x and y coordinates of the assigned points to get the updated
coordinates of each centroid:
C1 = [a.m.(X0, X1, X9), a.m.(Y0, Y1, Y9)] = [(12+28+55)/3, (39+30+14)/3]
Similarly,
C2 = [a.m.(X8, X10, X11), a.m.(Y8, Y10, Y11)] = [(53+64+69)/3, (23+19+7)/3]
and
C3 = [a.m.(X2, X3, X4, X5, X6, X7), a.m.(Y2, Y3, Y4, Y5, Y6, Y7)] = [(29+24+45+52+52+55)/6, (54+55+63+70+63+58)/6]
Answer:
C1 = [31.67, 27.67], C2 = [62.0, 16.33], C3 = [42.83, 60.5]
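The update can be double-checked with a short script (a sketch assuming NumPy; the coordinates and distance table are copied from above):

import numpy as np

points = np.array([[12, 39], [28, 30], [29, 54], [24, 55], [45, 63], [52, 70],
                   [52, 63], [55, 58], [53, 23], [55, 14], [64, 19], [69, 7]])
dists = np.array([[26.93, 56.08, 56.73], [14.14, 41.76, 53.34],
                  [38.12, 40.80, 34.06], [39.05, 45.88, 37.44],
                  [50.70, 31.14, 16.40], [59.93, 32.25, 6.71],
                  [53.71, 26.40, 13.34], [51.04, 20.62, 18.00],
                  [27.89, 24.21, 53.04], [29.07, 30.87, 62.00],
                  [38.12, 23.35, 57.71], [43.93, 35.01, 70.41]])
assign = dists.argmin(axis=1)  # index of the nearest centroid per point
new_centroids = [points[assign == k].mean(axis=0) for k in range(3)]
print(np.round(new_centroids, 2))  # [[31.67 27.67] [62. 16.33] [42.83 60.5]]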
11. How many different splits can be made on the dataset given below, and what is the
information gain of each?
Note: Use the "Entropy" measure for information gain, given by the following
formula:
Entropy = -Σ_i p_i * log2(p_i), where p_i is the fraction of records belonging to class i.
Id a1 a2 a3 Class
1 T T 1.0 +
2 T T 6.0 +
3 T F 5.0 -
4 F F 4.0 +
5 F T 7.0 -
6 F T 3.0 -
7 F F 8.0 -
8 T F 7.0 +
9 F T 5.0 -
8 splits are possible: on a1, on a2, and the six threshold splits a3 >= 3.0, 4.0, 5.0, 6.0, 7.0, and 8.0.
Entropy original (E.O.): -4/9 * log2(4/9) - 5/9 * log2(5/9) = 0.9911
Split information gains:
Ex. a1 children entropy = 4/9 * Entropy(3+,1-) + 5/9 * Entropy(1+,4-)
= 4/9 * [-3/4 * log2(3/4) - 1/4 * log2(1/4)] + 5/9 * [-1/5 * log2(1/5) - 4/5 * log2(4/5)] = 0.7616
Ex. a2 children entropy = 5/9 * Entropy(2+,3-) + 4/9 * Entropy(2+,2-)
= 5/9 * [-2/5 * log2(2/5) - 3/5 * log2(3/5)] + 4/9 * [-1/2 * log2(1/2) - 1/2 * log2(1/2)] = 0.9839
a1 info gain: E.O. - (a1 children entropy) = E.O. - 0.7616 = 0.2294
a2 info gain: E.O. - (a2 children entropy) = E.O. - 0.9839 = 0.0072
a3 >= 3.0 info gain: E.O. - (a3 >= 3.0 children entropy) = E.O. - 0.8484 = 0.1427
a3 >= 4.0 info gain: E.O. - (a3 >= 4.0 children entropy) = E.O. - 0.9885 = 0.0026
a3 >= 5.0 info gain: E.O. - (a3 >= 5.0 children entropy) = E.O. - 0.9183 = 0.0728
a3 >= 6.0 info gain: E.O. - (a3 >= 6.0 children entropy) = E.O. - 0.9839 = 0.0072
a3 >= 7.0 info gain: E.O. - (a3 >= 7.0 children entropy) = E.O. - 0.9728 = 0.0183
a3 >= 8.0 info gain: E.O. - (a3 >= 8.0 children entropy) = E.O. - 0.8889 = 0.1022
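These numbers can be verified with a short script (an assumed reconstruction; the exam itself provides no code):

from math import log2

def entropy(pos, neg):
    # Entropy of a node holding pos positive and neg negative records.
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

data = [("T", "T", 1.0, "+"), ("T", "T", 6.0, "+"), ("T", "F", 5.0, "-"),
        ("F", "F", 4.0, "+"), ("F", "T", 7.0, "-"), ("F", "T", 3.0, "-"),
        ("F", "F", 8.0, "-"), ("T", "F", 7.0, "+"), ("F", "T", 5.0, "-")]

def info_gain(split):
    # split maps a row to True/False, defining the two child nodes.
    counts = {True: [0, 0], False: [0, 0]}  # [negatives, positives] per child
    for row in data:
        counts[split(row)][row[3] == "+"] += 1
    n = len(data)
    children = sum((neg + pos) / n * entropy(pos, neg)
                   for neg, pos in counts.values())
    return entropy(4, 5) - children  # E.O. = Entropy(4+,5-) = 0.9911

print(round(info_gain(lambda r: r[0] == "T"), 4))  # a1: 0.2294
print(round(info_gain(lambda r: r[1] == "T"), 4))  # a2: 0.0072
for t in (3.0, 4.0, 5.0, 6.0, 7.0, 8.0):
    print(t, round(info_gain(lambda r, t=t: r[2] >= t), 4))  # a3 >= t gains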