Data Mining notes: 7th semester.
CS 1435
Syllabus:
Structured: tables (fixed schema).
Semi-structured: partially organized data, e.g. XML/JSON.
Unstructured: free text, images, videos, etc.
Statistics: a branch of science dealing with the collection and analysis of structured data in huge amounts.
Hypothesis: a statement about the data whose outcome can be tested against what is observed (see formal def.).
Null hypothesis: one variable doesn't affect the other.
Alternate hypothesis: one variable may affect the other.
Apart from AI, ML, and stats, data mining also draws on maths, DBMS, etc.
KDD – Knowledge Discovery in Databases. Its steps:
Data cleaning – remove noise/irrelevant data; make data uniform in terms of key and data attributes.
De-duplication of records – redundant data removed, and at times replaced by a key (according to the need of the user).
Data selection – choose the data from which patterns and knowledge will be extracted; domain experts, team members, and managers decide.
Data transformation – modification of data so that it can be evaluated, e.g. currency written as Rs/INR/etc. converted to one standard.
Data discretisation – dealing with numerical data by mapping it into intervals.
Missing data – e.g. a customer may not disclose their income to a salesperson, or an entry time may simply not be recorded.
Clean – fill missing values, identify outliers.
Ignore the tuple – only when a large percentage of its values is missing.
Filling missing data manually is not always feasible.
Use a global constant to fill with, like "unknown", "NA", or "?".
Fill with the most probable value, using a decision tree or Bayesian theory (a simplified sketch follows below).
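A rough Python sketch (my own illustration) of the two fill strategies above; the "probable value" is simplified here to the attribute mean rather than a decision tree or Bayesian model, and the income figures are made up.

def fill_missing(values, strategy="constant", constant="unknown"):
    # values: a list where None marks a missing entry
    present = [v for v in values if v is not None]
    if strategy == "mean":            # stand-in for a "most probable value" fill
        fill = sum(present) / len(present)
    else:                             # global constant like "unknown", "NA", or "?"
        fill = constant
    return [fill if v is None else v for v in values]

incomes = [30000, None, 45000, None, 52000]    # hypothetical attribute column
print(fill_missing(incomes, strategy="mean"))  # missing entries become ~42333.3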
Binning method – sort the data, partition it into bins (equal-width (distance) or equal-depth (frequency) bins), then smooth by bin means (see the Python sketch after the bin data below).
Clustering – detect and remove the outliers
Combined computer and human inspection – the computer detects suspicious values, a human verifies them.
Regression – smooth the data by fitting it to regression functions.
Equal width: W = (B − A)/N, where A = lowest value, B = highest value, N = number of intervals.
Equal depth: divides the range into N intervals, each containing approx. the same number of samples.
Assignment – download one noisy data set. Perform binning for data smoothing; check which method is better in terms of time.
Bin data: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Use a suitable data structure to place the values into bins (like bucket sort).
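A rough Python sketch of both binning methods on the bin data above, smoothing by bin means; the choice of 3 bins is my own assumption for illustration.

data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def equal_width_bins(values, n):
    # W = (B - A) / N: split [min, max] into n bins of equal width
    a, b = min(values), max(values)
    w = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        i = min(int((v - a) / w), n - 1)   # clamp the max value into the last bin
        bins[i].append(v)
    return bins

def equal_depth_bins(values, n):
    # each bin gets approximately the same number of samples
    s = sorted(values)
    size, extra = divmod(len(s), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(s[start:end])
        start = end
    return bins

def smooth_by_bin_means(bins):
    # replace every value in a bin by that bin's mean
    return [[round(sum(b) / len(b), 1)] * len(b) for b in bins if b]

print(smooth_by_bin_means(equal_depth_bins(data, 3)))
# -> [[18.0]*9, [28.1]*9, [43.8]*9]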
Data cleaning overall:
Discrepancy detection – causes: faulty instrumentation, human error, outdated data, inconsistency.
Unique rule – each value must be different for a particular attribute.
Consecutive rule – there can be no missing value in between the min and max values.
Null rule – specifies how blanks/?/special characters mark "not available"; use a global constant for them.
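A rough Python sketch (my own illustration) of checking the three rules above against one attribute's column of values.

def check_rules(column, null_tokens=("", "?", "NA", None)):
    issues = []
    non_null = [v for v in column if v not in null_tokens]
    if len(set(non_null)) != len(non_null):       # unique rule
        issues.append("unique rule violated: duplicate values")
    ints = sorted(v for v in non_null if isinstance(v, int))
    if ints and set(range(ints[0], ints[-1] + 1)) - set(ints):   # consecutive rule
        issues.append("consecutive rule violated: gaps between min and max")
    if len(non_null) != len(column):              # null rule
        issues.append("null rule: blanks/?/NA present, fill with a global constant")
    return issues

print(check_rules([101, 102, "?", 104]))   # flags the gap (103) and the "?"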
Tools – data scrubbing tools (useful in classification), data auditing tools, and ETL tools, which transform data through a graphical user interface.
Assignment 2: tools – explore them, whether open source or commercial; explore all the tools on salesforce.com and classify them as open source or commercial, with one line on each. Take the data from the previous assignment and use any 2 tools. Observation – write the features available, then show the output and state which tool is better.
Data integration: convert the same data to one standard so values are unique, e.g. 1 hr vs 60 min, or cm to m and vice-versa.
Handling redundant data: chi-square test
Correlation test for nominal data: for attributes A (values a_1..a_c) and B (values b_1..b_r) over n data tuples,
χ² = Σ_{i=1..c} Σ_{j=1..r} (o_ij − e_ij)² / e_ij,
where o_ij = observed frequency of (a_i, b_j) and e_ij = count(A=a_i) × count(B=b_j) / n = expected frequency.
Example 3.1, Han & Kamber, Data Mining: Concepts and Techniques.
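A rough Python sketch of the χ² formula above, given a contingency table of observed counts; the 2x2 counts below are illustrative.

def chi_square(observed):
    # observed[i][j] = o_ij, the observed frequency of (A = a_i, B = b_j)
    n = sum(sum(row) for row in observed)            # total number of tuples
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n          # expected frequency e_ij
            chi2 += (o - e) ** 2 / e
    return chi2

print(round(chi_square([[250, 200], [50, 1000]]), 2))  # ~507.93: strongly correlated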
Data transformation: smoothing (regression, clustering, binning)
Attribute construction (feature construction) – e.g. a college-level student DB: check what the data problem is and select/construct attributes accordingly.
Aggregation – summarization, e.g. monthly attendance or monthly income: calculate per month, then calculate for the year.
Normalization – scale values down to a common range according to the size of the data, e.g. scores for acceptance of a paper.
Discretization – place values at some range level, e.g. 1st-year students' ages fall in 17-18.
Concept hierarchy generation – e.g. dept → NIT Silchar → Assam → India; instead of the full path we can write Silchar, Assam, India.
Min-max normalization: v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A, where v = observed value of attribute A, min_A/max_A = the minimum/maximum values of A, and new_min_A/new_max_A = the range we want to normalize to.
Min-max normalization preserves the relationships among the original values; it detects out-of-bound errors, i.e. a future input falling outside the original range of A.
Example 3.4 (Han & Kamber).
z-score normalization: v' = (v − Ā) / σ_A, where Ā = mean and σ_A = standard deviation of attribute A.
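A rough Python sketch of both normalizations above; the income values (min 12,000, max 98,000) echo the kind used in the book's example but are otherwise illustrative.

def min_max(values, new_min=0.0, new_max=1.0):
    # v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    # v' = (v - mean_A) / stddev_A
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [12000, 54000, 73600, 98000]
print(min_max(incomes))   # 73600 maps to ~0.716 in [0, 1]
print(z_score(incomes))   # values centered at 0 with unit variance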
Data reduction strategies: the reduced representation should closely maintain the integrity of the original data. Wavelet transform: for continuous data.
Parametric – the model's parameters are stored, not the full data (e.g. regression); non-parametric – clustering, histograms, etc. are used.
Lossy vs lossless: whether or not data is lost when getting back the original data.
CLASSIFICATION: analyze the measurements of an object to identify the category to which it belongs.
Learning step: the classifier is built from training data with predetermined classes.
Classification step: the classifier is used to classify future or unknown objects.
E.g. deciding whether it is good to give a customer a loan or not, i.e. whether the customer is safe or risky.
Decision tree induction: a classifier in the form of a tree.
Learning a decision tree from class-labeled training tuples; a greedy (top-down) strategy is needed.
ID3: Iterative Dichotomiser – a decision tree algorithm.
C4.5: successor of ID3.
CART: Classification and Regression Trees.
Basic decision tree algorithm: to select a branch, choose the most suitable attribute, then create the tree data structure.
Entropy: E(X) = −Σᵢ pᵢ log₂(pᵢ); each term pᵢ log₂(pᵢ) is negative (since pᵢ ≤ 1), so the leading minus sign makes E(X) non-negative.
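A rough Python sketch of the entropy formula; the probabilities below are made-up class distributions.

from math import log2

def entropy(probs):
    # E(X) = -sum(p_i * log2(p_i)); a p_i of 0 contributes 0 by convention
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: maximally impure two-class split
print(entropy([0.9, 0.1]))   # ~0.47: purer distribution, lower entropy
print(entropy([1.0]))        # 0.0: a pure node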