Data Mining Cheat Sheet
Data Mining Steps
1. Data Cleaning: Removal of noise and inconsistent records
2. Data Integration: Combining multiple sources
3. Data Selection: Only data relevant for the task are retrieved from the database
4. Data Transformation: Converting data into a form more appropriate for mining
5. Data Mining: Application of intelligent methods to extract data patterns
6. Model Evaluation: Identification of truly interesting patterns representing knowledge
7. Knowledge Presentation: Visualization or other knowledge presentation techniques

Data mining could also be called Knowledge Discovery in Databases (see kdnuggets.com)

Types of Attributes
Nominal: e.g., ID numbers, eye color, zip codes
Ordinal: e.g., rankings, grades, height
Interval: e.g., calendar dates, temperatures
Ratio: e.g., length, time, counts

Distance Measures
Manhattan = City Block
Jaccard coefficient, Hamming, and Cosine are similarity / dissimilarity measures
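
A minimal Python sketch of these measures (function names are illustrative; Jaccard here is the similarity version for binary vectors):

from math import sqrt

def manhattan(x, y):
    # city block distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    # number of positions at which the two vectors differ
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):
    # similarity for binary vectors: 1-1 matches over positions that are not both 0
    m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    m00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return m11 / (len(x) - m00)

def cosine(x, y):
    # similarity: dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (euclidean(x, [0] * len(x)) * euclidean(y, [0] * len(y)))

print(manhattan([2, 3], [5, 7]), hamming([1, 0, 1], [1, 1, 1]), jaccard([1, 0, 1], [1, 1, 1]))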
Measures of Node Impurity
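A minimal sketch of the three usual impurity measures (Gini index, entropy, classification error) for a node whose class distribution is given as counts:

from math import log2

def node_impurity(counts):
    # counts: class counts at a tree node, e.g., [5, 3, 2]
    n = sum(counts)
    p = [c / n for c in counts]
    gini = 1 - sum(pi ** 2 for pi in p)
    entropy = -sum(pi * log2(pi) for pi in p if pi > 0)
    classification_error = 1 - max(p)
    return gini, entropy, classification_error

print(node_impurity([5, 5]))   # maximum impurity for two classes: (0.5, 1.0, 0.5)
print(node_impurity([10, 0]))  # pure node: (0.0, 0.0, 0.0)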
Model Evaluation
Kappa = (observed agreement - chance agreement) / (1 - chance agreement)
Kappa = (Dreal - Drandom) / (Dperfect - Drandom), where D indicates the sum of the values on the diagonal of the confusion matrix
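
A small worked sketch of the second form on a 2x2 confusion matrix (the counts are made up):

def kappa(cm):
    # cm: square confusion matrix, rows = actual class, columns = predicted class
    total = sum(sum(row) for row in cm)
    d_real = sum(cm[i][i] for i in range(len(cm)))   # observed diagonal sum
    d_perfect = total                                # diagonal sum if every prediction were correct
    # diagonal sum expected by chance, from the row and column totals
    d_random = sum(sum(cm[i]) * sum(row[i] for row in cm) / total for i in range(len(cm)))
    return (d_real - d_random) / (d_perfect - d_random)

print(kappa([[20, 5], [10, 15]]))  # 0.4: agreement beyond what chance alone would give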
K-Nearest Neighbor
* Compute distance between two points
* Determine the class from nearest neighbor list
* Take the majority vote of class labels among the k-nearest neighbors
* Weigh the vote according to distance
K-Nearest Neighbor (cont)
* weight factor, w = 1 / d^2
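
A minimal sketch tying the steps above together, using Euclidean distance and the w = 1 / d^2 weight (names and data are illustrative):

from math import sqrt
from collections import defaultdict

def knn_predict(train, query, k=3):
    # train: list of (point, label) pairs; query: an unlabeled point
    def dist(x, y):
        return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    neighbors = sorted(train, key=lambda rec: dist(rec[0], query))[:k]  # k nearest records
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(point, query)
        votes[label] += 1.0 / (d ** 2 + 1e-9)   # distance-weighted vote, w = 1 / d^2
    return max(votes, key=votes.get)            # class with the largest weighted vote

print(knn_predict([([1, 1], "A"), ([1, 2], "A"), ([8, 8], "B")], [2, 2], k=3))  # "A"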
Rule-based Classification
Classify records by using a collection of “if…then…” rules
Rule: (Condition) --> y
where:
* Condition is a conjunction of attribute tests
* y is the class label
LHS: rule antecedent or condition
RHS: rule consequent
Examples of classification rules:
(Blood Type=Warm) ^ (Lay Eggs=Yes) --> Birds
(Taxable Income < 50K) ^ (Refund=Yes) --> Evade=No
Sequential covering is an algorithm for building rule-based classifiers.
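
A toy sketch of applying an ordered rule list of this form (the record and field names are illustrative):

def classify(record, rules, default=None):
    # rules: ordered (condition, class label) pairs; the first rule whose
    # condition (a conjunction of attribute tests) matches the record fires
    for condition, label in rules:
        if condition(record):
            return label
    return default  # default class when no rule covers the record

rules = [
    (lambda r: r["blood_type"] == "warm" and r["lays_eggs"], "Birds"),
    (lambda r: r["taxable_income"] < 50000 and r["refund"], "Evade=No"),
]
print(classify({"blood_type": "warm", "lays_eggs": True,
                "taxable_income": 20000, "refund": True}, rules))  # Birds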
Rule Evaluation

Bayesian Classification
p(a,b) is the probability that both a and b happen.
p(a|b) is the probability that a happens, knowing that b has already happened.
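
These definitions combine into Bayes' theorem, which underlies the probabilistic classifier listed under Terms; a tiny numeric sketch with made-up counts:

# joint counts over 100 records (made-up numbers)
n_total = 100
n_b = 40         # records where b holds
n_a_and_b = 10   # records where both a and b hold

p_b = n_b / n_total
p_a_and_b = n_a_and_b / n_total
p_a_given_b = p_a_and_b / p_b       # p(a|b) = p(a,b) / p(b)
print(p_a_given_b)                  # 0.25
# Bayes' theorem rearranges the same identity: p(a|b) = p(b|a) * p(a) / p(b)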
Terms
Association Analysis: Min-Apriori, LIFT, Simpson's Paradox, Anti-monotone property
Ensemble Methods: Stacking, Random Forest
Terms (cont)
Decision Trees: C4.5, Pessimistic estimate, Occam's Razor, Hunt's Algorithm
Model Evaluation: Cross-validation, Bootstrap, Leave-one-out (C-V), Misclassification error rate, Repeated holdout, Stratification
Bayes: Probabilistic classifier
Data Visualization: Chernoff faces, Data cube, Percentile plots, Parallel coordinates
Nonlinear Dimensionality Reduction: Principal components, ISOMAP, Multidimensional scaling

Rules Analysis
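The usual rule-analysis quantities are support, confidence, and lift (LIFT also appears in the Terms box); a minimal sketch under that assumption, with transactions as sets of items:

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # confidence of the rule lhs --> rhs
    return support(lhs | rhs, transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    # lift > 1: lhs and rhs occur together more often than expected if independent
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

T = [{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer"}, {"bread", "milk", "beer"}]
print(lift({"bread"}, {"milk"}, T))  # ~0.89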
Ensemble Techniques
Manipulate training data: bagging and boosting (ensemble of “experts”, each specializing on different portions of the instance space)
Manipulate output values: error-correcting output coding (ensemble of “experts”, each predicting 1 bit of the {multibit} full class label)
Methods: BAGGing, Boosting, AdaBoost
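
A minimal bagging sketch: each "expert" is trained on a bootstrap sample of the training data and the ensemble takes a majority vote (the 1-nearest-neighbor base learner and the data are illustrative):

import random
from collections import Counter

def bagging_predict(train, query, n_models=11):
    # base learner: 1-nearest neighbor on a given sample
    def nearest_label(sample, q):
        return min(sample, key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], q)))[1]
    votes = []
    for _ in range(n_models):
        sample = [random.choice(train) for _ in train]  # bootstrap: sample with replacement
        votes.append(nearest_label(sample, query))      # each expert votes
    return Counter(votes).most_common(1)[0][0]          # majority vote

train = [([1, 1], "A"), ([1, 2], "A"), ([2, 1], "A"), ([8, 8], "B"), ([9, 8], "B")]
print(bagging_predict(train, [1.5, 1.5]))  # almost surely "A"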
Apriori Algorithm
Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
    Generate length (k+1) candidate itemsets from length k frequent itemsets
    Prune candidate itemsets containing subsets of length k that are infrequent
    Count the support of each candidate by scanning the DB
    Eliminate candidates that are infrequent, leaving only those that are frequent
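
A compact sketch of that loop using frozensets; it favors readability over the candidate-generation optimizations of the full algorithm:

from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # k = 1: frequent itemsets of length 1
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) / n >= min_support}
    all_frequent, k = set(freq), 1
    while freq:
        # generate length (k+1) candidates from length k frequent itemsets
        candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # prune candidates with an infrequent length-k subset (anti-monotone property)
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # count support by scanning the transactions; keep only the frequent ones
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) / n >= min_support}
        all_frequent |= freq
        k += 1
    return all_frequent

T = [{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer"}, {"bread", "milk", "beer"}]
print(apriori(T, min_support=0.5))  # all single items plus the three pairs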
K-means Clustering
Select K points as the initial centroids
repeat
Form K Clusters by assigning all points to the closest centroid
Recompute the centroid of each cluster
until the centroids don't change
Closeness is measured by distance (e.g., Euclidean), similarity (e.g., Cosine), or correlation.
Centroid is typically the mean of the points in the cluster
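
A minimal sketch of the loop above, with Euclidean closeness and the centroid as the mean of each cluster's points (data and seed are illustrative):

import random

def kmeans(points, k, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)        # select K points as the initial centroids
    while True:
        # form K clusters by assigning every point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # recompute each centroid as the mean of the points in its cluster
        new_centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
                         for j, cl in enumerate(clusters)]
        if new_centroids == centroids:          # stop when the centroids don't change
            return centroids, clusters
        centroids = new_centroids

print(kmeans([(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)], k=2))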
Hierarchical Clustering
Single-Link or MIN
Similarity of two clusters is based on the two most similar (closest / minimum) points in the different clusters
Determined by one pair of points, i.e., by one link in the proximity graph.
Complete or MAX
Similarity of two clusters is based on the two least similar (most distant, maximum) points in the different clusters
Determined by all pairs of points in the two clusters
Group Average
Proximity of two clusters is the average of pairwise proximity between points in the two clusters
Agglomerative clustering starts with points as individual clusters and merges the closest clusters until only one cluster is left.
Divisive clustering starts with one, all-inclusive cluster and splits a cluster until each cluster only has one point.

Dendrogram Example
Dataset: {7, 10, 20, 28, 35}
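
A minimal single-link (MIN) agglomerative sketch on the dendrogram example's one-dimensional dataset; each step reports which clusters merge and at what distance:

def single_link(points):
    clusters = [[p] for p in points]       # start with every point as its own cluster
    merges = []
    while len(clusters) > 1:
        # single link: cluster distance = distance of the closest pair of points
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: min(abs(x - y) for x in clusters[ab[0]] for y in clusters[ab[1]]))
        d = min(abs(x - y) for x in clusters[i] for y in clusters[j])
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]   # merge the two closest clusters
        del clusters[j]
    return merges

for left, right, d in single_link([7, 10, 20, 28, 35]):
    print(left, "+", right, "at distance", d)   # [7]+[10] first, then [28]+[35], ...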
Density-Based Clustering
current_cluster_label <-- 1
for all core points do
    if the core point has no cluster label then
        current_cluster_label <-- current_cluster_label + 1
        Label the current core point with the cluster label
    end if
    for all points in the Eps-neighborhood, except the point itself do
        if the point does not have a cluster label then
            Label the point with the cluster label
        end if
    end for
end for
Density-Based Clustering (cont)
DBSCAN is a popular algorithm
Density = number of points within a specified radius (Eps)
A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points that are at the interior of a cluster.
A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
A noise point is any point that is not a core point or a border point.
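
A compact sketch of these definitions and of the labeling pseudocode above (Eps-neighborhoods computed by brute force; parameters and points are illustrative):

def dbscan(points, eps, min_pts):
    def neighbors(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2]
    # core points: at least MinPts points within Eps (the neighborhood includes the point itself here)
    core = [i for i in range(len(points)) if len(neighbors(i)) >= min_pts]
    labels = {}                      # point index -> cluster label
    current_cluster_label = 0
    for i in core:
        if i not in labels:
            current_cluster_label += 1          # start a new cluster at an unlabeled core point
            labels[i] = current_cluster_label
        for j in neighbors(i):                  # label the core point's Eps-neighborhood
            if j not in labels:
                labels[j] = labels[i]
    noise = [i for i in range(len(points)) if i not in labels]  # neither core nor border
    return labels, noise

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))  # two clusters; (50, 50) is left as noise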
Other Clustering Methods
Fuzzy is a partitional clustering method. Fuzzy clustering (also referred to as soft clustering) is a form of clustering in which each data point can belong to more than one cluster.
Graph-based methods: Jarvis-Patrick, Shared-Near Neighbor (SNN, Density), Chameleon
Model-based methods: Expectation-Maximization

Anomaly Detection
An anomaly is a pattern in the data that does not conform to the expected behavior (e.g., outliers, exceptions, peculiarities, surprise)
Types of Anomaly
Point: An individual data instance is anomalous w.r.t. the data
Contextual: An individual data instance is anomalous within a context
Collective: A collection of related data instances is anomalous
Approaches
* Graphical (e.g., boxplots, scatter plots)
* Statistical (e.g., normal distribution, likelihood)
| Parametric Techniques
| Non-parametric Techniques
* Distance (e.g., nearest-neighbor, density, clustering)
Local outlier factor (LOF) is a density-based distance approach
Mahalanobis Distance is a clustering-based distance approach
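
A minimal sketch of the Mahalanobis-distance idea: score points by their covariance-scaled distance from the center of a reference sample of normal data (the arrays are made up):

import numpy as np

def mahalanobis(x, X_ref):
    # distance of point x from the center of the reference data,
    # scaled by the reference covariance; large values flag candidate anomalies
    mu = X_ref.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X_ref, rowvar=False))
    diff = x - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

X_ref = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [1.0, 1.1], [1.2, 1.0], [0.8, 0.9]])
print(mahalanobis(np.array([1.0, 1.05]), X_ref))  # small: consistent with the normal data
print(mahalanobis(np.array([5.0, 5.0]), X_ref))   # large: a point anomaly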
Regression Analysis
* Linear Regression
| Least squares
* Subset selection
* Stepwise selection
* Regularized regression
| Ridge
| Lasso
| Elastic Net
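
A minimal closed-form sketch of least squares and ridge regression (lasso and elastic net need iterative solvers, so only the L2 penalty is shown; lam is the regularization strength, and the data are made up):

import numpy as np

def fit_linear(X, y, lam=0.0):
    # ordinary least squares when lam = 0; ridge regression when lam > 0
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend an intercept column
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0                         # do not penalize the intercept
    # closed form: w = (X^T X + lam * I)^-1 X^T y
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(fit_linear(X, y))             # ~[0.0, 2.0]: intercept and slope
print(fit_linear(X, y, lam=10.0))   # ridge shrinks the slope toward zero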