Data Mining notes: 7th semester.
CS 1435
Syllabus:
Structured: tables (fixed schema).
Semi-structured: partially organized data, e.g. XML/JSON.
Unstructured: free text, images, videos, etc.
Statistics: a branch of science dealing with the collection and analysis of structured data in huge amounts.
Hypothesis: a statement about the data whose outcome can be tested against what is observed (see formal def.).
Null hypothesis: one variable doesn't affect the other.
Alternate hypothesis: one variable may affect the other.
Apart from AI, ML, and stats, data mining also draws on maths, DBMS, etc.
KDD – Knowledge Discovery in Databases. Its steps:
Data cleaning – remove noise/irrelevant data; make data uniform in terms of key and data attributes.
De-duplication of records – redundant data removed, and at times replaced by a key (according to the need of the user).
Data selection – choose the data from which patterns and knowledge will be extracted; domain experts, team members, and managers decide.
Data transformation – modification of data so that it can be evaluated, e.g. currency written as Rs/INR/etc. converted to one standard.
Data discretisation – dealing with numerical data by mapping it into intervals.
Missing data – e.g. a customer may not disclose their income to a salesperson, or an entry time may simply not be recorded.
Clean – fill missing values, identify outliers.
Ignore the tuple – only when a large percentage of its values is missing.
Filling missing data manually is not always feasible.
Use a global constant to fill with, like "unknown", "NA", or "?".
Fill with the most probable value, using a decision tree or Bayesian theory (a simplified sketch follows below).
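A rough Python sketch (my own illustration) of the two fill strategies above; the "probable value" is simplified here to the attribute mean rather than a decision tree or Bayesian model, and the income figures are made up.

def fill_missing(values, strategy="constant", constant="unknown"):
    # values: a list where None marks a missing entry
    present = [v for v in values if v is not None]
    if strategy == "mean":            # stand-in for a "most probable value" fill
        fill = sum(present) / len(present)
    else:                             # global constant like "unknown", "NA", or "?"
        fill = constant
    return [fill if v is None else v for v in values]

incomes = [30000, None, 45000, None, 52000]    # hypothetical attribute column
print(fill_missing(incomes, strategy="mean"))  # missing entries become ~42333.3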
Binning method – sort the data, partition it into bins (equal-width (distance) or equal-depth (frequency) bins), then smooth by bin means (see the Python sketch after the bin data below).
Clustering – detect and remove the outliers
Combined computer and human inspection – the computer detects suspicious values, a human verifies them.
Regression – smooth the data by fitting it to regression functions.
Equal width: W = (B − A)/N, where A = lowest value, B = highest value, N = number of intervals.
Equal depth: divides the range into N intervals, each containing approx. the same number of samples.
Assignment – download one noisy data set. Perform binning for data smoothing; check which method is better in terms of time.
Bin data: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Use a suitable data structure to place the values into bins (like bucket sort).
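A rough Python sketch of both binning methods on the bin data above, smoothing by bin means; the choice of 3 bins is my own assumption for illustration.

data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def equal_width_bins(values, n):
    # W = (B - A) / N: split [min, max] into n bins of equal width
    a, b = min(values), max(values)
    w = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        i = min(int((v - a) / w), n - 1)   # clamp the max value into the last bin
        bins[i].append(v)
    return bins

def equal_depth_bins(values, n):
    # each bin gets approximately the same number of samples
    s = sorted(values)
    size, extra = divmod(len(s), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(s[start:end])
        start = end
    return bins

def smooth_by_bin_means(bins):
    # replace every value in a bin by that bin's mean
    return [[round(sum(b) / len(b), 1)] * len(b) for b in bins if b]

print(smooth_by_bin_means(equal_depth_bins(data, 3)))
# -> [[18.0]*9, [28.1]*9, [43.8]*9]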
Data cleaning overall:
Discrepancy detection – causes: faulty instrumentation, human error, outdated data, inconsistency.
Unique rule – each value must be different for a particular attribute.
Consecutive rule – there can be no missing value in between the min and max values.
Null rule – specifies how blanks/?/special characters mark "not available"; use a global constant for them.
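A rough Python sketch (my own illustration) of checking the three rules above against one attribute's column of values.

def check_rules(column, null_tokens=("", "?", "NA", None)):
    issues = []
    non_null = [v for v in column if v not in null_tokens]
    if len(set(non_null)) != len(non_null):       # unique rule
        issues.append("unique rule violated: duplicate values")
    ints = sorted(v for v in non_null if isinstance(v, int))
    if ints and set(range(ints[0], ints[-1] + 1)) - set(ints):   # consecutive rule
        issues.append("consecutive rule violated: gaps between min and max")
    if len(non_null) != len(column):              # null rule
        issues.append("null rule: blanks/?/NA present, fill with a global constant")
    return issues

print(check_rules([101, 102, "?", 104]))   # flags the gap (103) and the "?"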
Tools – data scrubbing tools (useful in classification), data auditing tools, and ETL tools, which transform data through a graphical user interface.
Assignment 2: tools – explore them, whether open source or commercial; explore all the tools on salesforce.com and classify them as open source or commercial, with one line on each. Take the data from the previous assignment and use any 2 tools. Observation – write the features available, then show the output and state which tool is better.
Data integration: convert the same data to one standard so values are unique, e.g. 1 hr vs 60 min, or cm to m and vice-versa.
Handling redundant data: chi-square test
Correlation test for nominal data: for attributes A (values a_1..a_c) and B (values b_1..b_r) over n data tuples,
χ² = Σ_{i=1..c} Σ_{j=1..r} (o_ij − e_ij)² / e_ij,
where o_ij = observed frequency of (a_i, b_j) and e_ij = count(A=a_i) × count(B=b_j) / n = expected frequency.
Example 3.1, Han & Kamber, Data Mining: Concepts and Techniques.
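A rough Python sketch of the χ² formula above, given a contingency table of observed counts; the 2x2 counts below are illustrative.

def chi_square(observed):
    # observed[i][j] = o_ij, the observed frequency of (A = a_i, B = b_j)
    n = sum(sum(row) for row in observed)            # total number of tuples
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n          # expected frequency e_ij
            chi2 += (o - e) ** 2 / e
    return chi2

print(round(chi_square([[250, 200], [50, 1000]]), 2))  # ~507.93: strongly correlated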
Data transformation: smoothing (regression, clustering, binning)
Attribute construction (feature construction) – e.g. a college-level student DB: check what the data problem is and select/construct attributes accordingly.
Aggregation – summarization, e.g. monthly attendance or monthly income: calculate per month, then calculate for the year.
Normalization – scale values down to a common range according to the size of the data, e.g. scores for acceptance of a paper.
Discretization – place values at some range level, e.g. 1st-year students' ages fall in 17-18.
Concept hierarchy generation – e.g. dept → NIT Silchar → Assam → India; instead of the full path we can write Silchar, Assam, India.
Min-max normalization: v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A, where v = observed value of attribute A, min_A/max_A = the minimum/maximum values of A, and new_min_A/new_max_A = the range we want to normalize to.
Min-max normalization preserves the relationships among the original values; it detects out-of-bound errors, i.e. a future input falling outside the original range of A.
Example 3.4 (Han & Kamber).
z-score normalization: v' = (v − Ā) / σ_A, where Ā = mean and σ_A = standard deviation of attribute A.
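A rough Python sketch of both normalizations above; the income values (min 12,000, max 98,000) echo the kind used in the book's example but are otherwise illustrative.

def min_max(values, new_min=0.0, new_max=1.0):
    # v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    # v' = (v - mean_A) / stddev_A
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [12000, 54000, 73600, 98000]
print(min_max(incomes))   # 73600 maps to ~0.716 in [0, 1]
print(z_score(incomes))   # values centered at 0 with unit variance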
Data reduction strategies: the reduced representation should closely maintain the integrity of the original data. Wavelet transform: for continuous data.
Parametric – the model's parameters are stored, not the full data (e.g. regression); non-parametric – clustering, histograms, etc. are used.
Lossy vs lossless: whether or not data is lost when getting back the original data.
CLASSIFICATION: analyze the measurements of an object to identify the category to which it belongs.
Learning step: the classifier is built from training data with predetermined classes.
Classification step: the classifier is used to classify future or unknown objects.
E.g. deciding whether it is good to give a customer a loan or not, i.e. whether the customer is safe or risky.
Decision tree induction: a classifier in the form of a tree.
Learning a decision tree from class-labeled training tuples; a greedy (top-down) strategy is needed.
ID3: Iterative Dichotomiser – a decision tree algorithm.
C4.5: successor of ID3.
CART: Classification and Regression Trees.
Basic decision tree algorithm: to select a branch, choose the most suitable attribute, then create the tree data structure.
Entropy: E(X) = −Σᵢ pᵢ log₂(pᵢ); each term pᵢ log₂(pᵢ) is negative (since pᵢ ≤ 1), so the leading minus sign makes E(X) non-negative.
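A rough Python sketch of the entropy formula; the probabilities below are made-up class distributions.

from math import log2

def entropy(probs):
    # E(X) = -sum(p_i * log2(p_i)); a p_i of 0 contributes 0 by convention
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: maximally impure two-class split
print(entropy([0.9, 0.1]))   # ~0.47: purer distribution, lower entropy
print(entropy([1.0]))        # 0.0: a pure node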