
Data Mining

Chapter 2: Data Preprocessing

Khwopa College of Engineering- Dr. Jhanak Parajuli


Data Preprocessing

❏ Why data preprocessing?

❏ What is Data cleaning ?

❏ Data integration and transformation

❏ Data reduction

❏ Discretization and concept hierarchy generation


Process of Knowledge Discovery
Selection → Pre-processing → Data mining → Post-processing

❏ Raw Data: collected in databases, flat files, cloud, or any other sources
❏ Target Data (Selection): only the selected data collected in databases, flat files, etc.
❏ Processed Data (Pre-processing): data is processed using different techniques such as normalization, feature selection, dimensionality reduction, data subsetting, etc.
❏ Data Patterns (Data mining): discover patterns in the data using different statistical and modern machine learning techniques, which help in predictive analytics
❏ Knowledge (Post-processing): visualize the data to obtain proper knowledge and make future business predictions
Why Data Preprocessing?

❏ Real world data is dirty


■ incomplete: missing attribute values or features, lacking desired attributes, or containing only aggregate data
■ noisy: containing errors or outliers
■ inconsistent: containing discrepancies in codes or names

❏ Garbage in, Garbage Out


■ Quality decisions must be based on quality data
■ Data warehouse needs consistent integration of quality data
■ Required for both OLAP and Data Mining!
Why are Data incomplete?

❏ Attributes of interest are not available at the time of recording data


❏ (e.g., customer information for sales transaction data)

❏ Data were not considered important at the time of transactions, so they were
not recorded!

❏ Data not recorded because of misunderstandings or malfunctions

❏ Data may have been recorded and later deleted!

❏ Missing/unknown values for some data


Why are Data noisy?

❏ Faulty instruments for data collection

❏ Human or computer errors

❏ Errors in data transmission

❏ Technology limitations (e.g., sensor data come at a faster rate than they can
be processed)

❏ Inconsistencies in naming conventions or data codes (e.g., 2/5/2002 could be


2 May 2002 or 5 Feb 2002)

❏ Duplicate tuples (e.g., records received twice) should also be removed


Data Preprocessing Tasks
❏ Data Cleaning
❏ Fill in missing values
❏ Smoothen noisy data
❏ Identify and remove outliers
❏ Resolve inconsistencies
❏ Data Integration
❏ Integration of multiple database, data cubes or files
❏ Data Transformation
❏ Data normalization/scaling samples to have unit norm
❏ Standardization/mean removal and variance scaling
❏ Aggregation
❏ Data Reduction
❏ Dimensionality Reduction
❏ Data Discretization/quantization/binning
❏ Finite grouping (K-bins discretization)
❏ Feature binarization
Data Preprocessing Tasks
❏ Data Cleaning
❏ Data Integration
❏ Data Transformation
❏ Data Reduction
❏ Data Discretization
Data Cleaning

❏ Fill in missing values/Imputation of missing values


❏ Identify outliers
❏ smoothen noisy data
❏ Resolve inconsistencies e.g. duplicate entries

[Example of dirty data: an extreme income value ("a mistake or a millionaire?"), missing values, and inconsistent duplicate entries]


Noise

❏ Noise modifies the original value.


❏ Noise creates chaos in the data and might break the desired patterns
Outliers

❏ Outliers are data objects with characteristics that are considerably different
than most of the other data objects in the data set
Handling Missing Values

❏ Eliminate data objects:


❏ This is done when the missing values are few, so that dropping them does not lose a lot of data points
❏ Estimate missing values:
❏ Use statistical measures such as mean or median of the attributes (preferred when the values
are numerical)
❏ Use attribute mean or median for all samples belonging to the same class (classification
problems)
❏ Use inference based approach such as Bayesian formula or decision based tree
❏ Ignore missing values during analysis:
❏ If missing values greatly outnumber the known values, filling them in may distort the analysis.
❏ Replace the missing values with possible values (a pandas sketch follows below):
❏ Fill in manually: a tedious task for a large dataset (not preferred)
❏ Use a global constant to fill in the missing values (e.g., 0, NaN, "Unknown", etc.) -- not preferred for all datasets
❏ forward fill: propagate the previous observed value forward into the missing slot (not preferred for all datasets)
❏ backward fill: fill the missing value with the next observed value (not preferred for all datasets)
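
A minimal pandas sketch of the strategies above; the dataframe and its values are hypothetical and only meant to illustrate the calls.

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [23, 39, 45, np.nan],
                   "income": [24200, np.nan, 45390, 52000]})

print(df.dropna())           # eliminate data objects (rows) with missing values
print(df.fillna(df.mean()))  # estimate with the column mean
print(df.fillna(df.median()))# estimate with the column median
print(df.fillna(0))          # global constant (not preferred for all datasets)
print(df.ffill())            # forward fill: propagate the previous value
print(df.bfill())            # backward fill: use the next value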
Handling Missing Values

Age Income Religion Gender


23 24,200 Muslim M
39 ? Christian F
45 45,390 ? F

Fill missing values using aggregate functions (e.g., average) or probabilistic


estimates on global value distribution
E.g., put the average income here, or put the most probable income based
on the fact that the person is 39 years old
E.g., put the most frequent religion here
Handling Missing Values

Python → sklearn.impute.SimpleImputer
❏ The SimpleImputer class provides basic strategies for imputing missing values. Missing values can
be imputed with a provided constant value, or using the statistics (mean, median or most frequent)
of each column in which the missing values are located. This class also allows for different missing
values encodings.

❏ The SimpleImputer class also supports categorical data represented as string values or pandas
categoricals when using the 'most_frequent' or 'constant' strategy:
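
A short sketch of SimpleImputer along these lines; the numeric array and the religion column are illustrative only (they echo the example table on the next slide).

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numeric columns: impute missing entries with the column mean
X = np.array([[23, 24200.0], [39, np.nan], [45, 45390.0]])
imp_num = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imp_num.fit_transform(X))

# Categorical column: impute with the most frequent value
df = pd.DataFrame({"religion": ["Muslim", "Christian", np.nan]})
imp_cat = SimpleImputer(strategy="most_frequent")
print(imp_cat.fit_transform(df))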
Handling Data Noise

❏ Binning
❏ sort data and partition into equi-depth and equi-width bins
❏ smooth by bin means, bin median, bin boundaries, etc.
❏ Regression
❏ smooth by fitting a regression function
❏ Clustering
❏ detect and remove outliers
❏ Combined Computer and Human Inspection
❏ detect suspicious values automatically and check by human
❏ Use concept hierarchies
❏ e.g., price value -> “expensive”
Handling Data Noise (Binning)

❏ Equal-width (Distance) binning


❏ Divides the range into N intervals of equal size
❏ Width of intervals: W = (max − min) / N
❏ Simple
❏ Outliers may dominate result
❏ Equal-depth (frequency) binning
❏ Divides the range into N intervals, each containing approximately same
number of records
❏ Skewed data is also handled well

Equal Depth Binning


Handling Data Noise (Binning)

❏ Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
❏ Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

❏ Smoothing by bin means:


- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

❏ Smoothing by bin boundaries: [4,15],[21,25],[26,34]


- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
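
A small pandas sketch of equal-depth binning and smoothing by bin means on the price values above. pd.qcut reproduces the same three bins; the exact bin means are 9, 22.75 and 29.25, which the slide rounds to 9, 23 and 29.

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth (frequency) binning: 3 bins with roughly the same number of records
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value by the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())

# Equal-width binning for comparison: 3 intervals of equal size
print(pd.cut(prices, bins=3).value_counts().sort_index())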
Handling Data Noise (Regression)

❏ Replace noisy data or missing data by predicted values


❏ Example: Linear Regression on missing continuous valued data
Handling Data Noise (Clustering)

❏ K-means Clustering is the most popular clustering technique:


❏ The K-Means algorithm clusters data by trying to separate samples in K groups of equal
variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares.

[Figure: K-means clusters with a few outlier points lying far from the cluster centers]
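
A minimal sketch of flagging outliers with K-means: cluster the data, then flag points that lie unusually far from their assigned cluster center. The synthetic data and the mean + 3*std distance threshold are assumptions made for illustration, not a fixed rule.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # cluster around (0, 0)
               rng.normal(8, 1, (50, 2)),      # cluster around (8, 8)
               [[20.0, 20.0]]])                # one obvious outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)  # distance to assigned center

outliers = dist > dist.mean() + 3 * dist.std()  # assumed threshold
print(np.where(outliers)[0])                    # index 100 -> the injected outlier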
Combined Computer and Human Inspection

❏ Inconsistent data are handled by:


❏ Manual correction (expensive and tedious)
❏ Use routines designed to detect inconsistencies and manually correct
them. E.g., the routine may check global constraints (e.g., age > 10) or
functional dependencies
❏ Other inconsistencies (e.g., between names of the same attribute) can be
corrected during the data integration process
❏ Detect suspicious values automatically
❏ Use statistical formula to remove outliers
❏ By calculating the Z-score = (X - mean)/sigma; values with |Z-score| > 3 are treated as outliers
❏ By histogram plot
❏ By using boxplot or Interquartile range (IQR)
Boxplot

❏ Boxplot: Use boxplot to detect outliers and remove them
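
A short sketch of the Z-score rule from the previous slide and the IQR (boxplot) rule on a one-dimensional series. The income values are made up; 3 and 1.5 are the conventional thresholds.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(np.append(rng.normal(25_000, 2_000, 50), 10_000_000))  # a mistake or a millionaire?

# Z-score rule: |z| > 3 is treated as an outlier
z = (income - income.mean()) / income.std()
print(income[z.abs() > 3])

# IQR (boxplot) rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
print(income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)])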


Data Integration

❏ Combining data from multiple sources


❏ Multiple sources: Relational Database, NOSQL database, flat files etc.
❏ Schema integration: Integrating metadata from different sources
❏ metadata: data that describes the data; data descriptors
❏ Maintaining data dictionary
❏ data-dictionary: explanation of each of the data features
❏ Detecting and resolving data value conflicts
❏ values from different sources might be different
❏ determine the differences and deviation
❏ possible reasons: different representations, different scales, different time
zones
Handling Redundant Data

❏ Redundant data often occur when integrating multiple databases


■ The same attribute may have different names in different databases
■ One attribute may be a “derived” attribute in another table, e.g., annual
revenue

❏ Redundant data can be detected by correlation analysis

❏ Careful integration of the data from multiple sources will help to avoid or
reduce the redundant data and inconsistencies.

❏ Redundancies removal improves data mining speed and quality


Data Type Portability

SAX = Symbolic Aggregate Approximation

DWT = Discrete Wavelet Transform

DFT = Discrete Fourier Transform

MDS = Multidimensional Scaling

Reference: Charu C. Aggarwal, Data Mining: The Textbook


Assignment 2 (due 2 weeks - June 19, 2021)

1. For the given dataset, do the following data cleaning tasks. Use Python as the
programming language. Perform the tasks in your Jupyter notebook.
a. Load the data as a pandas dataframe
b. Find the number of missing values in each column
c. Delete rows with missing values
d. Fill the missing values with the mean
e. Fill the missing values with the median
f. Find outliers in the data
g. Use linear regression to remove and replace outliers
Data Transformation

❏ Data Transformation → Process in which data is consolidated or transformed
into other standard forms that are suitable for data mining and/or for applying
known machine learning algorithms for predictive analytics
❏ Normalization: the process of scaling individual samples to have unit norm
❏ What is a vector norm?

❏ Normalization Techniques:
❏ Decimal Scaling
❏ Min-Max scaling
❏ Z-score Normalization/Standardization

❏ Feature generation: Create new features from the given features


Why Normalization ?

❏ Speeds up some learning techniques (e.g., neural networks)

❏ Helps prevent attributes with large ranges from outweighing attributes with small ranges

❏ Example:
❏ income has range 3000-200000
❏ age has range 10-80
❏ gender has domain M/F
Normalization Techniques

❏ Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

❏ Min-Max Normalization: v' = (v - min) / (max - min), which maps v into [0, 1]

❏ Z-score Normalization/Standardization: v' = (v - mean) / standard deviation (see the sketch below)
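
A minimal numpy/scikit-learn sketch of the three techniques; the values are illustrative.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

v = np.array([200.0, 300.0, 400.0, 600.0, 986.0])

# Decimal scaling: v' = v / 10^j with j the smallest integer so that max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
print(v / 10 ** j)                               # [0.2 0.3 0.4 0.6 0.986]

# Min-max normalization to [0, 1]
print((v - v.min()) / (v.max() - v.min()))

# Z-score normalization / standardization
print((v - v.mean()) / v.std())

# The same min-max and z-score transforms via scikit-learn
X = v.reshape(-1, 1)
print(MinMaxScaler().fit_transform(X).ravel())
print(StandardScaler().fit_transform(X).ravel())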
Variable Transformation

❏ Simple mathematical formulation is applied to each value individually


❏ If x is a variable, then examples of such transformations include x^k,
log x, e^x, √x, 1/x, sin x, or |x|.
❏ In statistics, variable transformations, especially sqrt, log, and 1/x, are
often used to transform data that does not have a Gaussian (normal)
distribution into data that does.
❏ Variable transformations should be applied with caution since they
change the nature of the data.
❏ To help clarify the effect of a transformation, it is important to ask
questions such as the following:
❏ Does the order need to be maintained?
❏ Does the transformation apply to all values, especially negative
values and 0?
❏ What is the effect of the transformation on the values between 0
and 1?
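
A small numpy illustration of such transformations and of the "does the order need to be maintained?" question; the values are made up.

import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])   # strongly skewed positive values

print(np.log10(x))    # log compresses large values: [0. 1. 2. 3.]
print(np.sqrt(x))     # square root gives a milder compression
print(1.0 / x)        # 1/x reverses the order of positive values, so order is NOT maintained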
Sampling

❏ Used for selecting subset of data objects to be analyzed


❏ Sampling is done because it is too expensive or time consuming to process all the
data
❏ A sample is representative if it has the same properties as the original dataset.

❏ Choosing an appropriate sample size and sampling technique is necessary to
obtain a representative sample with high probability.
Sampling Approaches

❏ Simple random sampling:


❏ equal probability of selecting any item
❏ Two types:
❏ Sampling without replacement
❏ sampling with replacement
❏ Simple random sampling doesn’t properly represent an imbalanced
dataset
❏ Stratified Sampling:
❏ Sampling is done on a prespecified group of objects
❏ Each group is created so that the data is balanced in each group
❏ Equal number of data points are taken from each group
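
A minimal pandas sketch contrasting simple random and stratified sampling on a hypothetical imbalanced dataset (95% of records in class 0, 5% in class 1).

import pandas as pd

df = pd.DataFrame({"x": range(1000),
                   "label": [0] * 950 + [1] * 50})

# Simple random sampling (without replacement) may under-represent the rare class
simple = df.sample(n=100, random_state=0)
print(simple["label"].value_counts())

# Stratified sampling: take the same number of records from each group
stratified = df.groupby("label", group_keys=False).sample(n=50, random_state=0)
print(stratified["label"].value_counts())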
Loss of information with Sampling
Sampling Approaches

❏ Proper sampling size is difficult to obtain


❏ In that case, we use Progressive or adaptive sampling techniques
❏ Start with a small sample size and increase the size until a sample of
sufficient size has been obtained

❏ In predictive modeling, we analyze the model error/inaccuracy with an initial
set of sample points, then increase the number of samples and observe the
inaccuracy drop. Beyond some point, adding more samples brings little or no
further improvement; that point gives the optimal sample size.
Dimensionality Reduction

❏ As the name suggests, dimensionality reduction is the method of reducing the


dimensionality or the number of features in the dataset.
❏ E.g., consider a large collection of documents. To process such data, each
word is embedded as a vector, and the resulting data can have thousands of dimensions.
❏ In that case, data becomes increasingly sparse and it is very difficult to
analyze.
❏ Many data mining algorithms work better if the dimensionality is lower. e.g.
k-means clustering.
❏ Dimensionality reduction helps to understand the model better as it has fewer
attributes.
❏ Reducing irrelevant features reduces noise in the dataset.
The Curse of Dimensionality

❏ As the dimensionality increases, the data becomes sparse and difficult to


analyze. This is called the Curse of dimensionality.
❏ For classification, this can mean that there are not enough data objects to
allow the creation of a model that reliably assigns a class to all possible
objects.
❏ For clustering, the definitions of density and the distance between points,
which are critical for clustering, become less meaningful.
Linear Algebra Techniques for Dimensionality Reduction

1. Principal component analysis (PCA)


2. Singular Value Decomposition (SVD)

Principal Component Analysis:

❏ Used for dimensionality reduction in continuous data


❏ Given N data vectors in k dimensions, find c <= k orthogonal vectors that
best represent the data

Please follow attached notebooks for more on dimensionality reduction


Principal Component Analysis

X1, X2: original axes (attributes)
Y1, Y2: principal components; Y1 is the significant component (high variance)

[Figure: data points in the X1-X2 plane with the principal component axes Y1 and Y2 overlaid]

Order principal components by significance and eliminate weaker ones
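
A short scikit-learn sketch of PCA on hypothetical, strongly correlated 2-D data, keeping only the significant component.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(0, 5, 200)
X = np.column_stack([x1, 0.8 * x1 + rng.normal(0, 1, 200)])  # X2 is mostly a rescaled X1

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)              # Y1 carries most of the variance

X_reduced = PCA(n_components=1).fit_transform(X)  # keep only the significant component
print(X_reduced.shape)                            # (200, 1)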


Measures of Similarity and dissimilarity

❏ Similarity measurement is used in many data mining techniques


such as clustering, nearest neighbor classification and anomaly
detection.
❏ Once the similarity and dissimilarity measurement is known, we
may not even need the initial dataset
❏ data is transformed to similarity/dissimilarity (proximity) space
and analyzed
❏ How to measure the proximity between objects having only one
attribute ?
❏ How to measure the proximity between objects having multiple
attributes ?
Proximity Measures between two objects

❏ Similarity:
❏ numerical measure of the degree to which two objects are alike
❏ similarity = 0 → no similarity
❏ similarity = 1 → complete similarity
❏ dissimilarity:
❏ numerical measure of the degree to which two objects are different
❏ distance is sometimes used as a synonym for dissimilarity, but distance
refers only to a special class of dissimilarities.
❏ dissimilarities often fall in the interval [0,1], but sometimes range over
[0,∞)
❏ Similarity → dissimilarity
❏ If the similarity and dissimilarity fall in the interval [0,1], similarity = 1-
dissimilarity OR
❏ Similarity is the negative of dissimilarity
Proximity Measures for single attribute

❏ If the attribute is nominal:


❏ Nominal attributes only convey information about the distinctness of
objects
❏ Two objects either match or they don't
❏ if the values match: similarity = 1, dissimilarity = 0; otherwise:
similarity = 0, dissimilarity = 1
❏ If the attribute is ordinal:
❏ Information about order should be taken into account
❏ e.g. quality of a product (poor, fair, ok, good, wonderful)
❏ If product 1 is wonderful and P2 is good, they are more similar than if P1
is wonderful and P2 is fair
❏ Each ordinal value is mapped to a successive integer and the difference
between the two gives the measure of dissimilarity.
❏ (poor, fair, ok, good, wonderful) → (1, 2, 3, 4, 5), so for P1 = wonderful and
P2 = good, d(P1, P2) = 5 - 4 = 1
❏ If we want d(P1, P2) to fall in [0,1] → d(P1, P2)/(len(order) - 1) → 1/4
❏ Is it fair to give equal weight to each order ?
Proximity Measures for single attribute

❏ If the attribute is interval or ratio:


❏ dissimilarity between two objects = absolute difference between their
values
❏ dissimilarity range from 0 to infinity
❏ E.g., the difference between current weight and weight 6 months ago is expressed
in absolute numbers

Reference: Introduction to Data Mining , Tan, Steinbach , Kumar


Dissimilarities Measures between Data Objects

We consider the following measurement of dissimilarities

● Minkowski Distance: Given data x and y, the Minkowski distance is given
by the following formula:

d(x, y) = ( sum_{k=1..n} |x_k - y_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions and x_k and y_k are the
k-th attributes of x and y.

-- Named after Hermann Minkowski, a German mathematician


Dissimilarities Measures between Data Objects

● Manhattan Distance (Taxicab distance or City block distance): Given
data x and y, the Manhattan distance is calculated as:

d(x, y) = sum_{k=1..n} |x_k - y_k|

where n is the number of dimensions and x_k and y_k are the k-th attributes of x
and y.

● r = 1 in Minkowski distance
● Also called L1 norm
● It is the distance a car would drive in a city (e.g., Manhattan)
Dissimilarities Measures between Data Objects

● Euclidean Distance: Given data x and y, the Euclidean distance is
calculated as:

d(x, y) = sqrt( sum_{k=1..n} (x_k - y_k)^2 )

where n is the number of dimensions and x_k and y_k are the k-th attributes of x
and y.

● r = 2 in Minkowski distance
● Also called L2 norm
Dissimilarities Measures between Data Objects

● Supremum Distance: Given data x and y, the Supremum distance is
calculated as:

d(x, y) = max_{k=1..n} |x_k - y_k|

where n is the number of dimensions and x_k and y_k are the k-th attributes of x
and y.
● This is the maximum absolute difference over any single attribute, e.g. ||(1, −4, 5)||∞ =
max{|1|, |−4|, |5|} = 5
● r → ∞ in Minkowski distance
● Also called Lmax or L∞ norm
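
A minimal numpy/scipy sketch of these distances, reusing the vector (1, -4, 5) from the supremum example and the origin as the second point.

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, -4.0, 5.0])
y = np.zeros(3)

print(np.sum(np.abs(x - y)))             # Manhattan / L1 distance  -> 10.0
print(np.sqrt(np.sum((x - y) ** 2)))     # Euclidean / L2 distance  -> ~6.48
print(np.max(np.abs(x - y)))             # Supremum / L-infinity    -> 5.0

# The same values as special cases of the Minkowski distance
print(distance.minkowski(x, y, p=1))     # r = 1
print(distance.minkowski(x, y, p=2))     # r = 2
print(distance.chebyshev(x, y))          # r -> infinity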
Dissimilarities Measures between Data Objects

Reference: Introduction to Data Mining, Tan, Steinbach and Kumar


Properties of distance

● Positivity: d(x, y) >= 0 for all x and y, and d(x, y) = 0 only if x = y
● Symmetry: d(x, y) = d(y, x) for all x and y
● Triangle inequality: d(x, z) <= d(x, y) + d(y, z) for all points x, y and z
● Measures that satisfy all three properties are known as metrics.


Do all dissimilarities satisfy all three properties?

❏ No - Such are called Non-metric dissimilarities


❏ E.g. Set differences
❏ A = {1, 2, 3, 4}, B = {2, 3, 4}, A - B = {1}, B - A = ∅

❏ If we define d(A,B) = size(A-B) , then the symmetry property is


violated
❏ This also violates triangle inequality property
❏ However if we define d(A,B) = size(A-B) + size(B-A) then symmetry
and triangle inequality property is preserved
Do all dissimilarities satisfy all three properties?

❏ Example 2: Time difference


❏ Let us say the distance between two times of day is measured as the number of
hours from the first time until the next occurrence of the second time on a
24-hour clock

❏ Then d(1, 2) = 1, but d(2, 1) = 23


❏ violates symmetry property and triangle inequality property
Similarities Measures between Data objects

❏ For similarity measures, the triangle inequality typically does not hold, but
positivity and symmetry typically do.

❏ There is no general analog of triangle inequality for similarity measures.


However, there are some similarity measures where the triangle inequality
holds. E.g. Cosine and Jaccard similarity measures
❏ In some similarity measures even the symmetry property does not hold. For
example, consider the confusion matrix of a classifier that distinguishes the
digit "0" from the letter "o": if we take the counts in the matrix as the
similarity, then s(0, o) = 40 but s(o, 0) = 30, which is not symmetric.
❏ Hence a symmetric similarity is constructed as:
s'(x, y) = s'(y, x) = (s(x, y) + s(y, x)) / 2

             Predicted 0    Predicted o
Actual 0         160             40
Actual o          30            170
Similarity Measures for Binary Data
❏ Simple Matching Coefficient:
❏ SMC = (f_11 + f_00) / (f_00 + f_01 + f_10 + f_11), where f_ab is the number
of attributes for which the first object has value a and the second has value b
❏ For the example data,


f_{00} = 0, f_{01} = 2, f_{10} = 1, f_{11} = 2 ,
SMC = ⅖ = 0.4
❏ This measure counts both presence and absence equally
Similarity Measures for Binary Data
❏ Jaccard Similarity Coefficient:
❏ J = f_11 / (f_01 + f_10 + f_11), i.e., 0-0 matches are ignored
❏ Used when the given binary attributes are asymmetric
❏ E.g. In a shop, items purchased by customer =1 and items not
purchased = 0. In that case, for each transaction 0 outnumbers 1. It
means f_{00} will be very high compared to f_{11}. If we use SMC,
this doesn’t capture the similarity measure truly.
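
A small numpy sketch contrasting SMC and Jaccard on two made-up binary vectors dominated by 0-0 matches, as in the market-basket example above.

import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 0])   # items purchased by customer 1
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])   # items purchased by customer 2

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f11 + f00 + f10 + f01)   # counts 0-0 matches as agreement -> 0.8
jaccard = f11 / (f11 + f10 + f01)             # ignores 0-0 matches             -> 0.33
print(smc, jaccard)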
Cosine Similarity

❏ Used for non binary data


❏ Used for Document data
❏ If we want to measure similarity between two documents, the document is
represented by a vector, where each attribute is the frequency of the word
occurred in the document.
❏ Document processing also ignores certain common words
❏ Since the vectors are non-binary and can be sparse, we use cosine similarity
to measure the similarity between two documents.

Reference: Introduction to data mining, Tan, Steinbach and Kumar


Cosine Similarity

Cosine similarity is the cosine of the angle between x and y:

cos(x, y) = (x . y) / (||x|| ||y||)

Example: see the sketch below.
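
A minimal sketch of cosine similarity on two hypothetical term-frequency vectors.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])   # term frequencies of document 1
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])   # term frequencies of document 2

# cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos)                                      # ~0.31

print(cosine_similarity(d1.reshape(1, -1), d2.reshape(1, -1)))  # same value via scikit-learn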


Follow lecture videos, text-book and shared jupyter notebooks for examples and
other details
