
International Journal of Computer Applications Technology and Research
Volume 11, Issue 06, 231-235, 2022, ISSN: 2319-8656

DOI: 10.7753/IJCATR1106.1008

Data Preparation for Machine Learning Modelling

Ndung’u Rachael Njeri


Information Technology Department
Murang’a University of Technology
Murang’a, Kenya

Abstract: The world today is in Revolution 4.0, which is data-driven. The majority of organizations and systems use data to solve
problems through digitized systems. Data lets intelligent systems and their applications learn and adapt to mined insights without
being explicitly programmed. Data mining and analysis require smart tools, techniques and methods capable of extracting useful
patterns, trends and knowledge, which organizations can use as business intelligence as they map their strategic plans. Predictive
intelligent systems can be very useful in various fields as solutions to many existential issues. Accurate output from such predictive
intelligent systems can only be ensured by having well-prepared data that suits the predictive machine learning function. Machine
learning models learn from data input following the 'garbage-in-garbage-out' principle: cleaned, pre-processed and consistent data
produces accurate output, whereas inconsistent, noisy and erroneous data does not.

Keywords: Data Preparation; Data pre-processing; Machine Learning; Predictive models

1. INTRODUCTION

The world is witnessing a fourth industrial revolution, which is fast-paced due to technological evolution and advancement. Today, digital systems are experienced in all spheres of industry, including but not limited to healthcare, education, manufacturing, entertainment, and telecommunication, where there is a wealth of data. These digital systems have become sources of massive data, from which insights can be extracted and analyzed for new patterns and new knowledge that may be useful in building various smart applications in the pertinent domains.

2. Data Pre-processing
Data pre-processing is an important step while developing smart systems or while extracting meaningful insights using machine learning. Data processing is sometimes used interchangeably with data preparation; however, data processing is inclusive of both data preparation and feature engineering, whereas data preparation excludes feature engineering [4]. Before data preparation, there is usually a need to understand the output you require from the machine model to be trained, and hence the subsequent data attributes that will shape the output. With the output in mind, the data to be collected is easily identifiable, and thus its quality and value requirements defined. This problem articulation ascertains that the right steps of data preparation are followed.

Data pre-processing involves data cleaning, which covers removal of 'dirt' or noise in data and removal of missing or inconsistent data; data integration, if data is sourced from multiple sources; data transformation, depending on the type of raw data, into what the machine learning algorithms can use as inputs; and data reduction, where unnecessary data is removed and only the data required to develop an application is retained [5]. Data pre-processing makes sure that the data types to be used in machine learning functions are transformed, a requirement imposed by some machine learning algorithms, some of which have non-linear relationships that complicate how the algorithms function [6].

2.1 DATA PREPARATION

Data preparation is the process of converting raw data through pre-processing before it is used in fitting and evaluating machine learning predictive systems [6]. Machine learning models are particular to their data source, and hence the credibility of the data source and the utility of the data collected are essential. It is plausible for a machine learning model to be a high-end model, but training it with the wrong data yields the wrong information. Machine learning models operate on the "garbage in, garbage out" philosophy, and data scientists ensure the "garbage in" remains relevant for the resultant information to be relevant. Standardizing your data entry point ensures the right information is attained as the end result. For these reasons, data collection remains an imperative part of data preparation.

Data preparation ascertains minimal errors in your data, and allows for monitoring of any future errors. This will eventually ensure the machine learning model is trained with the correct data, and hence the output will be accurate. Exploratory data analysis will provide a summary of your data set, and allow for necessary changes or formatting to be done. Any data source in machine learning is divided into training data and test data, and the technique of this division is applied during data preparation. Additionally, data preparation helps in shaping the data to fit the requirements of the machine learning model.

Some data sets have attributes that are not well ordered for analysis. Other times, the ranges in the data sets to be compared vary largely, resulting in comparison challenges. Data transformation allows such data sets to be transformed into good representations of the initial data source, without losing data relevancy or data integrity. Some training models accept input data only in certain formats, necessitating data transformation.

In an era of big data, there is a need to create better storage techniques, and oftentimes this is costly, both in storing the big data and in analyzing it. Big data analytics require complex software, which is expensive. Data reduction comes in handy in compressing data into more manageable volumes while retaining its relevance and integrity. Additionally, the reduced volumes can be used in computations as a representation of the whole data set with trivial to zero impact on the initial data source and the output of the model.
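The division of a data source into training and test data mentioned above can be sketched in standard-library Python; the helper name, the 80/20 ratio and the fixed seed below are illustrative assumptions, and in practice a library utility such as scikit-learn's train_test_split would typically be used.

```python
import random

def split_train_test(records, test_ratio=0.2, seed=42):
    # Shuffle a copy so the original collection order introduces no bias,
    # then hold out the last test_ratio portion as the test set.
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

samples = list(range(100))        # stand-in for 100 collected samples
train, test = split_train_test(samples)
print(len(train), len(test))      # 80 20
```

Fixing the seed makes the split reproducible, which matters when the same division must be reused across experiments.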

www.ijcat.com 231

Data reduction reduces the overall cost of data analysis, and saves on the time that would otherwise have been spent in future data processing.

The four main steps of data preparation are data collection, data cleaning, data transformation and data reduction.

2.2 DATA COLLECTION
Data collection is the initial stage of data preparation, and it involves deciding on the data set depending on the expected output of the machine model to be trained. Essentially, collection of the right data set ascertains the right data output. Data collection consists of data acquisition, data labelling, data augmentation, data integration and data aggregation.

2.2.1 Data acquisition
Data acquisition involves identifying the data source, defining the methodology of collecting the data, and converting the collected data into digital form for computation. The data source can be primary, where data is obtained straight from the persons, objects or processes being studied. When your data source is a party that had previously collected the data, it is termed a secondary source. The methodology of data collection varies depending on the expected output. Statistical tools and techniques are applied in the collection of both qualitative and quantitative data.

2.2.2 Data labelling
As machine learning advances, there is development of deep learning techniques which have automated the generation of features from data sets, and hence the requirement of high volumes of labelled data [7]. Data labelling is the process through which data models are trained through tagging of data samples. For instance, if a model is expected to tell the difference between images of cats and dogs, it will initially be introduced to images of cats and dogs which are tagged as either cats or dogs. This is done manually, though often with the aid of software. This part of supervised learning allows the model to form a basis for future learning. The initial formation of a pattern in both the input and output data defines the requirements of the data to be collected. Therefore, before data collection is initialized, there is a need to delineate the data parameters and the intended information to be retrieved from the data.

2.2.3 Data augmentation
Data augmentation is a data preparation strategy used to increase data diversity for deep learning model training [8]. It involves construction of iterative optimization with the aim of developing new training data from already existing data. It allows for the introduction of unobserved data, or of variables that are inferred through mathematical models [9]. While not always necessary, it is essential when the data to be trained on is complex and the available volume of sampled data is small. Data augmentation solves the problems of limited data and model overfitting [10].

2.2.4 Data aggregation
Data aggregation is a technique of reducing the volume of data through grouping. This grouping is usually of a single attribute. For instance, when one has a data set with the attribute time organized in days over a given time series, one can aggregate the data into monthly groups, which eases dealing with the time attribute. It aids in reducing the broadness of a given attribute without tangible losses during future data manipulation [10].

2.3 DATA CLEANING
Data cleaning, also referred to as data cleansing, is the technique of detecting and correcting errors and inaccuracies in the collected data [11]. Data is supposed to be consistent with the input requirement of the machine learning model. The main activities in data cleansing involve the fine-tuning of noisy data and dealing with missing data. It aids in ensuring the collected data set is comprehensive and that any errors and biases that may have arisen in data collection have been eliminated. This includes the detection of outliers within the data set, for both numerical and non-numerical data sets.

2.3.1 Exploratory Data Analysis
In this stage, exploratory data analysis (EDA) is used; it is a technique that aims at understanding the characteristics and attributes of the data sets [12]. It aids the data scientist in becoming more familiarized with the data collected. In exploratory data analysis, statistical tools and techniques are applied in building hypotheses on the information that can be attained from the collected data, and it sometimes involves data visualization. Data visualization allows for the understanding of data properties such as skewness and outliers.

Exploratory data analysis is mainly done on statistical manipulation software. The graphical techniques allow for understanding the distribution of the data set and the statistical summary of all attributes. EDA informs future decisions, such as the data cleansing techniques to be used, what data transformations are necessary, and whether data reduction is necessary and, if so, what technique to use. Exploratory data analysis is a continuous process all through data preparation.

2.3.2 Missing Data
While it is important to ascertain during data collection that all the attributes of the data sets have their real values collected, data sometimes has some attributes with missing values, which makes it hard to use as input in machine learning models. As such, different techniques have been outlined on how to deal with missing data. Data manipulation platforms such as Python and R have some of these techniques for dealing with missing data embedded in them. The best technique usually varies with the data set, and hence after data assessment in the exploratory data analysis, one can easily select the best technique for missing data imputation.

2.3.2.1 Deductive Imputation
Deductive imputation follows the basic rules of logic, and is hence the easiest imputation, though the most time consuming. Even so, its results are usually highly accurate. For instance, if student data indicates that the total number of students is 10 and the total number of examination papers is 10, but there is a paper with a missing name and John has no marks recorded, logic dictates the nameless paper is John's. However, deductive imputation is not applicable to all types of data sets [13].

2.3.2.2 Mean/Median/Mode Imputation
This imputation uses statistical techniques where the measures of central tendency within a certain attribute are computed and the missing values are replaced with the computed measure of central tendency, be it the mean, mode or median of that attribute [13]. This technique is applied in numerical data sets, and its impact on the output or later computations is trivial.
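The mean imputation described above can be illustrated with a small standard-library Python sketch; the helper name and the sample scores are made-up for demonstration, and in practice one would use the embedded routines the text mentions (for example pandas' fillna in Python, or equivalent functions in R).

```python
from statistics import mean

def impute_with_mean(values):
    # Compute the mean over the observed (non-missing) entries only,
    # then substitute it for every missing entry (represented as None).
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

scores = [4.0, None, 6.0, 8.0, None]   # hypothetical attribute with gaps
print(impute_with_mean(scores))        # [4.0, 6.0, 6.0, 8.0, 6.0]
```

Replacing missing values with the median instead is a one-line change (statistics.median), and is often preferred when the attribute is skewed.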


2.3.3 Noisy Data
Presence of noisy data can have a substantial effect on the output of a machine model. It negatively impacts prediction of information, ranking of results, and the accuracy of clustering and classification [14]. Noisy data includes unnecessary information in the data, redundant data values, and duplicate or pointless data values. These result from faultiness in the collection of data, problems in data entry, problems that occur from the data transfer techniques applied, uneven naming conventions in the data, and sometimes technology restrictions, as in the case of unstructured data. Noisy data is eliminated through the following methods.

2.3.3.1 Binning Method
This involves arranging data into groups of given intervals, and is used in smoothing ordered data. The binning method relies on the measures of central tendency and is done in one of three ways: smoothing by bin means, smoothing by bin medians, and smoothing by bin boundaries.

2.3.3.2 Regression
Linear regression is a statistical and supervised machine learning technique that predicts particular data based on existing data [15]. Simple linear regression is used to compute the best line of fit based on existing data, and hence outliers in the data can be identified. To attain the best line of fit, a regression function is developed based on the previously collected data. However, it is important to note that though in some data sets extreme outliers are considered noisy data, the outliers can be essential to the model.

For instance, if an online retailer has its market within countries in Europe and a trivial market in the United States, the United States may be considered an extreme outlier, and hence noisy data. However, a machine learning model may realize that though a very small number of Americans use the online platform, they bring in more revenue than some of the countries in Europe. Simple linear regression uses one independent variable, whereas multiple linear regression uses more than one independent variable in its computations.

2.3.3.3 Clustering
Clustering is in the unsupervised machine learning category, and it operates by grouping the collected data set into clusters based on their attributes [16]. In clustering, the outliers in the data may fall within the clusters, and in the case that they are extreme outliers, they fall outside the clusters. To understand the effect of clustering, data visualization techniques are used. "Clustering methods don't use output information for training, but instead let the algorithm define the output" [17]. There are different techniques used in clustering.

In K-means clustering, K is the number of clusters to be made, and to do this the algorithm randomly selects K data points from the data set. These K data points are called the centroids, and every other data point in the data set is assigned to the closest centroid. This process is repeated for the new clusters created, and iterated until the centroids become constant, or fairly constant; this is the point at which convergence occurs. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is used in data set smoothing.

2.4 DATA TRANSFORMATION
Data transformation involves shifting the cleansed data from one format to another, from one structure to another, or changing the values in the cleansed data set to meet the requirements of the machine learning model [18]. The simplicity of the data transformation is highly dependent on the required input data and the available data set. Data transformation involves the following techniques.

2.4.1 Normalization
Normalization is a data transformation technique applied to numeric values of columns when there is need for a common scale. This transformation is achieved without loss of information, only changing how it is represented. For instance, in a data set with two columns that have different scales, such as one with values ranging from 100 to 1,000 and another column with a value range of 10,000 to 1,000,000, a difficulty may arise in the event that the two columns have to be used together in machine learning modelling. Normalization finds a solution by representing the same information without loss of distribution or ratios from the initial data set [19].

It is imperative to note that while normalization is sometimes only necessitated by the nature of the data sets, other times it is demanded by the machine learning algorithms being used. Normalization uses different mathematical techniques, such as the z-score in data standardization. The technique picked is usually decided depending on the nature and characteristics of the dataset; therefore, it is decided at the exploratory data analysis stage.

2.4.2 Attribute selection
In this transformation, latent attributes are created based on the available attributes in the data set to facilitate the data mining process [18]. The latent attributes created usually have no impact on the initial data source, and therefore can be ignored afterwards. Attribute transformation usually facilitates classification, clustering and regression algorithms. Basic attribute transformation involves decomposition of the available attributes through arithmetic or logical operations. For instance, a data set with a time attribute given in months can have its month attribute decomposed to weeks, or aggregated to years, depending on the requirements.

2.4.3 Discretization
In data transformation by discretization, intervals or labels are created, and all data points are eventually mapped to the created intervals or labels. The data in question is customarily numeric. There are different statistical techniques used in the discretization of data sets. The binning method is used on ordered data: data intervals called bins are created, and all the data points are mapped into them. In data discretization by histogram analysis, histograms are used to divide the values of the attribute into disjoint ranges to which all other data points are mapped. Both binning and histogram analysis are unsupervised data discretization methods.

In data discretization by decision tree analysis, the algorithm picks the attribute with the minimum entropy, and uses its minimum value as the point from which, in iterations, it partitions the resulting intervals until it attains as many different groups as possible [20]. This discretization is hierarchical, hence its name. To use an analogy, it is like dividing a room into two equal parts, and continuously dividing the resulting partitions into two other equal parts; only in this case, the room has multi-varied contents and we want each different content in its own space at the end of the partitioning. This discretization technique uses a top-down approach and is a supervised algorithm.
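As a minimal illustration of the binning flavour of discretization described above, the sketch below maps numeric values to equal-width intervals; the three-bin choice, the helper name and the sample ages are assumptions for demonstration (pandas' cut performs the same operation).

```python
def equal_width_bins(values, n_bins):
    # Split the [min, max] range into n_bins equal-width intervals and
    # return, for each value, the index of the interval it falls into.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        labels.append(idx)
    return labels

ages = [3, 7, 12, 18, 25, 33, 41, 60]
print(equal_width_bins(ages, 3))    # [0, 0, 0, 0, 1, 1, 2, 2]
```

Equal-width binning is unsupervised, as the text notes: the interval boundaries depend only on the attribute's range, not on any output label.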


Data discretization by correlation analysis is highly dependent on mathematical tools and applies a bottom-up approach, unlike decision trees [20]. It maps data points to data intervals by finding the best neighbouring interval for each data point and merging the intervals. It then recursively repeats the process to create one large interval. It is a supervised machine learning methodology.

2.4.4 Concept Hierarchy Generation
In concept hierarchy data transformation, low-level concepts within the attributes are mapped to higher-level concepts [21]. Most of these concepts are normally implied in the initial data set, and hence the technique is embedded in statistical software. It follows a bottom-up approach. For instance, in the location dimension, cities can be mapped to their states, their provinces, their countries and eventually their continents.

2.5 DATA REDUCTION
With the advancement of trends in information technology and the exponential growth of the internet of things, there has been an eventual precipitous increase in the volumes of available data. This is a huge benefit to machine learning, as the availability of big data for training the models ascertains accuracy in the information output from such models. Nonetheless, handling and analyzing these enormous volumes of data is a big challenge, hence the need for data reduction techniques. Data reduction reduces the cost of analyzing and storing these volumes of data by increasing storage efficiency. The different techniques used in data reduction include the following.

2.5.1 Data cube aggregation
A data cube is an n-dimensional array that uses mathematical tensors to represent information. The online analytical processing (OLAP) cube stores data in a multidimensional form, which occupies less storage space compared to a unidimensional storage technique [22]. To access data from the OLAP cube, the Multidimensional Expressions (MDX) query language is used. The query language includes the roll-up, drill-down, slice and dice, and pivot operations. These operations allow access to the required attributes of the data from the cube without removing the data from the data cube, hence saving on space.

2.5.2 Attribute subset selection
Attribute subset selection, also known as feature selection, is a part of feature engineering, and it involves the discovery of the smallest possible subset of attributes that would yield the same results, or closest to the same results, on data mining as when using all the attributes [23]. This technique ensures that only what is completely necessary from the initial data set is used in the modelling. This simplifies detection of insights, patterns and information from the data set while saving on analysis and storage costs.

2.5.3 Numerosity reduction
In numerosity reduction, data is reduced and made feasible for analysis through replacement of the original data with a model of the data that preserves the integrity of the initial data [24]. Two statistical methods are used in the creation of the representational model. In the parametric method, regression and log-linear methods are used in the development of the representational model. Non-parametric methods encompass the use of clustering, sampling, histograms and data cube aggregation to represent the whole data population during computations and storage.

3. POSSIBLE BIASES IN DATA PREPARATION
Bias in the data to be trained in the machine learning model leads to consequent wrong information output. It is imperative to identify the source of any bias in your data set during data preparation and eliminate it [25]. Sample bias occurs at data collection, where the selected data sample is not the right representation of the population under study; hence it is also called selection bias. For instance, an iris scan recognition system trained entirely on the iris scans of Africans will not efficiently identify the eyes of the white population.

Exclusion bias is common in the data cleansing stage, where there is deletion or misrepresentation of a part of the data, leading to it being excluded from the model training. Measurement bias occurs during data collection, where the system of collecting input data is not the same as that of collecting output data. Additionally, it occurs during data labelling, where non-uniform data labelling results in faulty predictions from the machine learning model. Recall bias also occurs at the data labelling stage, where the labelling is inconsistent [25].

Observer bias is a data fallacy where the person dealing with the data assumes the observation to be what they expected, as opposed to the real observation. Data scientists and researchers are encouraged to operate on an objective rather than subjective approach to avoid this bias [19]. Another is racial bias; the best example of this bias is in talk-back engines, where the model was largely trained on the voice data of the white population, and hence it hardly recognizes the voices of the black population [19]. Association bias occurs when a data set has created an implicit association between attributes. The main association bias is gender bias, as in the case where a system is trained with all school principals being males, and hence eventually disqualifies the plausibility of a female school principal [25].

4. CONCLUSION
Many machine learning predictive systems and models are affected by the kind of data that is used as input to the models. Results of the predictive models are determined by the machine learning algorithm function and the kind of data input. Biased data will produce biased results. Equally, 'dirty' data will produce wrong results, or output that cannot be relied upon.

It is imperative to have clean data to fit into the machine learning models so as to have the models learn correctly and predict accurately. There is a high chance that inaccurate results from machine learning models are caused by improperly prepared input data. Therefore, for ensuring the explainability and reliability of machine learning predictive models that are used to develop intelligent systems, clean, prepared data is significant.


Digital data sources such as the internet of things, which is a major source of real-world data, have noisy, inconsistent and missing data, which when used in predictive modelling with machine learning functions can result in erroneous and inaccurate results. The need to remove such inconsistencies in input data cannot be overemphasized. Clean data, formatted and organized to the required standard of the machine learning function, goes a long way in contributing towards better machine learning models with reliable results. There is more to data preparation than has been included in this work. In future, we look to define different types of data and their various pre-processing methods.

5. ACKNOWLEDGMENTS
My thanks to all authors whom I have referenced below for their research works, which were insightful and helped to compile the above findings.

6. REFERENCES
[1] applications and research directions." SN Computer Science 2, no. 3 (2021): 1-21.
[2] Altexsoft. (2018, June 16). Preparing Your Dataset for Machine Learning: 8 Basic Techniques That Make Your Data Better. Retrieved on July 29, 2020 from: https://www.altexsoft.com/blog/datascience/preparing-your-dataset-for-machine-learning-8-basic-techniques-that-make-your-data-better/
[3] Bengfort, B., & Kim, J. (2016). Data analytics with Hadoop: an introduction for data scientists. O'Reilly Media, Inc.
[4] El-Amir, H., & Hamdy, M. (2020). Data Wrangling and Preprocessing. In Deep Learning Pipeline (pp. 147-206). Apress, Berkeley, CA. Retrieved from: https://doi.org/10.1007/978-1-4842-5349-6_
[5] García, S., Luengo, J., & Herrera, F. (2015). Data preprocessing in data mining (Vol. 72, pp. 59-139). Cham, Switzerland: Springer International Publishing.
[6] Brownlee, J. (2020). Data preparation for machine learning: data cleaning, feature selection, and data transforms in Python. Machine Learning Mastery.
[7] Roh, Y., Heo, G., & Whang, S. E. (2019). A survey on data collection for machine learning: a big data-AI integration perspective. IEEE Transactions on Knowledge and Data Engineering. Retrieved from: https://ieeexplore.ieee.org/abstract/document/8862913
[8] Ho, D., Liang, E., & Liaw, R. (2019, June 7). 1000x Faster Data Augmentation. Berkeley Artificial Intelligence Research. Retrieved on July 29, 2020.
[9] Antoniou, A., Storkey, A., & Edwards, H. (2017). Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340.
[10] Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 60. Retrieved from: https://doi.org/10.1186/s40537-019-0197-0
[11] Murata, K., Noda, H., & Haraguchi, M. (2017). U.S. Patent No. 9,558,151. Washington, DC: U.S. Patent and Trademark Office. Retrieved from: https://patents.google.com/patent/US9558151B2/en
[12] Jebb, A. T., Parrigon, S., & Woo, S. E. (2017). Exploratory data analysis as a foundation of inductive research. Human Resource Management Review, 27(2), 265-276. Retrieved from: https://doi.org/10.1016/j.hrmr.2016.08.003
[13] Van der Loo, M., & de Jonge, E. (2017). deductive: Data Correction and Imputation Using Deductive Methods. R package version 0.1.2.
[14] Gupta, S., & Gupta, A. (2019). Dealing with Noise Problem in Machine Learning Data-sets: A Systematic Review. Procedia Computer Science, 161, 466-474. Retrieved from: https://doi.org/10.1016/j.procs.2019.11.146
[15] Elgabry, O. (2019, March 1). The Ultimate Guide to Data Cleaning. Retrieved on July 27, 2020 from: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
[16] Gupta, A., & Merchant, P. S. (2016). Automated lane detection by k-means clustering: a machine learning approach. Electronic Imaging, 2016(14), 1-6. Retrieved from: https://doi.org/10.2352/ISSN.2470-1173.2016.14.IPMVA-386
[17] Castañón, J. (2019, May 2). 10 Machine Learning Methods that Every Data Scientist Should Know. Retrieved on July 26, 2020 from: https://towardsdatascience.com/10-machine-learning-methods-that-every-data-scientist-should-know-3cc96e0eeee9
[18] Malik, K. R., Ahmad, T., Farhan, M., Aslam, M., Jabbar, S., Khalid, S., & Kim, M. (2016). Big-data: transformation from heterogeneous data to semantically-enriched simplified data. Multimedia Tools and Applications, 75(20), 12727-12747. Retrieved from: https://doi.org/10.1007/s11042-015-2918-5
[19] Microsoft. (2020, April 7). Bias in Machine Learning. Retrieved on July 31, 2020 from: https://devblogs.microsoft.com/premier-developer/bias-in-machine-learning/
[20] Ramírez-Gallego, S., García, S., Mouriño-Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos, A., ... & Herrera, F. (2016). Data discretization: taxonomy and big data challenge. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 6(1), 5-21. Retrieved from: https://doi.org/10.1002/widm.1173
[21] Swamy, M. K., & Reddy, P. K. (2020). A model of concept hierarchy-based diverse patterns with applications to recommender system. International Journal of Data Science and Analytics, 1-15. Retrieved from: https://doi.org/10.1007/s41060-019-00203-2
[22] Shen, H., Zhang, M., & Shen, J. (2017). Efficient privacy preserving cube-data aggregation scheme for smart grids. IEEE Transactions on Information Forensics and Security, 12(6), 1369-1381. Retrieved from: https://ieeexplore.ieee.org/document/7828093
[23] Demisse, G. B., Tadesse, T., & Bayissa, Y. (2017). Data Mining Attribute Selection Approach for Drought Modeling: A Case Study for Greater Horn of Africa. arXiv preprint arXiv:1708.05072.
[24] Deepak, J. (n.d.). Numerosity Reduction in Data Mining. Retrieved on July 25, 2020 from: https://www.geeksforgeeks.org/numerosity-reduction-in-data-mining/
[25] Liam, H. (2020, July 20). 7 Types of Data Bias in Machine Learning. Retrieved on July 31, 2020 from: https://lionbridge.ai/articles/7-types-of-data-bias-in-machine-learning/
