
International Journal of Computer Applications Technology and Research
Volume 11, Issue 06, 231-235, 2022, ISSN: 2319-8656

DOI: 10.7753/IJCATR1106.1008

Data Preparation for Machine Learning Modelling

Ndung’u Rachael Njeri


Information Technology Department
Murang’a University of Technology
Murang’a, Kenya

Abstract: The world today is in Revolution 4.0, which is data-driven. The majority of organizations and systems use data to solve
problems through digitized systems. Data lets intelligent systems and their applications learn and adapt to mined insights without
being explicitly programmed. Data mining and analysis require smart tools, techniques and methods capable of extracting useful
patterns, trends and knowledge, which organizations can use as business intelligence as they map their strategic plans. Predictive
intelligent systems can be very useful in various fields as solutions to many existential issues. Accurate output from such predictive
intelligent systems can only be ensured by having well-prepared data that suits the predictive machine learning function. Machine
learning models learn from data input following the 'garbage-in-garbage-out' principle: cleaned, pre-processed and consistent data
produces accurate output, whereas inconsistent, noisy and erroneous data does not.

Keywords: Data Preparation; Data pre-processing; Machine Learning; Predictive models

1. INTRODUCTION

The world is witnessing a fourth industrial revolution, which is fast-paced due to technological evolution and advancement. Today, digital systems are experienced in all spheres of industry, including but not limited to healthcare, education, manufacturing, entertainment, and telecommunication, where there is a wealth of data. These digital systems have become sources of massive data, from which insights can be extracted and analyzed for new patterns and new knowledge that may be useful in building various smart applications in the pertinent domains.

2. Data Pre-processing
Data pre-processing is an important step while developing smart systems or while extracting meaningful insights using machine learning. Data processing is sometimes used interchangeably with data preparation; however, data processing is inclusive of both data preparation and feature engineering, whereas data preparation excludes feature engineering [4]. Before data preparation, there is usually a need to understand the output you require from the machine model to be trained, and hence the subsequent data attributes that will shape the output. With the output in mind, the data to be collected is easily identifiable, and thus its quality and value requirements defined. This problem articulation ascertains that the right steps of data preparation are followed.

Data pre-processing involves data cleaning, which covers removal of 'dirt' or noise in data and removal of missing or inconsistent data; data integration, if data is sourced from multiple sources; data transformation, depending on the type of raw data, into what the machine learning algorithms can use as inputs; and data reduction, where unnecessary data is removed and only the data required to develop an application is retained [5]. Data pre-processing makes sure that the data types to be used in machine learning functions are transformed, a requirement imposed by some machine learning algorithms, some of which have non-linear relationships that complicate how the algorithms function [6].

2.1 DATA PREPARATION

Data preparation is the process of converting raw data through pre-processing before it is used in fitting and evaluating machine learning predictive systems [6]. Machine learning models are particular to their data source, and hence the credibility of the data source and the utility of the data collected are essential. It is plausible for a machine learning model to be a high-end model, but training it with the wrong data yields the wrong information. Machine learning models operate on the "garbage in, garbage out" philosophy, and data scientists ensure the "garbage in" remains relevant for the resultant information to be relevant. Standardizing your data entry point ensures the right information is attained as the end result. For these reasons, data collection remains an imperative part of data preparation.

Data preparation ascertains minimal errors in your data, and allows for monitoring of any future errors. This will eventually ensure the machine learning model is trained with the correct data, and hence the output will be accurate. Exploratory data analysis will provide a summary of your data set, and allow for necessary changes or formatting to be done. Any data source in machine learning is divided into training data and test data, and the technique of this division is applied during data preparation. Additionally, data preparation helps in shaping the data to fit the requirements of the machine learning model.

Some data sets have attributes that are not well ordered for analysis. Other times, the ranges in the data sets to be compared vary largely, resulting in comparison challenges. Data transformation allows such data sets to be transformed into good representations of the initial data source, without losing data relevancy or data integrity. Some training models accept input data only in certain formats, necessitating data transformation.

In an era of big data, there is a need to create better storage techniques, and oftentimes this is costly, both in storing the big data and in analyzing it. Big data analytics require complex software, which is expensive. Data reduction comes in handy in compressing data into more manageable volumes while retaining its relevance and integrity. Additionally, the reduced volumes can be used in computations as a representation of the whole data set with trivial to zero impact on the initial data source and the output of the model.
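The division of a data source into training and test data mentioned above can be sketched in standard-library Python; the helper name, the 80/20 ratio and the fixed seed below are illustrative assumptions, and in practice a library utility such as scikit-learn's train_test_split would typically be used.

```python
import random

def split_train_test(records, test_ratio=0.2, seed=42):
    # Shuffle a copy so the original collection order introduces no bias,
    # then hold out the last test_ratio portion as the test set.
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

samples = list(range(100))        # stand-in for 100 collected samples
train, test = split_train_test(samples)
print(len(train), len(test))      # 80 20
```

Fixing the seed makes the split reproducible, which matters when the same division must be reused across experiments.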

www.ijcat.com 231

Data reduction reduces the overall cost of data analysis, and saves on the time that would otherwise have been spent in future data processing.

The four main steps of data preparation are data collection, data cleaning, data transformation and data reduction.

2.2 DATA COLLECTION
Data collection is the initial stage of data preparation, and it involves deciding on the data set depending on the expected output of the machine model to be trained. Essentially, collection of the right data set ascertains the right data output. Data collection consists of data acquisition, data labelling, data augmentation, data integration and data aggregation.

2.2.1 Data acquisition
Data acquisition involves identifying the data source, defining the methodology of collecting the data, and converting the collected data into digital form for computation. The data source can be primary, where data is obtained straight from the persons, objects or processes being studied. When your data source is a party that had previously collected the data, it is termed a secondary source. The methodology of data collection varies depending on the expected output. Statistical tools and techniques are applied in the collection of both qualitative and quantitative data.

2.2.2 Data labelling
As machine learning advances, there is development of deep learning techniques which have automated the generation of features from data sets, and hence the requirement of high volumes of labelled data [7]. Data labelling is the process through which data models are trained through tagging of data samples. For instance, if a model is expected to tell the difference between images of cats and dogs, it will initially be introduced to images of cats and dogs which are tagged as either cats or dogs. This is done manually, though often with the aid of software. This part of supervised learning allows the model to form a basis for future learning. The initial formation of a pattern in both the input and output data defines the requirements of the data to be collected. Therefore, before data collection is initialized, there is a need to delineate the data parameters and the intended information to be retrieved from the data.

2.2.3 Data augmentation
Data augmentation is a data preparation strategy used to increase data diversity for deep learning model training [8]. It involves construction of iterative optimization with the aim of developing new training data from already existing data. It allows for the introduction of unobserved data, or of variables that are inferred through mathematical models [9]. While not always necessary, it is essential when the data to be trained on is complex and the available volume of sampled data is small. Data augmentation solves the problems of limited data and model overfitting [10].

2.2.4 Data aggregation
Data aggregation is a technique of reducing the volume of data through grouping. This grouping is usually of a single attribute. For instance, when one has a data set with the attribute time organized in days over a given time series, one can aggregate the data into monthly groups, which eases dealing with the time attribute. It aids in reducing the broadness of a given attribute without tangible losses during future data manipulation [10].

2.3 DATA CLEANING
Data cleaning, also referred to as data cleansing, is the technique of detecting and correcting errors and inaccuracies in the collected data [11]. Data is supposed to be consistent with the input requirement of the machine learning model. The main activities in data cleansing involve the fine-tuning of noisy data and dealing with missing data. It aids in ensuring the collected data set is comprehensive and that any errors and biases that may have arisen in data collection have been eliminated. This includes the detection of outliers within the data set, for both numerical and non-numerical data sets.

2.3.1 Exploratory Data Analysis
In this stage, exploratory data analysis (EDA) is used; it is a technique that aims at understanding the characteristics and attributes of the data sets [12]. It aids the data scientist in becoming more familiarized with the data collected. In exploratory data analysis, statistical tools and techniques are applied in building hypotheses on the information that can be attained from the collected data, and it sometimes involves data visualization. Data visualization allows for the understanding of data properties such as skewness and outliers.

Exploratory data analysis is mainly done on statistical manipulation software. The graphical techniques allow for understanding the distribution of the data set and the statistical summary of all attributes. EDA informs future decisions, such as the data cleansing techniques to be used, what data transformations are necessary, and whether data reduction is necessary and, if so, what technique to use. Exploratory data analysis is a continuous process all through data preparation.

2.3.2 Missing Data
While it is important to ascertain during data collection that all the attributes of the data sets have their real values collected, data sometimes has some attributes with missing values, which makes it hard to use as input in machine learning models. As such, different techniques have been outlined on how to deal with missing data. Data manipulation platforms such as Python and R have some of these techniques for dealing with missing data embedded in them. The best technique usually varies with the data set, and hence after data assessment in the exploratory data analysis, one can easily select the best technique for missing data imputation.

2.3.2.1 Deductive Imputation
Deductive imputation follows the basic rules of logic, and is hence the easiest imputation, though the most time consuming. Even so, its results are usually highly accurate. For instance, if student data indicates that the total number of students is 10 and the total number of examination papers is 10, but there is a paper with a missing name and John has no marks recorded, logic dictates the nameless paper is John's. However, deductive imputation is not applicable to all types of data sets [13].

2.3.2.2 Mean/Median/Mode Imputation
This imputation uses statistical techniques where the measures of central tendency within a certain attribute are computed and the missing values are replaced with the computed measure of central tendency, be it the mean, mode or median of that attribute [13]. This technique is applied in numerical data sets, and its impact on the output or later computations is trivial.
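The mean imputation described above can be illustrated with a small standard-library Python sketch; the helper name and the sample scores are made-up for demonstration, and in practice one would use the embedded routines the text mentions (for example pandas' fillna in Python, or equivalent functions in R).

```python
from statistics import mean

def impute_with_mean(values):
    # Compute the mean over the observed (non-missing) entries only,
    # then substitute it for every missing entry (represented as None).
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

scores = [4.0, None, 6.0, 8.0, None]   # hypothetical attribute with gaps
print(impute_with_mean(scores))        # [4.0, 6.0, 6.0, 8.0, 6.0]
```

Replacing missing values with the median instead is a one-line change (statistics.median), and is often preferred when the attribute is skewed.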


2.3.3 Noisy Data
Presence of noisy data can have a substantial effect on the output of a machine model. It negatively impacts prediction of information, ranking of results, and the accuracy of clustering and classification [14]. Noisy data includes unnecessary information in the data, redundant data values, and duplicate or pointless data values. These result from faultiness in the collection of data, problems in data entry, problems that occur from the data transfer techniques applied, uneven naming conventions in the data, and sometimes technology restrictions, as in the case of unstructured data. Noisy data is eliminated through the following methods.

2.3.3.1 Binning Method
This involves arranging data into groups of given intervals, and is used in smoothing ordered data. The binning method relies on the measures of central tendency and is done in one of three ways: smoothing by bin means, smoothing by bin medians, and smoothing by bin boundaries.

2.3.3.2 Regression
Linear regression is a statistical and supervised machine learning technique that predicts particular data based on existing data [15]. Simple linear regression is used to compute the best line of fit based on existing data, and hence outliers in the data can be identified. To attain the best line of fit, a regression function is developed based on the previously collected data. However, it is important to note that though in some data sets extreme outliers are considered noisy data, the outliers can be essential to the model.

For instance, if an online retailer has its market within countries in Europe and a trivial market in the United States, the United States may be considered an extreme outlier, and hence noisy data. However, a machine learning model may realize that though a very small number of Americans use the online platform, they bring in more revenue than some of the countries in Europe. Simple linear regression uses one independent variable, whereas multiple linear regression uses more than one independent variable in its computations.

2.3.3.3 Clustering
Clustering is in the unsupervised machine learning category, and it operates by grouping the collected data set into clusters based on their attributes [16]. In clustering, the outliers in the data may fall within the clusters, and in the case that they are extreme outliers, they fall outside the clusters. To understand the effect of clustering, data visualization techniques are used. "Clustering methods don't use output information for training, but instead let the algorithm define the output" [17]. There are different techniques used in clustering.

In K-means clustering, K is the number of clusters to be made, and to do this the algorithm randomly selects K data points from the data set. These K data points are called the centroids, and every other data point in the data set is assigned to the closest centroid. This process is repeated for the new clusters created, and iterated until the centroids become constant, or fairly constant; this is the point at which convergence occurs. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is used in data set smoothing.

2.4 DATA TRANSFORMATION
Data transformation involves shifting the cleansed data from one format to another, from one structure to another, or changing the values in the cleansed data set to meet the requirements of the machine learning model [18]. The simplicity of the data transformation is highly dependent on the required input data and the available data set. Data transformation involves the following techniques.

2.4.1 Normalization
Normalization is a data transformation technique applied to numeric values of columns when there is need for a common scale. This transformation is achieved without loss of information, only changing how it is represented. For instance, in a data set with two columns that have different scales, such as one with values ranging from 100 to 1,000 and another column with a value range of 10,000 to 1,000,000, a difficulty may arise in the event that the two columns have to be used together in machine learning modelling. Normalization finds a solution by representing the same information without loss of distribution or ratios from the initial data set [19].

It is imperative to note that while normalization is sometimes only necessitated by the nature of the data sets, other times it is demanded by the machine learning algorithms being used. Normalization uses different mathematical techniques, such as the z-score in data standardization. The technique picked is usually decided depending on the nature and characteristics of the dataset; therefore, it is decided at the exploratory data analysis stage.

2.4.2 Attribute selection
In this transformation, latent attributes are created based on the available attributes in the data set to facilitate the data mining process [18]. The latent attributes created usually have no impact on the initial data source, and therefore can be ignored afterwards. Attribute transformation usually facilitates classification, clustering and regression algorithms. Basic attribute transformation involves decomposition of the available attributes through arithmetic or logical operations. For instance, a data set with a time attribute given in months can have its month attribute decomposed to weeks, or aggregated to years, depending on the requirements.

2.4.3 Discretization
In data transformation by discretization, intervals or labels are created, and all data points are eventually mapped to the created intervals or labels. The data in question is customarily numeric. There are different statistical techniques used in the discretization of data sets. The binning method is used on ordered data: data intervals called bins are created, and all the data points are mapped into them. In data discretization by histogram analysis, histograms are used to divide the values of the attribute into disjoint ranges to which all other data points are mapped. Both binning and histogram analysis are unsupervised data discretization methods.

In data discretization by decision tree analysis, the algorithm picks the attribute with the minimum entropy, and uses its minimum value as the point from which, in iterations, it partitions the resulting intervals until it attains as many different groups as possible [20]. This discretization is hierarchical, hence its name. To use an analogy, it is like dividing a room into two equal parts, and continuously dividing the resulting partitions into two other equal parts; only in this case, the room has multi-varied contents and we want each different content in its own space at the end of the partitioning. This discretization technique uses a top-down approach and is a supervised algorithm.
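As a minimal illustration of the binning flavour of discretization described above, the sketch below maps numeric values to equal-width intervals; the three-bin choice, the helper name and the sample ages are assumptions for demonstration (pandas' cut performs the same operation).

```python
def equal_width_bins(values, n_bins):
    # Split the [min, max] range into n_bins equal-width intervals and
    # return, for each value, the index of the interval it falls into.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        labels.append(idx)
    return labels

ages = [3, 7, 12, 18, 25, 33, 41, 60]
print(equal_width_bins(ages, 3))    # [0, 0, 0, 0, 1, 1, 2, 2]
```

Equal-width binning is unsupervised, as the text notes: the interval boundaries depend only on the attribute's range, not on any output label.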


Data discretization by correlation analysis is highly dependent on mathematical tools and applies a bottom-up approach, unlike decision trees [20]. It maps data points to data intervals by finding the best neighbouring interval for each data point and merging the intervals. It then recursively repeats the process to create one large interval. It is a supervised machine learning methodology.

2.4.4 Concept Hierarchy Generation
In concept hierarchy data transformation, low-level concepts within the attributes are mapped to higher-level concepts [21]. Most of these concepts are normally implied in the initial data set, and hence the technique is embedded in statistical software. It follows a bottom-up approach. For instance, in the location dimension, cities can be mapped to their states, their provinces, their countries and eventually their continents.

2.5 DATA REDUCTION
With the advancement of trends in information technology and the exponential growth of the internet of things, there has been an eventual precipitous increase in the volumes of available data. This is a huge benefit to machine learning, as the availability of big data for training the models ascertains accuracy in the information output from such models. Nonetheless, handling and analyzing these enormous volumes of data is a big challenge, hence the need for data reduction techniques. Data reduction reduces the cost of analyzing and storing these volumes of data by increasing storage efficiency. The different techniques used in data reduction include the following.

2.5.1 Data cube aggregation
A data cube is an n-dimensional array that uses mathematical tensors to represent information. The online analytical processing (OLAP) cube stores data in a multidimensional form, which occupies less storage space compared to a unidimensional storage technique [22]. To access data from the OLAP cube, the Multidimensional Expressions (MDX) query language is used. The query language includes the roll-up, drill-down, slice and dice, and pivot operations. These operations allow access to the required attributes of the data from the cube without removing the data from the data cube, hence saving on space.

2.5.2 Attribute subset selection
Attribute subset selection, also known as feature selection, is a part of feature engineering, and it involves the discovery of the smallest possible subset of attributes that would yield the same results, or closest to the same results, on data mining as when using all the attributes [23]. This technique ensures that only what is completely necessary from the initial data set is used in the modelling. This simplifies detection of insights, patterns and information from the data set while saving on analysis and storage costs.

2.5.3 Numerosity reduction
In numerosity reduction, data is reduced and made feasible for analysis through replacement of the original data with a model of the data that preserves the integrity of the initial data [24]. Two statistical methods are used in the creation of the representational model. In the parametric method, regression and log-linear methods are used in the development of the representational model. Non-parametric methods encompass the use of clustering, sampling, histograms and data cube aggregation to represent the whole data population during computations and storage.

3. POSSIBLE BIASES IN DATA PREPARATION
Bias in the data to be trained in the machine learning model leads to consequent wrong information output. It is imperative to identify the source of any bias in your data set during data preparation and eliminate it [25]. Sample bias occurs at data collection, where the selected data sample is not the right representation of the population under study; hence it is also called selection bias. For instance, an iris scan recognition system trained entirely on the iris scans of Africans will not efficiently identify the eyes of the white population.

Exclusion bias is common in the data cleansing stage, where there is deletion or misrepresentation of a part of the data, leading to it being excluded from the model training. Measurement bias occurs during data collection, where the system of collecting input data is not the same as that of collecting output data. Additionally, it occurs during data labelling, where non-uniform data labelling results in faulty predictions from the machine learning model. Recall bias also occurs at the data labelling stage, where the labelling is inconsistent [25].

Observer bias is a data fallacy where the person dealing with the data assumes the observation to be what they expected, as opposed to the real observation. Data scientists and researchers are encouraged to operate on an objective rather than subjective approach to avoid this bias [19]. Another is racial bias; the best example of this bias is in talk-back engines, where the model was largely trained on the voice data of the white population, and hence it hardly recognizes the voices of the black population [19]. Association bias occurs when a data set has created an implicit association between attributes. The main association bias is gender bias, as in the case where a system is trained with all school principals being males, and hence eventually disqualifies the plausibility of a female school principal [25].

4. CONCLUSION
Many machine learning predictive systems and models are affected by the kind of data that is used as input to the models. Results of the predictive models are determined by the machine learning algorithm function and the kind of data input. Biased data will produce biased results. Equally, 'dirty' data will produce wrong results, or output that cannot be relied upon.

It is imperative to have clean data to fit into the machine learning models so as to have the models learn correctly and predict accurately. There is a high chance that inaccurate results from machine learning models are caused by improperly prepared input data. Therefore, for ensuring the explainability and reliability of machine learning predictive models that are used to develop intelligent systems, clean, prepared data is significant.


Digital data sources such as the internet of things, which is a major source of real-world data, have noisy, inconsistent and missing data, which when used in predictive modelling with machine learning functions can result in erroneous and inaccurate results. The need to remove such inconsistencies in input data cannot be overemphasized. Clean data, formatted and organized to the required standard of the machine learning function, goes a long way in contributing towards better machine learning models with reliable results. There is more to data preparation than has been included in this work. In future, we look to define different types of data and their various pre-processing methods.

5. ACKNOWLEDGMENTS
My thanks to all authors whom I have referenced below for their research works, which were insightful and helped to compile the above findings.

6. REFERENCES
[1] applications and research directions." SN Computer Science 2, no. 3 (2021): 1-21.
[2] Altexsoft. (2018, June 16). Preparing Your Dataset for Machine Learning: 8 Basic Techniques That Make Your Data Better. Retrieved on July 29, 2020 from: https://www.altexsoft.com/blog/datascience/preparing-your-dataset-for-machine-learning-8-basic-techniques-that-make-your-data-better/
[3] Bengfort, B., & Kim, J. (2016). Data analytics with Hadoop: an introduction for data scientists. O'Reilly Media, Inc.
[4] El-Amir, H., & Hamdy, M. (2020). Data Wrangling and Preprocessing. In Deep Learning Pipeline (pp. 147-206). Apress, Berkeley, CA. Retrieved from: https://doi.org/10.1007/978-1-4842-5349-6_
[5] García, S., Luengo, J., & Herrera, F. (2015). Data preprocessing in data mining (Vol. 72, pp. 59-139). Cham, Switzerland: Springer International Publishing.
[6] Brownlee, J. (2020). Data preparation for machine learning: data cleaning, feature selection, and data transforms in Python. Machine Learning Mastery.
[7] Roh, Y., Heo, G., & Whang, S. E. (2019). A survey on data collection for machine learning: a big data-AI integration perspective. IEEE Transactions on Knowledge and Data Engineering. Retrieved from: https://ieeexplore.ieee.org/abstract/document/8862913
[8] Ho, D., Liang, E., & Liaw, R. (2019, June 7). 1000x Faster Data Augmentation. Berkeley Artificial Intelligence Research. Retrieved on July 29, 2020.
[9] Antoniou, A., Storkey, A., & Edwards, H. (2017). Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340.
[10] Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 60. Retrieved from: https://doi.org/10.1186/s40537-019-0197-0
[11] Murata, K., Noda, H., & Haraguchi, M. (2017). U.S. Patent No. 9,558,151. Washington, DC: U.S. Patent and Trademark Office. Retrieved from: https://patents.google.com/patent/US9558151B2/en
[12] Jebb, A. T., Parrigon, S., & Woo, S. E. (2017). Exploratory data analysis as a foundation of inductive research. Human Resource Management Review, 27(2), 265-276. Retrieved from: https://doi.org/10.1016/j.hrmr.2016.08.003
[13] Van der Loo, M., & de Jonge, E. (2017). deductive: Data Correction and Imputation Using Deductive Methods. R package version 0.1.2.
[14] Gupta, S., & Gupta, A. (2019). Dealing with Noise Problem in Machine Learning Data-sets: A Systematic Review. Procedia Computer Science, 161, 466-474. Retrieved from: https://doi.org/10.1016/j.procs.2019.11.146
[15] Elgabry, O. (2019, March 1). The Ultimate Guide to Data Cleaning. Retrieved on July 27, 2020 from: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
[16] Gupta, A., & Merchant, P. S. (2016). Automated lane detection by k-means clustering: a machine learning approach. Electronic Imaging, 2016(14), 1-6. Retrieved from: https://doi.org/10.2352/ISSN.2470-1173.2016.14.IPMVA-386
[17] Castañón, J. (2019, May 2). 10 Machine Learning Methods that Every Data Scientist Should Know. Retrieved on July 26, 2020 from: https://towardsdatascience.com/10-machine-learning-methods-that-every-data-scientist-should-know-3cc96e0eeee9
[18] Malik, K. R., Ahmad, T., Farhan, M., Aslam, M., Jabbar, S., Khalid, S., & Kim, M. (2016). Big-data: transformation from heterogeneous data to semantically-enriched simplified data. Multimedia Tools and Applications, 75(20), 12727-12747. Retrieved from: https://doi.org/10.1007/s11042-015-2918-5
[19] Microsoft. (2020, April 7). Bias in Machine Learning. Retrieved on July 31, 2020 from: https://devblogs.microsoft.com/premier-developer/bias-in-machine-learning/
[20] Ramírez-Gallego, S., García, S., Mouriño-Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos, A., ... & Herrera, F. (2016). Data discretization: taxonomy and big data challenge. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 6(1), 5-21. Retrieved from: https://doi.org/10.1002/widm.1173
[21] Swamy, M. K., & Reddy, P. K. (2020). A model of concept hierarchy-based diverse patterns with applications to recommender system. International Journal of Data Science and Analytics, 1-15. Retrieved from: https://doi.org/10.1007/s41060-019-00203-2
[22] Shen, H., Zhang, M., & Shen, J. (2017). Efficient privacy preserving cube-data aggregation scheme for smart grids. IEEE Transactions on Information Forensics and Security, 12(6), 1369-1381. Retrieved from: https://ieeexplore.ieee.org/document/7828093
[23] Demisse, G. B., Tadesse, T., & Bayissa, Y. (2017). Data Mining Attribute Selection Approach for Drought Modeling: A Case Study for Greater Horn of Africa. arXiv preprint arXiv:1708.05072.
[24] Deepak, J. (n.d.). Numerosity Reduction in Data Mining. Retrieved on July 25, 2020 from: https://www.geeksforgeeks.org/numerosity-reduction-in-data-mining/
[25] Liam, H. (2020, July 20). 7 Types of Data Bias in Machine Learning. Retrieved on July 31, 2020 from: https://lionbridge.ai/articles/7-types-of-data-bias-in-machine-learning/
