Data Preparation For Machine Learning Models
Abstract: The world today is in the midst of Revolution 4.0, which is data-driven. The majority of organizations and systems use data to solve problems through digitized systems. Data lets intelligent systems and their applications learn and adapt to mined insights without being explicitly programmed. Data mining and analysis require smart tools, techniques and methods capable of extracting useful patterns, trends and knowledge, which organizations can use as business intelligence as they map their strategic plans. Predictive intelligent systems can be very useful in various fields as solutions to many existential issues. Accurate output from such predictive intelligent systems can only be ascertained by having well-prepared data that suits the predictive machine learning function. Machine learning models learn from input data on the 'garbage-in-garbage-out' principle: cleaned, pre-processed and consistent data produces accurate output, whereas inconsistent, noisy and erroneous data does not.
www.ijcat.com 231
International Journal of Computer Applications Technology and Research
Volume 11–Issue 06, 231-235, 2022, ISSN:-2319–8656
DOI:10.7753/IJCATR1106.1008
impact on the initial data source, and the output of the model. Data reduction reduces the overall cost of data analysis, and saves on the time that would otherwise be spent on future data processing.

The four main steps of data preparation are data collection, data cleaning, data transformation and data reduction.

2.2 DATA COLLECTION
Data collection is the initial stage of data preparation, and it involves deciding on the data set depending on the expected output of the machine learning model to be trained. Essentially, collection of the right data set ascertains the right data output. Data collection consists of data acquisition, data labelling, data augmentation, data integration and data aggregation.

2.2.1 Data acquisition
Data acquisition involves identifying the data source, defining the methodology of collecting the data, and converting the collected data into digital form for computation. The data source can be primary, where data is obtained straight from the persons, objects or processes being studied. When the data source is a party that had previously collected the data, it is termed a secondary source. The methodology of data collection varies depending on the expected output. Statistical tools and techniques are applied in the collection of both qualitative and quantitative data.

2.2.2 Data labelling
As machine learning advances, deep learning techniques have been developed that automate the generation of features from data sets, and hence require high volumes of labelled data [7]. Data labelling is the process through which data models are trained through the tagging of data samples. For instance, if a model is expected to tell the difference between images of cats and dogs, it is initially introduced to images of cats and dogs that are tagged as either cats or dogs. This is done manually, though often with the aid of software. This part of supervised learning allows the model to form a basis for future learning. The initial formation of a pattern in both the input and output data defines the requirements of the data to be collected. Therefore, before data collection is initialized, there is need to delineate the data parameters and the intended information to be retrieved from the data.

2.2.3 Data augmentation
Data augmentation is a data preparation strategy used to increase data diversity for deep learning model training [8]. It involves construction of iterative optimization with the aim of developing new training data from already existing data. It allows for the introduction of unobserved data, or of variables that are inferred through mathematical models [9]. While not always necessary, it is essential when the data being trained on is complex and the available volume of sampled data is small. Data augmentation solves the problems of limited data and model overfitting [10].

2.2.4 Data aggregation
Data aggregation is a technique of reducing the volume of data through grouping, usually on a single attribute. For instance, when one has a data set with a time attribute organized in days over a given time series, one can aggregate the data into monthly groups, which eases dealing with the time attribute. It aids in reducing the broadness of a given attribute without tangible losses during future data manipulation [10].

2.3 DATA CLEANING
Data cleaning, also referred to as data cleansing, is the technique of detecting and correcting errors and inaccuracies in the collected data [11]. Data is supposed to be consistent with the input requirements of the machine learning model. The main activities in data cleansing involve the fine-tuning of noisy data and dealing with missing data. It aids in ensuring that the collected data set is comprehensive and that any errors and biases that may have arisen in data collection have been eliminated. This includes the detection of outliers within the data set, for both numerical and non-numerical data sets.

2.3.1 Exploratory Data Analysis
In this stage, exploratory data analysis (EDA) is used; it is a technique that aims at understanding the characteristics and attributes of the data sets [12]. It helps the data scientist become more familiar with the collected data. In exploratory data analysis, statistical tools and techniques are applied in building hypotheses on the information that can be attained from the collected data, and it sometimes involves data visualization. Data visualization allows for the understanding of data properties such as skewness and outliers.

Exploratory data analysis is mainly done in statistical manipulation software. The graphical techniques allow for understanding the distribution of the data set and the statistical summary of all attributes. EDA informs later decisions such as which data cleansing techniques to use, what data transformations are necessary, and whether data reduction is necessary and, if so, which technique to use. Exploratory data analysis is a continuous process all through data preparation.

2.3.2 Missing Data
While it is important to ascertain during data collection that all the attributes of the data sets have their real values collected, data sometimes has attributes with missing values, which makes it hard to use as input in machine learning models. As such, different techniques have been outlined for dealing with missing data. Data manipulation platforms such as Python and R have some of these techniques embedded in them. The best technique usually varies with the data set, and hence, after data assessment in the exploratory data analysis, one can easily select the best technique for missing data imputation.

2.3.2.1 Deductive Imputation
Deductive imputation follows the basic rules of logic, and is hence the easiest imputation, though the most time consuming. Even so, its results are usually highly accurate. For instance, if student data indicates that the total number of students is 10 and the total number of examination papers is 10, but there is a paper with a missing name and John has no marks recorded, logic dictates that the nameless paper is John's. However, deductive imputation is not applicable to all types of data sets [13].

2.3.2.2 Mean/Median/Mode Imputation
This imputation uses statistical techniques where the measures of central tendency within a certain attribute are computed and the missing values replaced with the computed measure of central tendency, be it the mean, mode or median of that attribute [13]. This technique is applied in numerical data sets,
and the impact on the output or later computations is trivial. Data manipulation platforms such as Python and R have techniques for dealing with missing data embedded in them.

2.3.3 Noisy Data
The presence of noisy data can have a substantial effect on the output of a machine learning model. It negatively impacts the prediction of information, the ranking of results, and accuracy in clustering and classification [14]. Noisy data includes unnecessary information in the data, redundant data values, and duplicate or pointless data values. These result from faultiness in the collection of data, problems in data entry, problems arising from the data transfer techniques applied, uneven naming conventions in the data, and sometimes technology restrictions, as in the case of unstructured data. Noisy data is eliminated through the following methods.

2.3.3.1 Binning Method
This involves arranging data into groups of given intervals, and is used in smoothing ordered data. The binning method relies on the measures of central tendency and is done in one of three ways: smoothing by bin means, smoothing by bin medians, or smoothing by bin boundaries.

2.3.3.2 Regression
Linear regression is a statistical and supervised machine learning technique that predicts particular data based on existing data [15]. Simple linear regression is used to compute the best line of fit based on existing data, and hence outliers in the data can be identified. To attain the best line of fit, a regression function is developed based on the previously collected data. However, it is important to note that though in some data sets extreme outliers are considered noisy data, the outliers can be essential to the model.

For instance, if an online retail company has its market within countries in Europe and a trivial market in the United States, the United States may be considered an extreme outlier, and hence noisy data. However, a machine learning model may realize that though a very small number of Americans use the online platform, they bring in more revenue than some of the countries in Europe. Simple linear regression uses one independent variable, whereas multiple linear regression uses more than one independent variable in its computations.

2.3.3.3 Clustering
Clustering is in the unsupervised machine learning category, and it operates by grouping the collected data set into clusters based on their attributes [16]. In clustering, the outliers in the data may fall within the clusters, and in the case that they are extreme outliers, they fall outside the clusters. To understand the effect of clustering, data visualization techniques are used. "Clustering methods don't use output information for training, but instead let the algorithm define the output" [17]. There are different techniques used in clustering.

In K-means clustering, K is the number of clusters to be made, and to do this the algorithm randomly selects K data points from the data set. These K data points are called the centroids, and every other data point in the data set is assigned to the closest centroid. This process is repeated for all the new K data sets created, and the process is iterated until the centroids become constant, or fairly constant; this is the point at which convergence occurs. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is used in data set smoothing.

2.4 DATA TRANSFORMATION
Data transformation involves shifting the cleansed data from one format to another, from one structure to another, or changing the values in the cleansed data set to meet the requirements of the machine learning model [18]. The simplicity of the data transformation is highly dependent on the required input data and the available data set. Data transformation involves:

2.4.1 Normalization
Normalization is a data transformation technique applied to the numeric values of columns when there is need for a common scale. This transformation is achieved without loss of information, changing only how it is represented. For instance, in a data set with two columns that have different scales, such as one with values ranging from 100 to 1,000 and another with values ranging from 100,000 to 1,000,000, a difficulty may arise in the event that the two columns have to be used together in machine learning modelling. Normalization solves this by finding a way of representing the same information without loss of the distribution or ratios of the initial data set [19].

It is imperative to note that while normalization is sometimes only necessitated by the nature of the data set, at other times it is demanded by the machine learning algorithms being used. Normalization uses different mathematical techniques, such as the z-score in data standardization. The technique picked is usually decided depending on the nature and characteristics of the data set; therefore, it is decided at the exploratory data analysis stage.

2.4.2 Attribute selection
In this transformation, latent attributes are created based on the available attributes in the data set to facilitate the data mining process [18]. The latent attributes created usually have no impact on the initial data source, and therefore can be ignored afterwards. Attribute transformation usually facilitates classification, clustering and regression algorithms. Basic attribute transformation involves decomposition of the available attributes through arithmetic or logical operations. For instance, a data set with a time attribute given in months can have its month attribute decomposed to weeks, or aggregated to years, depending on the requirements.

2.4.3 Discretization
In data transformation by discretization, intervals or labels are created, and all data points are eventually mapped to the created intervals or labels. The data in question is customarily numeric. There are different statistical techniques used in the discretization of data sets. The binning method is used on ordered data: data intervals called bins are created, and all the data points are mapped into them. In data discretization by histogram analysis, histograms are used to divide the values of the attribute into disjoint ranges to which all other data points are mapped. Both binning and histogram analysis are unsupervised data discretization methods.

In data discretization by decision tree analysis, the algorithm picks the attribute with the minimum entropy, and uses its minimum value as the point from which, in iterations, it partitions the resulting intervals until it attains as many distinct groups as possible [20]. This discretization is hierarchical, hence its name. To use an analogy, it is like dividing a room into two equal parts, and continuously dividing the resulting partitions into two other equal parts. Only in this case, the room has multi-varied contents and we want each different content in
its own space at the end of the partitioning. This discretization technique uses a top-down approach and is a supervised algorithm.

the use of clustering, sampling, histograms and data cube aggregation to represent the whole data population during computations and storage.
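Two of the preparation steps covered earlier, mean imputation of missing values (2.3.2.2) and min-max normalization to a common scale (2.4.1), can be sketched together in plain Python. The helper names and the sample revenue column are illustrative assumptions only:

```python
def mean_impute(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max_scale(column):
    """Rescale a numeric column to [0, 1] without changing its distribution."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

revenue = [100, None, 550, 1000]   # a column on a 100-to-1,000 scale
cleaned = mean_impute(revenue)     # [100, 550.0, 550, 1000]
print(min_max_scale(cleaned))      # → [0.0, 0.5, 0.5, 1.0]
```

In practice these operations are usually delegated to library routines (for example pandas' `fillna`, or scikit-learn's `SimpleImputer` and `MinMaxScaler`), but the underlying arithmetic is as shown.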
4. CONCLUSION
Digital data sources such as the internet of things, a major source of real-world data, have noisy, inconsistent and missing data, which when used in predictive modelling with machine learning functions can result in erroneous and inaccurate results. The importance of removing such inconsistencies from input data cannot be overemphasized. Clean data that is formatted and organized to the required standard of the machine learning function goes a long way in contributing towards better machine learning models with reliable results. There is more to data preparation than has been included in this work. In future, we look to define different types of data and their various pre-processing methods.

5. ACKNOWLEDGMENTS
My thanks to all the authors referenced below for their research works, which were insightful and helped in compiling the above findings.

6. REFERENCES
[1] Sarker, I. H. (2021). Machine learning: Algorithms, real-world applications and research directions. SN Computer Science, 2(3), 1-21.
[2] Altexsoft. (2018, June 16). Preparing Your Dataset for Machine Learning: 8 Basic Techniques That Make Your Data Better. Retrieved July 29, 2020, from https://www.altexsoft.com/blog/datascience/preparing-your-dataset-for-machine-learning-8-basic-techniques-that-make-your-data-better/
[3] Bengfort, B., & Kim, J. (2016). Data analytics with Hadoop: an introduction for data scientists. O'Reilly Media, Inc.
[4] El-Amir, H., & Hamdy, M. (2020). Data Wrangling and Preprocessing. In Deep Learning Pipeline (pp. 147-206). Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-5349-6_
[5] García, S., Luengo, J., & Herrera, F. (2015). Data preprocessing in data mining (Vol. 72, pp. 59-139). Cham, Switzerland: Springer International Publishing.
[6] Brownlee, J. (2020). Data preparation for machine learning: data cleaning, feature selection, and data transforms in Python. Machine Learning Mastery.
[7] Roh, Y., Heo, G., & Whang, S. E. (2019). A survey on data collection for machine learning: a big data-AI integration perspective. IEEE Transactions on Knowledge and Data Engineering. https://ieeexplore.ieee.org/abstract/document/8862913
[8] Ho, D., Liang, E., & Liaw, R. (2019, June 7). 1000x Faster Data Augmentation. Berkeley Artificial Intelligence Research. Retrieved July 29, 2020.
[9] Antoniou, A., Storkey, A., & Edwards, H. (2017). Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340.
[10] Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 60. https://doi.org/10.1186/s40537-019-0197-0
[11] Murata, K., Noda, H., & Haraguchi, M. (2017). U.S. Patent No. 9,558,151. Washington, DC: U.S. Patent and Trademark Office. https://patents.google.com/patent/US9558151B2/en
[12] Jebb, A. T., Parrigon, S., & Woo, S. E. (2017). Exploratory data analysis as a foundation of inductive research. Human Resource Management Review, 27(2), 265-276. https://doi.org/10.1016/j.hrmr.2016.08.003
[13] Van der Loo, M., & de Jonge, E. (2017). deductive: Data Correction and Imputation Using Deductive Methods. R package version 0.1.2.
[14] Gupta, S., & Gupta, A. (2019). Dealing with Noise Problem in Machine Learning Data-sets: A Systematic Review. Procedia Computer Science, 161, 466-474. https://doi.org/10.1016/j.procs.2019.11.146
[15] Elgabry, O. (2019, March 1). The Ultimate Guide to Data Cleaning. Retrieved July 27, 2020, from https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
[16] Gupta, A., & Merchant, P. S. (2016). Automated lane detection by k-means clustering: a machine learning approach. Electronic Imaging, 2016(14), 1-6. https://doi.org/10.2352/ISSN.2470-1173.2016.14.IPMVA-386
[17] Castañón, J. (2019, May 2). 10 Machine Learning Methods that Every Data Scientist Should Know. Retrieved July 26, 2020, from https://towardsdatascience.com/10-machine-learning-methods-that-every-data-scientist-should-know-3cc96e0eeee9
[18] Malik, K. R., Ahmad, T., Farhan, M., Aslam, M., Jabbar, S., Khalid, S., & Kim, M. (2016). Big-data: transformation from heterogeneous data to semantically-enriched simplified data. Multimedia Tools and Applications, 75(20), 12727-12747. https://doi.org/10.1007/s11042-015-2918-5
[19] Microsoft. (2020, April 7). Bias in Machine Learning. Retrieved July 31, 2020, from https://devblogs.microsoft.com/premier-developer/bias-in-machine-learning/
[20] Ramírez-Gallego, S., García, S., Mouriño-Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos, A., ... & Herrera, F. (2016). Data discretization: taxonomy and big data challenge. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 6(1), 5-21. https://doi.org/10.1002/widm.1173
[21] Swamy, M. K., & Reddy, P. K. (2020). A model of concept hierarchy-based diverse patterns with applications to recommender system. International Journal of Data Science and Analytics, 1-15. https://doi.org/10.1007/s41060-019-00203-2
[22] Shen, H., Zhang, M., & Shen, J. (2017). Efficient privacy-preserving cube-data aggregation scheme for smart grids. IEEE Transactions on Information Forensics and Security, 12(6), 1369-1381. https://ieeexplore.ieee.org/document/7828093
[23] Demisse, G. B., Tadesse, T., & Bayissa, Y. (2017). Data Mining Attribute Selection Approach for Drought Modeling: A Case Study for Greater Horn of Africa. arXiv preprint arXiv:1708.05072.
[24] Deepak, J. (n.d.). Numerosity Reduction in Data Mining. Retrieved July 25, 2020, from https://www.geeksforgeeks.org/numerosity-reduction-in-data-mining/
[25] Liam, H. (2020, July 20). 7 Types of Data Bias in Machine Learning. Retrieved July 31, 2020, from https://lionbridge.ai/articles/7-types-of-data-bias-in-machine-learning/