Chapter 4
All rights reserved. This book or any portion thereof may not be reproduced or used in any manner whatsoever without the publisher's express written permission except for the use of brief quotations in a book review or scholarly journal.
CHAPTER FOUR:
DATA SCIENCE APPLICATIONS
A BEGINNER’S GUIDE TO DATA SCIENCE
ENAMUL HAQUE
Statistical sample
In statistics, the entire set of raw data you might have available for a test or experiment is referred to as the population. For several reasons, you can't necessarily measure patterns and trends across the whole population. Instead, we can take a sample, perform some calculations on this data, and use the results to estimate characteristics of the population.
Descriptive statistics
Descriptive statistics help us, as the name suggests, to describe the data. In other words, they allow us to understand its underlying characteristics. They do not predict anything, make assumptions, or infer anything; they just describe what the data sample we have looks like.
Descriptive statistics are derived from calculations, often referred to as parameters. These include things like:
Mean - the central value, commonly referred to as the average.
Median - the middle value once the data has been ordered from low to high and divided precisely in half.
Mode - the most common value.
Descriptive statistics are helpful but can often hide important information about the dataset. For example, suppose a dataset contains several numbers that are much larger than the others. In that case, the mean may be distorted and not accurately represent the data.
A distribution is a chart, often a histogram, that shows the number of times each value appears in a dataset. This type of chart gives us information about the spread and skewness of the data.
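Python's built-in statistics module can compute these parameters directly; the small made-up sample below also shows how a single outlier distorts the mean:

```python
import statistics

data = [4, 2, 5, 4, 7, 4, 9, 2, 5, 90]  # 90 is an outlier

print(statistics.mean(data))    # 13.2 - pulled far upward by the outlier
print(statistics.median(data))  # 4.5  - a more representative centre here
print(statistics.mode(data))    # 4    - the most common value
```

Note how the mean (13.2) describes this sample far worse than the median (4.5), exactly the distortion described above.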
Probability
Probability, in simple terms, is the likelihood of an event occurring. In statistics, an event is the result of an experiment, which can be something like a dice roll or an A/B test result.
The probability of a single event is calculated by dividing the number of favourable outcomes by the total number of possible outcomes. For example, if you roll a die, there are 6 possible outcomes. So the chance of rolling a six is 1/6 = 0.167; sometimes this is expressed as a percentage, i.e., 16.7%.
Events can be either independent or dependent. For dependent events, a previous event affects the subsequent event. Suppose we have a bag of M&Ms and want to determine the probability that a red M&M will be selected at random. If each time we remove the selected M&M from the bag, the likelihood of picking red changes because of the previous events.
Independent events are not affected by previous events. In the M&M bag case, if we put each M&M back in the bag after selecting it, the probability of choosing red remains the same each time.
Whether an event is independent or not matters because the way we calculate the probability of multiple events changes depending on the type.
The probability of multiple independent events is calculated by simply multiplying the probability of each event. Suppose we wanted to calculate the probability of rolling a six three times in the dice example. This would look like this:
1/6 = 0.167, 1/6 = 0.167, 1/6 = 0.167
0.167 * 0.167 * 0.167 = 0.005
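In Python, this multiplication is a one-liner, which makes a quick sanity check of the arithmetic above:

```python
# probability of rolling a six three times in a row
p_three_sixes = (1 / 6) ** 3
print(round(p_three_sixes, 3))  # 0.005
```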
The calculation is different for dependent events, also known as conditional probability. Returning to the M&M example, let's imagine we have a bag with only two colours, red and yellow, and we know that the pack contains 3 red and 2 yellow, and we want to calculate the probability of selecting two reds in a row. On the first selection, the probability of picking a red is 3/5 = 0.6. Having removed a red M&M, on the second selection the probability is 2/4 = 0.5. Therefore, the probability of picking two reds in a row is 0.6 * 0.5 = 0.3.
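The same conditional calculation can be checked with exact fractions:

```python
from fractions import Fraction

p_first_red = Fraction(3, 5)        # 3 red out of 5 M&Ms
p_second_red = Fraction(2, 4)       # 2 red out of the 4 remaining
p_two_reds = p_first_red * p_second_red

print(p_two_reds)         # 3/10
print(float(p_two_reds))  # 0.3
```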
Bias
As explained in the statistics section, we often use data samples to estimate the entire data set. Similarly, in predictive modelling we use some training data to create a model that can make predictions about new data.
Bias is the tendency of a statistical or predictive model to underestimate or overestimate a parameter. This is often due to the method of obtaining a sample or the way errors are measured. There are different types of bias in statistics. Here is a brief description of two of them.
Selection bias - This occurs when the sample is not randomly selected. Examples in data science include stopping an A/B test prematurely while it is still running, or choosing data to train a machine learning model from a specific period of time, which can mask seasonal effects.
Confirmation bias - This occurs when the person performing an analysis has a predetermined assumption about the data. In this situation, there may be a tendency to spend more time studying variables that are likely to support this assumption.
Variance and standard deviation
As explained earlier, the mean of a data sample is its central value. The variance measures how far each value in the dataset is from the mean. Essentially, it is a measure of the variation of the numbers in a data set.
The standard deviation is a common measure of the variation of normally distributed data. It is a calculation that produces a value indicating how spread out the values are. A low standard deviation indicates that the values tend to be reasonably close to the mean, while a high standard deviation indicates that the values are more spread out.
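Both measures are available in Python's standard statistics module; a small sketch with made-up numbers:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))       # 5   - the central value
print(statistics.pvariance(data))  # 4   - population variance
print(statistics.pstdev(data))     # 2.0 - population standard deviation
```

The standard deviation is simply the square root of the variance, which puts the spread back in the same units as the data.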
Correlation
Correlation is a statistical technique used to measure the relationship between two variables. The correlation is assumed to be linear (it forms a line when displayed in a chart) and is expressed as a number between +1 and -1, called the correlation coefficient.
A correlation coefficient of +1 denotes an entirely positive correlation (as the value of one variable increases, the value of the second variable also increases), a coefficient of 0 means no correlation, and a coefficient of -1 denotes an entirely negative correlation.
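The correlation coefficient (Pearson's r) can be computed directly from its definition; a minimal pure-Python sketch with illustrative data:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0  - entirely positive
print(round(pearson([1, 2, 3, 4], [8, 6, 4, 2]), 6))  # -1.0 - entirely negative
```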
Statistics is a wide and complex field. This section is intended as a brief introduction to some of the most commonly used statistical techniques in data science. Data science courses often require prior knowledge of these basic concepts, or start with descriptions that are too complex and difficult to understand. I hope this section serves as a refresher on a selection of basic statistical techniques used in data science before going into more advanced topics.
Model evaluation metrics are used to assess the goodness of fit between model and data, to compare different models in the context of model selection, and to predict how accurate the predictions (associated with a specific model and data set) are expected to be.2
Confidence interval
Confidence intervals are used to assess how reliable a statistical estimate is. Wide confidence intervals mean either that your model is flawed (and it is worth investigating other models) or, if the confidence intervals don't improve when you change the model (that is, when you test a different theoretical statistical distribution for your observations), that your data is very noisy. Modern confidence intervals are model-free and data-driven. A more general framework for assessing and reducing sources of variance is called the analysis of variance.
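One common model-free, data-driven way to obtain a confidence interval is the bootstrap: resample the data with replacement many times and look at the spread of the statistic. A sketch with simulated data (the underlying mean here is 50 by construction):

```python
import random
import statistics

random.seed(1)
sample = [random.gauss(50, 10) for _ in range(200)]  # simulated observations

# resample with replacement and collect the mean of each resample
boot_means = sorted(
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(2000)
)
low, high = boot_means[49], boot_means[1949]  # middle ~95% of resampled means
print(f"95% bootstrap CI for the mean: ({low:.1f}, {high:.1f})")
```

A narrow interval like this one suggests the estimate is stable; a much wider one would be the warning sign described above.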
Confusion matrix
This is used in the context of clustering. These N x N matrices (where N is the number of clusters) are constructed as follows: the element in cell (i, j) represents the number of observations in the test training set (as opposed to the control training set, in a cross-validation setting) that belong to cluster i and are assigned (by the clustering algorithm) to cluster j. When these numbers are transformed into proportions, these matrices are sometimes called contingency tables. A wrongly assigned observation is called a false positive or a false negative.
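The construction is straightforward; a small sketch with hypothetical cluster labels:

```python
true_cluster = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # cluster each observation belongs to
assigned     = [0, 0, 1, 1, 1, 1, 2, 0, 2]  # cluster the algorithm assigned

n = 3  # number of clusters
matrix = [[0] * n for _ in range(n)]
for i, j in zip(true_cluster, assigned):
    matrix[i][j] += 1  # cell (i, j): belongs to cluster i, assigned to cluster j

for row in matrix:
    print(row)
# the off-diagonal counts are the wrongly assigned observations
```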
Kolmogorov-Smirnov test
This non-parametric statistical test compares two distributions to assess how close they are to each other. In this context, one of the distributions is the theoretical distribution that the observations are supposed to follow (usually a continuous distribution with one or two parameters, such as the Gaussian distribution), while the other is the actual, empirical, parameter-free, discrete distribution computed from the observations.
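The KS statistic itself is just the largest gap between the empirical CDF of the sample and the theoretical CDF; a minimal sketch against a standard Gaussian (the sample values are made up):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the Gaussian distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    # check the gap just before and just after each jump of the empirical CDF
    return max(
        max(abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
        for i, x in enumerate(xs)
    )

sample = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.5]
print(round(ks_statistic(sample, normal_cdf), 3))
```

A small statistic suggests the observations are compatible with the theoretical distribution; a large one suggests they are not.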
Chi-square
This is another statistical test similar to Kolmogorov-Smirnov, but in this case it is a parametric test. It requires you to aggregate observations into a number of buckets or bins, each with at least 10 observations.
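The statistic itself sums the squared gaps between observed and expected bin counts; a sketch for 100 hypothetical rolls of a supposedly fair die:

```python
# chi-square statistic: sum over bins of (observed - expected)^2 / expected
observed = [18, 22, 16, 14, 12, 18]  # 100 hypothetical dice rolls in 6 bins
expected = [100 / 6] * 6             # a fair die: equal expected counts

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # 3.68
```

The statistic is then compared against the chi-square distribution with the appropriate degrees of freedom to decide whether the deviation is significant.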
ROC curve
Unlike the lift chart, the ROC curve is almost independent of the response rate. The receiver operating characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, or the sensitivity index d' ("d-prime") in signal detection and biomedical informatics, or recall in machine learning. The false positive rate is also known as the fall-out and can be calculated as (1 - specificity). The ROC curve is thus the sensitivity as a function of fall-out.
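Each point on the curve comes from one threshold; a small sketch computing (FPR, TPR) pairs for a hypothetical classifier's scores:

```python
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]  # classifier outputs
labels = [1,   1,   0,   1,   0,    0,   1,   0]    # true classes

def roc_point(threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return fp / labels.count(0), tp / labels.count(1)

for t in (0.85, 0.65, 0.35):
    print(t, roc_point(t))
```

Sweeping the threshold from high to low traces the curve from (0, 0) towards (1, 1).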
Gini Coefficient
The Gini coefficient is sometimes used in classification problems. Gini = 2 * AUC - 1, where AUC is the area under the curve (see the ROC curve entry above). A Gini coefficient above 60% corresponds to a good model. It is not to be confused with the Gini index or Gini impurity, used when building decision trees.
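For example, a classifier with a hypothetical AUC of 0.82 would score:

```python
auc = 0.82             # hypothetical area under the ROC curve
gini = 2 * auc - 1
print(round(gini, 2))  # 0.64, i.e. 64% - above 60%, so a good model
```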
Cross-Validation
This is a general framework to assess how a model will perform in the future; it is also used for model selection. It consists of splitting your training set into test and control data sets, training your algorithm (classifier or predictive algorithm) on the control data set, and testing it on the test data set. Since the actual values are known in the test data set, you can compare them with your predicted values using one of the comparison tools mentioned in this chapter. Usually, the test data set itself is split into multiple subsets or data bins to compute confidence intervals for predicted values. The test data set must be carefully selected and must include different time frames and various types of observations (compared with the control data set), each with enough data points, in order to get sound, reliable conclusions as to how the model will perform on future data, or on data that has slightly evolved. Another idea is to introduce noise into the test data set and see how it impacts predictions: this is referred to as model sensitivity analysis.
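The splitting step can be sketched in pure Python as a k-fold split, a common variant of the scheme described above (the fold count and data here are illustrative):

```python
import random

def k_fold_splits(data, k=5, seed=0):
    """Yield (train, test) pairs for k-fold cross-validation."""
    items = list(data)
    random.Random(seed).shuffle(items)       # shuffle reproducibly
    folds = [items[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(k_fold_splits(range(20), k=4))
for train, test in splits:
    print(len(train), len(test))  # 15 5 on each of the 4 iterations
```

Each observation appears in exactly one test fold, so every data point is predicted exactly once across the k iterations.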
Predictive Power
This is related to the concept of entropy or the Gini index mentioned above. It was designed as a synthetic metric satisfying interesting properties, and is used to select a good subset of features in any machine learning project, or as a criterion to decide which node to split at each iteration when building decision trees.
A data analyst might, for example, be given a problem to investigate in an online store. They will then use their knowledge of professional data analysis techniques to solve the problem.
The tasks that a data analyst solves depend on the industry in which they work. Governments use data analysis for applications such as protecting public health and predicting changes in the economy. Companies, on the other hand, use data analysis for everything from analysing your app experience to figuring out which features users like best on a website.
Coding
To become a successful data analyst, you need to be able to program. This is because data analysis is very individual work: each data set will be different.
Statistical analysis
Data analysts use statistical techniques to analyse datasets. This includes defining the scope of the data set and using statistical principles, such as probability, to understand the data and calculate the final results.
Data visualisation
Data analysts are responsible for creating visualisations that represent what they discovered through their analysis. This is an important part of the job because data analysts usually answer questions from people with no data analysis experience.
Data analysts should be able to report their findings to people without a technical background. A great way to do this is to use graphs, which are easier to interpret than lists of numbers. Data analytics tools, such as matplotlib3 and Tableau, allow data analysts to create graphics and visuals for their work.
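A minimal matplotlib sketch of such a graph (this assumes matplotlib is installed; the revenue figures are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render to a file, no display needed
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
revenue = [120, 135, 128, 150, 170]  # hypothetical monthly revenue, thousands

fig, ax = plt.subplots()
ax.bar(months, revenue)
ax.set_title("Monthly revenue")
ax.set_ylabel("Revenue (thousands)")
fig.savefig("revenue.png")  # a chart stakeholders can read at a glance
```

A bar chart like this communicates the trend far faster than the raw list of numbers would.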
Data storytelling
Data storytelling takes data visualisations to the next level: it refers to "how" you communicate your insights. Think of it as a picture book. A good picture book has good visuals, but it also has an engaging and powerful narrative that connects the visuals.
Business analytics
While it's easy to think that data analysts just sit around analysing data, they need to do so in the context of a broader business problem. Data analysts need to be well aware of business goals and of how data can help achieve them.
Data analysts work with people across the organisation every day to solve problems. This means that they need to know how to speak a language that engineers, directors, sellers, and other employees understand.
As a result, they constantly frame problems in terms of: "How can this help achieve our organisation's goals?" Analysts who specialise in this are called business analysts. They use diagnostic analysis to solve business problems.
Interpretation of data
Data analysts should be able to interpret the data. Not only do you need to know what a dataset can tell you, it's also important that you understand what it is actually telling you. It's useless to just know what data exists; you need to know what that data is saying.
After the analysis, the data analyst will read the data they work with to determine trends. These trends will be included in the final report along with any visualisations and graphs prepared by the analyst.
Data analysis is critical to our modern economy. Today, data analysis is used by the insurance industry to predict insurance claims, by the financial industry to predict the direction of the stock market, and by technology companies to analyse interactions with users.
Moreover, even governments rely on data to solve problems. This is because data can help an organisation make a more informed, data-based decision. When you have data to back up a solution, it's easier to be sure you're on the right track.
Typically, data analysts use their knowledge of mathematics, statistical analysis of datasets, and programming to solve business problems.
Data Cleanup
When data scientists talk about "cleaning up" data, it shouldn't be interpreted literally: data scientists do not scrub the data clean. Cleaning up the data means making a valuable dataset by removing or changing erroneous or irrelevant values.
Data scientists answer questions using data, and if a data scientist works with incorrect data, their conclusions are unlikely to be accurate.
What's more, cleaning up the data helps save time later. Cleaning the data precedes the analysis, which means that by the time the data scientist analyses the data and draws any conclusions, the dataset is prepared exactly as they want it.
Having a clean data set means that the data scientist can move on to the analysis knowing that they won't have to go back to fix incorrectly formatted values or remove inaccurate ones. Ultimately, the data scientist wants their dataset to make sense and include all the data necessary to draw a reasonable conclusion on the issue.
Removing redundant data ensures that the findings are based on the right values. If repetitive data remains in the dataset, the results may be skewed towards one outcome over another. This can have a significant impact on the accuracy of the final conclusions.
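A minimal sketch of both cleanup steps, dropping erroneous values and duplicate records (the records and the validity rule here are hypothetical):

```python
raw = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},    # erroneous value
    {"id": 3, "age": None},  # missing value
    {"id": 1, "age": 34},    # duplicate record
    {"id": 4, "age": 29},
]

seen = set()
clean = []
for row in raw:
    if row["age"] is None or not (0 <= row["age"] <= 120):
        continue                    # drop missing or erroneous values
    key = (row["id"], row["age"])
    if key in seen:
        continue                    # drop duplicate records
    seen.add(key)
    clean.append(row)

print(clean)  # [{'id': 1, 'age': 34}, {'id': 4, 'age': 29}]
```

Only after this step does an analysis of the remaining rows reflect the real data rather than its errors and repeats.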
Focus
Data mining is usually used as part of the business analysis process. Typically, data mining is not used outside the business environment because it is explicitly designed to help companies collect and understand their data. Data science, on the other hand, is a scientific discipline. Data scientists use it, among other things, to create predictive models, conduct experiments, and perform social analysis.
Data type
As a rule, data mining focuses only on structured data, although unstructured data can be used. For data scientists, using structured, unstructured, and semi-structured data is common. Data mining is a little easier in this respect because practitioners do not need to know how to work with all types of data, while data scientists probably will.
Goal
Data mining's primary goal is to make business data easy to understand and therefore available for use. Data science aims to achieve scientific advances and create data-driven products for use by different organisations. In general, data mining has a much more specific purpose than data science: its whole purpose is to study and organise company data and identify previously unknown trends.
According to Gartner, just over 137,000 driverless cars were produced in 2018, and more than 330,000 in 2019. Let us explore the basic concepts that will help you navigate the topic and understand how data science opens a whole new chapter with this technology. The technology simulates the human brain and its cognitive networks, which serve as the basis for a self-driving car.
Data scientists are the pioneers behind perfecting the brain of the beast (driverless cars). We must somehow figure out how to develop algorithms that master Perception, Localisation, Prediction, Planning, and Control.5
“Perception merges several different sensors to know where the road is and what is the state (type, position, speed) of each obstacle. Localisation uses precise maps and sensors to understand where the car is in its environment at the centimetre level. Prediction allows the car to anticipate the behaviour of objects in its surrounding. Planning uses the knowledge of the car’s position and obstacles to planning routes to a destination. The application of the law is coded here, and the algorithms define waypoints. Control is to develop algorithms to follow the waypoints efficiently.”6
Radars
Radars use radio waves to determine the distance to objects and the trajectory of their movement. As a rule, a driverless car carries four radars. The pulses they emit are reflected from objects, even distant ones, and transmitted back to the receiving antenna. Thanks to radars, the system can respond instantly to changes in the space around it.
Lidar
Lidar works on a principle similar to radar but uses laser beams instead of radio waves. Today it is the most accurate tool for measuring the distance to objects, from a couple of centimetres to hundreds of metres, and for recognising them. A lidar is installed on the car's roof or, if there are several lidars, along its perimeter. The device scans the space and creates a 3D map of the area.
Sensors
"Sensors" is the collective name for lidars and radars. They scan the traffic scene around the car to help avoid accidents. A driverless car can have as many as needed; for example, Roborace's first Robocar driverless race car is equipped with 18 sensors.
Position sensor
The position sensor is a device that determines the vehicle's position on the map down to its exact coordinates.
Video camera
The video camera is needed to distinguish the colours of traffic lights and to recognise road signs, markings, and people.
Computer
The computer sits in the trunk of a driverless car and analyses, in real time, all the data that comes from the sensors and cameras. The computer's power allows it to process a vast array of information.
Maps
High-precision maps allow driverless cars to drive even on roads that have no markings. The maps are needed so that the sensors and cameras only have to react to changes in the situation on the ground; otherwise, constantly scanning the surrounding space would require huge computing power. In addition, thanks to the maps, the car "understands" what is behind a turn, which cameras and sensors cannot see.
1. Rebecca Vickery, "8 Fundamental Statistical Concepts for Data Science", https://towardsdatascience.com/8-fundamental-statistical-concepts-for-data-science-9b4e8a0c6f1c
2. L.V, "11 Important Model Evaluation Techniques Everyone Should Know", https://www.datasciencecentral.com/profiles/blogs/7-important-model-evaluation-error-metrics-everyone-should-know
3. Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+. [Wikipedia]
4. Saira Tabassum, "Data Mining vs Data Science: The Key Differences for Data Analysts", https://careerkarma.com/blog/data-mining-vs-data-science/
5. Fei Qi, "The Data Science Behind Self-Driving Cars", https://medium.com/@feiqi9047/the-data-science-behind-self-driving-cars-eb7d0579c80b
6. Jeremy Cohen, AI & Self-Driving Car Engineer, https://www.thinkautonomous.ai