L.K.G.
KATARIYA COLLEGE
DATA ANALYTICS [TYBCS SEM VI]
CHAPTER 1. INTRODUCTION TO DATA ANALYTICS [Notes]
1. What do you mean by Data science?
Data science is the process of collecting, analysing, and interpreting data to uncover
patterns, gain insights, and make informed decisions.
Data science is the study of data that helps us derive useful insight for business decision
making. Data Science is all about using tools, techniques, and creativity to uncover insights
hidden within data. It combines math, computer science, and domain expertise to tackle real-
world challenges in a variety of fields.
Data science processes raw data to solve business problems and even to make predictions
about future trends or requirements.
For example, from the huge volume of raw data held by a company, data science can help answer the following
questions:
What do customers want?
How can we improve our services?
What will the upcoming trend in sales be?
How much stock will be needed for the upcoming festival?
In short, data science empowers industries to make smarter, faster, and more informed
decisions. To find patterns and achieve such insights, expertise in the relevant domain is required.
For example, with expertise in healthcare, a data scientist can predict patient risks and suggest
personalized treatments.
2. Definition of Data analytics
Data analytics is a systematic approach that transforms raw data into valuable
insights. It is the process of examining, cleaning, transforming, and
modelling raw data to extract useful information, draw conclusions, and support
decision making.
Although data analytics finds extensive application in the finance industry, its utility is not
confined to this sector alone. It is also leveraged in diverse fields such as agriculture,
banking, retail, and government, among others, underscoring its universal relevance
and impact. Thus, data analytics serves as a powerful tool for driving informed
decisions and fostering growth across various industries.
3. Roles in data analytics
The following entry-level roles are available in data analytics:
Junior Data Analyst
Junior Data Scientist
Associate Data Analyst
The following roles are available at the experienced level:
Data Analyst
Data Architect
Data Engineer
Data Scientist
Marketing Analyst
Business Analyst
4. Life cycle of data analytics
The life cycle of data analytics provides a
framework for the best performance of each
phase, from the creation of the project until its
completion.
The data analytics life cycle is a process that consists
of 6 stages/phases:
i. DATA DISCOVERY
The data science team learns about and investigates
the problem.
It develops context and understanding.
It identifies the data sources needed and
available for the project.
The team formulates initial hypotheses
that can later be tested with data.
ii. DATA PREPARATION
This phase covers the steps to explore, preprocess, and condition
data before modelling and analysis.
It requires an analytic sandbox; the team extracts, loads, and
transforms data to get it into the sandbox.
Data preparation tasks are likely to be performed multiple times and not in a predefined
order.
Several tools commonly used for this phase are: Hadoop, Alpine Miner, OpenRefine,
etc.
iii. MODEL PLANNING
The team explores the data to learn about relationships between variables and subsequently
selects the key variables and the most suitable models.
Several tools commonly used for this phase are: MATLAB and STATISTICA.
iv. MODEL BUILDING
The team develops datasets for training, testing, and production purposes.
The team builds and executes models based on the work done in the model planning phase.
The team also considers whether its existing tools will suffice for running the models or whether
it needs a more robust environment for executing them.
Free or open-source tools: R and PL/R, Octave, WEKA.
Commercial tools: MATLAB and STATISTICA.
v. COMMUNICATE RESULTS
After executing the model, the team needs to compare the outcomes of modelling with the criteria
established for success and failure.
The team considers how best to articulate findings and outcomes to the various team members
and stakeholders, taking into account caveats and assumptions.
The team should identify key findings, quantify the business value, and develop a narrative to
summarize and convey the findings to stakeholders.
vi. OPERATIONALIZE
The team communicates the benefits of the project more broadly and sets up a pilot project to
deploy the work in a controlled way before broadening it to the full enterprise of users.
This approach enables the team to learn about the performance and related constraints of the
model in a production environment on a small scale and to make adjustments before full deployment.
The team delivers the final reports, briefings, and code.
Free or open-source tools: Octave, WEKA, SQL, MADlib.
5. Advantages and disadvantages of data analytics
ADVANTAGES:
Data analytics offers advantages that can significantly benefit organizations.
I. Improving efficiency: It helps to analyse large amounts of data quickly and presents the results in
a structured form that supports achieving goals.
It encourages a culture of efficiency and teamwork.
By leveraging data analytics, decision-makers can access real-time information, predictive
models, and visualizations that support informed decision-making across all levels of the
organization. This leads to better strategic planning, resource allocation, risk management, and
overall business performance.
II. Improving quality of products and services:
Data analytics helps to enhance the user experience by detecting and correcting errors in tasks.
It identifies customer needs: data analytics can help businesses understand what their customers
want and need.
Data analytics can help businesses identify gaps in the market and develop new products and
services to fill those gaps.
III. Witnessing the opportunities:
Data analytics offers refined sets of data that help in spotting opportunities worth pursuing.
IV. Helps an organization to make better decisions:
It helps to transform the available data into valuable information for executives so that
better decisions can be made.
DISADVANTAGES:
I. Low quality of data: Organizations may lack access to quality data.
An organization may already have access to a lot of data, but the question is whether it
has the right data for its needs.
II. Privacy concerns: As more data is collected and processed, the risk of data breaches
increases. Collecting and analysing personal data raises privacy concerns.
III. Implementation costs: Implementing and maintaining the infrastructure for data analytics
can be expensive. Deploying data analytics tools and systems can be complex and resource
intensive, requiring expertise and investment in technology.
IV. Over-reliance on data: Relying solely on data analytics for decision making can overlook
qualitative factors and human judgment, potentially leading to misguided strategies or actions.
6. Difference between Data Analytics and Data Analysis:
| S.No. | Data Analytics | Data Analysis |
|---|---|---|
| 1. | It is described as a traditional or generic form of analytics. | It is described as a particularized form of analytics. |
| 2. | It includes several stages, such as the collection of data followed by the inspection of business data. | To process data, raw data is first defined in a meaningful manner, then data cleaning and conversion are done to get meaningful information from the raw data. |
| 3. | It supports decision making by analysing enterprise data. | It analyses the data by focusing on insights into business data. |
| 4. | It uses various tools to process data, such as Tableau, Python, Excel, etc. | It uses different tools to analyse data, such as RapidMiner, OpenRefine, NodeXL, KNIME, etc. |
| 5. | Descriptive analysis cannot be performed on this. | Descriptive analysis can be performed on this. |
| 6. | One can find anonymous relations with the help of this. | One cannot find anonymous relations with the help of this. |
| 7. | It does not deal with inferential analysis. | It supports inferential analysis. |
7. Types of data analytics:
i. Descriptive analytics:
Descriptive analytics looks at data and analyses past events for insight into how to approach
future events.
It examines past performance by mining historical data to
understand the causes of past success or failure.
Almost all management reporting, such as sales, marketing, operations, and finance, uses this
type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify
customers or prospects into groups.
Unlike a predictive model, which focuses on predicting the behaviour of a single
customer, descriptive analytics identifies many different relationships between customers and
products.
Common examples of descriptive analytics are company reports that provide historical reviews,
such as:
Data Queries
Reports
Descriptive Statistics
Data dashboard
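The descriptive statistics mentioned above can be sketched with Python's standard `statistics` module. This is a minimal illustration only; the `monthly_sales` figures are hypothetical and not taken from any real report.

```python
import statistics

# Hypothetical monthly sales figures (units sold); not taken from the notes.
monthly_sales = [120, 135, 150, 110, 160, 145]

# Summarize what has already happened: the core of descriptive analytics.
summary = {
    "count": len(monthly_sales),
    "mean": statistics.mean(monthly_sales),
    "median": statistics.median(monthly_sales),
    "min": min(monthly_sales),
    "max": max(monthly_sales),
    "stdev": statistics.stdev(monthly_sales),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A report or dashboard would typically present such summaries per period, product, or region.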
ii. Diagnostic analytics:
In this analysis, we generally use historical data in preference to other data to answer a question or
solve a problem.
We look for dependencies and patterns in the historical data related to the particular problem.
For example, companies use this analysis because it gives great insight into a problem,
provided they keep detailed information at their disposal; otherwise, data would have to be
collected separately for every problem, which is very time-consuming.
Common techniques used for diagnostic analytics are:
Data discovery
Data mining
Correlations
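As a small illustration of the correlation technique listed above, the following sketch computes the Pearson correlation coefficient between two historical series. The function name `pearson_correlation` and the `ad_spend` and `sales` values are hypothetical, chosen only for the example.

```python
import math

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical historical data: advertising spend and sales per month.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 62, 79, 103]

r = pearson_correlation(ad_spend, sales)
print(f"correlation between ad spend and sales: {r:.3f}")
```

A value close to 1 or -1 points to a strong linear dependency worth investigating; on its own it does not prove that one variable causes the other.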
iii. Predictive analytics:
Predictive analytics turns data into valuable, actionable information. It
uses data to determine the probable outcome of an event or the likelihood of a situation
occurring. Predictive analytics draws on a variety of statistical techniques from
modelling, machine learning, data mining, and game theory that analyse current and
historical facts to make predictions about future events. Techniques that are used for
predictive analytics are:
Linear Regression
Time Series Analysis and Forecasting
Data Mining
The basic cornerstones of predictive analytics are:
Predictive modelling
Decision analysis and optimization
Transaction profiling
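To make the idea of predicting from historical facts concrete, here is a minimal sketch of linear regression used as a forecast: a least-squares line is fitted to past values and extended one step ahead. The helper `fit_line` and the `months` and `sales` data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales observed in months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [100, 110, 120, 130, 140]

slope, intercept = fit_line(months, sales)

# Extend the fitted trend to the next, not yet observed, month.
forecast_month_6 = slope * 6 + intercept
print(f"forecast for month 6: {forecast_month_6:.1f}")
```

Real predictive models usually involve more features and a proper evaluation on held-out data; the sketch only shows the basic mechanism.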
iv. Prescriptive analytics:
Prescriptive analytics automatically synthesizes big data, mathematical science, business
rules, and machine learning to make a prediction and then suggests decision options to take
advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions
that benefit from the predictions and showing the decision maker the implications of each
decision option. Prescriptive analytics anticipates not only what will happen and when it will
happen but also why it will happen. Further, prescriptive analytics can suggest decision
options on how to take advantage of a future opportunity or mitigate a future risk and
illustrate the implications of each decision option.
For example, prescriptive analytics can benefit healthcare strategic planning by using
analytics to leverage operational and usage data combined with data on external factors such
as economic data, population demography, etc.
8. Mechanistic data analytics:
Mechanistic data analysis is a scientific method used to identify the causal relationships
between two variables. This approach focuses on studying how changes in one variable affect
another variable in a deterministic way.
Mechanistic data analysis is widely used in engineering studies. Engineers use this method to
analyse how changes in a system’s components affect its overall performance and then develop
mathematical models based on their observations to predict the system’s performance under
different conditions. For example, engineers might measure how changes in engine design
parameters such as piston size, fuel injection rate, exhaust pressure, or number of cylinders
affect engine power output or fuel efficiency to help optimize engine designs for improved
performance and efficiency in vehicles.
9. Mathematical model:
A mathematical model in data analytics is a representation of data relationships using
mathematical concepts and equations. It helps analysts understand and predict data, and
make decisions about future events.
i. Occam’s razor:
What is Occam’s razor?
Occam’s razor is a law of parsimony popularly stated (in William of Occam’s words) as “Plurality must
never be posited without necessity.” Alternatively, as a heuristic, it says that when
there are multiple hypotheses to solve a problem, the simpler one is to be preferred. It is not
clear to whom this principle can be conclusively attributed, but William of Occam’s (c.
1287 – 1347) preference for simplicity is well documented. Hence this principle goes by the
name “Occam’s razor.” Applying it often means cutting off or shaving away other possibilities or
explanations, hence the “razor” in the name of the principle. It should be noted that the competing
explanations or hypotheses should lead to the same result.
Relevance of Occam’s razor.
There are many settings that favour a simpler approach, either as an inductive bias or as a
constraint to begin with. Some of them are:
Studies whose results suggest that preschoolers are sensitive to
simpler explanations during their initial years of learning and development.
A preference for simpler approaches and explanations to achieve the same goal is seen in
various facets of the sciences; for instance, the parsimony principle applied to
the understanding of evolution.
In theology, ontology, epistemology, etc., this view of parsimony is used to derive
various conclusions.
Variants of Occam’s razor are used in knowledge discovery.
ii. Bias-variance trade-offs:
What is Bias?
Bias is the difference between the values predicted by the machine learning model and the correct
values. High bias gives a large error on training as well as testing data. It is recommended that an
algorithm have low bias to avoid the problem of underfitting. With high bias, the predictions follow
a straight line and therefore do not fit the data in the data set accurately. Such fitting is
known as underfitting. This happens when the hypothesis is too simple or linear in nature.
Refer to the accompanying graph for an example of such a situation.
What is Variance?
Variance is the variability of the model's prediction for a given data point; it tells us how
much the predictions spread when the training data changes. A model with high variance fits the
training data in a very complex way and is therefore unable to fit accurately data that it has
not seen before. As a result, such models perform very well on training data but
have high error rates on test data. A model with high variance is said to overfit the data.
Overfitting means fitting the training set accurately with a complex curve and a high-order
hypothesis; it is not a solution, because the error on unseen data is high. While
training a model, variance should be kept low. The accompanying graph shows what a high-variance
fit looks like.
Bias Variance Trade-off
If the algorithm is too simple (a hypothesis with a linear equation), it may have high bias and low
variance and is thus error-prone. If the algorithm is too complex (a hypothesis with a high-degree
equation), it may have high variance and low bias. In the latter condition, the model will not
perform well on new entries. There is a middle ground between these two conditions, known as the
bias-variance trade-off. It arises because an algorithm cannot be more complex and less
complex at the same time: adding complexity to reduce bias tends to increase variance, and vice
versa. The accompanying graph shows what a good trade-off looks like.
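The trade-off can be illustrated with a small experiment. The sketch below uses a k-nearest-neighbour predictor (a technique swapped in here for illustration, not one named in these notes): k = 1 is a very flexible, high-variance model, while k equal to the size of the training set always predicts the overall mean and is a high-bias model. The data, noise pattern, and function names are all hypothetical.

```python
def knn_predict(train_x, train_y, x, k):
    """Predict y at x as the mean of the k nearest training targets."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN predictor on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

# Hypothetical data: the true relationship is y = x^2. Training targets carry
# alternating noise of +4 / -4; test targets are noise-free.
train_x = list(range(10))
train_y = [x ** 2 + (4 if x % 2 == 0 else -4) for x in train_x]
test_x = [x + 0.5 for x in range(9)]
test_y = [x ** 2 for x in test_x]

results = {}
for k in (1, 3, len(train_x)):
    train_error = mse(train_x, train_y, train_x, train_y, k)
    test_error = mse(train_x, train_y, test_x, test_y, k)
    results[k] = (train_error, test_error)
    print(f"k={k:2d}  train MSE={train_error:8.2f}  test MSE={test_error:8.2f}")
```

On this data, k = 1 reproduces the training set exactly yet does worse on unseen points than k = 3, and the constant predictor (k = 10) does worst of all, which mirrors the overfitting, balanced, and underfitting cases described above.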
10. Taxonomy model
Data taxonomy is a way of organizing and classifying data to create a structured hierarchy. It
helps businesses categorize their data to access and use it easily.
Information is grouped according to its characteristics, attributes, and relationships and placed
into categories and subcategories.
There are typically multiple levels or layers in a data taxonomy, each level representing a
specific category or class. Top-level categories are broader, while lower levels are more
granular. Organizations can custom-build their taxonomy structure based on their needs and
the nature of the data.
You can apply taxonomy to various data types, including structured data (databases and
spreadsheets) and unstructured data (documents and multimedia files).
To help understand how data taxonomy works, let’s consider an example for an
ecommerce company.
Top-level categories may contain:
Products
Customers
Orders
Marketing
Inventory
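One way to picture such a hierarchy is as a nested dictionary. In the sketch below the top-level categories come from the list above, while every subcategory is a hypothetical illustration and not part of these notes.

```python
# Top-level categories come from the list above; every subcategory below is a
# hypothetical illustration.
taxonomy = {
    "Products": {"Electronics": {"Phones": {}, "Laptops": {}}, "Clothing": {}},
    "Customers": {"Retail": {}, "Wholesale": {}},
    "Orders": {},
    "Marketing": {},
    "Inventory": {},
}

def paths(tree, prefix=()):
    """Yield every category path from the top level down to each node."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield path
        yield from paths(children, path)

for path in paths(taxonomy):
    print(" > ".join(path))
```

Broader categories appear as short paths and more granular ones as longer paths, matching the levels described above.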
11. Baseline model:
A baseline model in data analytics is a simple model that serves as a reference point for
more complex models. It's used to evaluate the performance of advanced models and to
determine if they're better.
Why use a baseline model?
Baseline models help data scientists judge how well complex models perform.
They can help identify issues with data quality, algorithms, features, or hyperparameters.
They can help determine whether a more complex model is necessary.
Importance of Baseline Models
They help detect overfitting
They serve as a basis for further model creation
They simplify the model development process
They help identify data quality issues
They provide a benchmark for model efficiency
Baseline model for classification
A baseline model in classification is a simple model used as a reference point to evaluate the
performance of more complex models. It helps to establish a minimum level of performance
that any advanced model should exceed.
Here are some common approaches to creating a baseline model for classification tasks:
1. Uniform or random selection among the labels
2. The most common label appearing in the training data
3. The most accurate single-feature model
4. Majority class classifier (always predicts the most frequent class, as in approach 2)
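Two of these baselines, random selection and the most common label, can be sketched in a few lines. The labels, the fixed random seed, and the function names are hypothetical and serve only to make the example repeatable.

```python
import random
from collections import Counter

def majority_label(train_labels):
    """Return the most common label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical labels for a spam filter.
train_labels = ["ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
test_labels = ["ham", "spam", "ham", "ham", "spam"]

# Baseline 1: choose uniformly at random among the known labels.
rng = random.Random(0)  # fixed seed so the run is repeatable
label_set = sorted(set(train_labels))
random_predictions = [rng.choice(label_set) for _ in test_labels]

# Baseline 2: always predict the most common training label.
majority = majority_label(train_labels)
majority_predictions = [majority] * len(test_labels)

print("random baseline accuracy:  ", accuracy(random_predictions, test_labels))
print("majority baseline accuracy:", accuracy(majority_predictions, test_labels))
```

Any more advanced classifier should beat both numbers before its extra complexity is considered worthwhile.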
Baseline model for value prediction
I. Mean or Median
Mean Predictor: This model predicts the average value of the target variable across the
training dataset.
Median Predictor: This model predicts the median value of the target variable, which is
less sensitive to outliers.
II. Linear Regression
A simple linear regression model predicts the target variable as a linear combination
of the input features.
Suitable for datasets where there is a linear relationship between the features and the
target variable.
Linear regression is a data analysis technique that uses known data values to predict
an unknown value. It is used to model the relationship between two or more variables
as a linear equation.
III. Value of the previous point in time
This model predicts the target variable based on the value from the previous time
point. This is particularly useful in time series forecasting.
Ideal for time series data where the previous value is a good predictor of the current
value.
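The mean, median, and previous-value baselines can be sketched as follows. The `history` series is hypothetical and deliberately contains one outlier to show why the median is less sensitive than the mean.

```python
import statistics

# Hypothetical daily demand series with one outlier (90).
history = [20, 22, 21, 90, 23, 24]

mean_prediction = statistics.mean(history)      # pulled up by the outlier
median_prediction = statistics.median(history)  # less sensitive to the outlier
previous_value_prediction = history[-1]         # naive time-series forecast

print("mean baseline:          ", mean_prediction)
print("median baseline:        ", median_prediction)
print("previous-value baseline:", previous_value_prediction)
```

Each of the three values is a candidate prediction for the next observation; a more complex model is justified only if it produces a lower error than these.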
12. Model evaluation
I. Metrics for evaluating classifiers:
A matrix is a set of numbers or objects arranged in rows and columns.
The advantage of matrices is that they can often simplify the representation of large amounts
of data or relationships.
- Evaluating a classifier means measuring how accurately the predicted labels match
the gold-standard labels in the evaluation set.
- For the common case of two distinct labels or classes (binary classification), we typically
call the smaller and more interesting of the two classes positive and the larger/other class
negative.
- In a spam classification problem, spam would typically be positive and non-spam would be
negative.
This labelling aims to ensure that identifying the positives is at least as hard as
identifying the negatives, although often the test instances are selected so that the
classes are of equal cardinality.
There are four possible results of what the classification model could do on any given
instance, which define the confusion matrix or contingency table shown in the accompanying figure.
A confusion matrix contains information about the actual and predicted classifications
made by a classifier.
A confusion matrix is a table that is often used to describe the performance of a
classification model (or "classifier") on a set of test data for which the true values are
known.
The confusion matrix itself is relatively simple to understand, but the related
terminology can be confusing.
A confusion matrix is also known as an error matrix.
A confusion matrix is a technique for summarizing the performance of a classification
algorithm.
A confusion matrix is a table with two dimensions, "Actual" and
"Predicted"; its cells are "True Positives (TP)", "True
Negatives (TN)", "False Positives (FP)", and "False Negatives (FN)".
True Positive:
Interpretation: You predicted positive and it is true.
You predicted that a woman is pregnant and she actually is.
True Negative:
Interpretation: You predicted negative and it’s true.
You predicted that a man is not pregnant and he actually
is not.
False Positive: (Type 1 Error)
Interpretation: You predicted positive and it’s false.
You predicted that a man is pregnant but he actually is
not.
False Negative: (Type 2 Error)
Interpretation: You predicted negative and it’s false.
You predicted that a woman is not pregnant but she actually is.
Just remember: Positive and Negative describe the predicted class, while True and False say
whether that prediction matches the actual class.
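The four outcomes can be counted directly from a list of actual and predicted labels. This is a minimal sketch; the function name `confusion_counts` and the example labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, TN, FP and FN for a binary classifier."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: true if the actual class is also positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: false if the actual class is positive.
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Hypothetical gold-standard and predicted labels.
actual    = ["spam", "spam", "ham", "ham",  "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham",  "ham", "spam", "spam", "ham", "ham", "ham"]

counts = confusion_counts(actual, predicted, positive="spam")
print(counts)
```

The four counts always add up to the number of test instances, which is a quick sanity check when building the table by hand.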
II. Statistical measures for classifiers:
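The measures most commonly derived from the four confusion-matrix counts are accuracy, precision, recall, and the F1 score. The sketch below assumes these are the measures this heading refers to, since the notes do not list them; the function name and the counts are hypothetical.

```python
def classifier_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for illustration.
metrics = classifier_metrics(tp=2, tn=4, fp=1, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision asks how many predicted positives were right, recall asks how many actual positives were found, and F1 is their harmonic mean.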
13. Class imbalance
Class imbalance (CI) in classification problems arises when the number of observations
belonging to one class is much lower than the number belonging to the other. Ensemble learning
combines multiple models to obtain a robust model and has been prominently used with data
augmentation methods to address class imbalance problems.
Problems with imbalanced data in classification
Algorithms may become biased towards the majority class and thus tend to predict
the majority class.
Minority class observations look like noise to the model and are ignored by it.
An imbalanced dataset gives a misleading accuracy score.
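The misleading accuracy score is easy to reproduce. In the sketch below, a hypothetical test set has 95 negatives and 5 positives, and the "classifier" simply predicts the majority class every time.

```python
# Hypothetical imbalanced test set: 95 negatives (0) and 5 positives (1).
actual = [0] * 95 + [1] * 5

# A "classifier" that always predicts the majority class.
predicted = [0] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(a == 1 for a in actual)

print(f"accuracy = {accuracy:.2f}")  # high, although no positive is ever found
print(f"recall   = {recall:.2f}")
```

Accuracy looks excellent while recall on the minority class is zero, which is why imbalanced problems are usually judged with measures other than accuracy alone.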
14. What is the AUC-ROC curve?
The AUC-ROC curve, or Area Under the Receiver Operating Characteristic curve, is a
graphical representation of the performance of a binary classification model at various
classification thresholds. It is commonly used in machine learning to
assess the ability of a model to distinguish between two classes, typically
the positive class (e.g., presence of a disease) and the negative class (e.g.,
absence of a disease).
Let us first understand the meaning of the two terms ROC and AUC.
ROC: Receiver Operating Characteristic
AUC: Area Under the Curve
Receiver Operating Characteristic (ROC) Curve
ROC stands for Receiver Operating Characteristic, and the ROC curve
is the graphical representation of the effectiveness of the binary classification model. It plots
the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds.
Area Under the Curve (AUC):
AUC stands for Area Under the Curve, and it is the area under the
ROC curve. It measures the overall performance of the binary classification model. As both
TPR and FPR range from 0 to 1, the area always lies between 0 and 1, and a greater
value of AUC denotes better model performance. Our main goal is to maximize this area so as to
have a high TPR and a low FPR across thresholds. The AUC measures the probability
that the model will assign a randomly chosen positive instance a higher predicted probability
than a randomly chosen negative instance.
It represents the probability with which our model can distinguish between the two classes
present in our target.
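This probabilistic reading of the AUC can be computed directly by comparing every positive instance with every negative instance. The sketch is a minimal illustration; the function name `auc_by_pairs` and the scores and labels are hypothetical, and ties are counted as half a correct pair.

```python
def auc_by_pairs(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1,   1,   0,   1,   0,   0]

pairwise_auc = auc_by_pairs(scores, labels)
print(f"AUC = {pairwise_auc:.3f}")
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9; a model that ranks every positive above every negative would score 1, and random scoring would give about 0.5.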
Key terms used in AUC and ROC Curve
1. TPR and FPR
The ROC curve is a graph that shows the performance of a classification
model at all possible thresholds (a threshold is a particular value beyond which a point is said
to belong to a particular class). The curve is plotted between two parameters:
TPR: True Positive Rate
FPR: False Positive Rate
Before looking at TPR and FPR, let us quickly revisit the confusion matrix.
True Positive: actual positive and predicted as positive
True Negative: actual negative and predicted as negative
False Positive (Type I Error): actual negative but predicted as positive
False Negative (Type II Error): actual positive but predicted as negative
In simple terms, you can call a False Positive a false alarm and a False Negative a miss.
Now let us look at what TPR and FPR are.
2. Sensitivity / True Positive Rate / Recall
TPR (also called recall or sensitivity) is the proportion of actual positive instances that are
correctly identified by the model as positive. It represents the ability of the model to
correctly identify positive instances and is calculated as follows:
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
3. False Positive Rate
FPR is the proportion of actual negative instances that are incorrectly classified as positive.
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
4. Specificity
Specificity measures the proportion of actual negative
instances that are correctly identified by the model as negative.
It represents the ability of the model to correctly identify
negative instances.
$$\mathrm{Specificity} = \frac{TN}{TN + FP} = 1 - \mathrm{FPR}$$
As stated earlier, the ROC curve is the plot of TPR against FPR
across all possible thresholds, and the AUC is the entire
area beneath this ROC curve.
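Putting the pieces together, the sketch below computes one (FPR, TPR) point per threshold and then approximates the area under the resulting curve with the trapezoidal rule. The function names and the scores and labels are hypothetical; an instance is predicted positive when its score is at or above the threshold, which is an assumption of this example.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    total_pos = sum(labels)
    total_neg = len(labels) - total_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / total_neg, tp / total_pos))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

# Hypothetical predicted probabilities and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = trapezoid_area(points)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}  specificity={1 - fpr:.2f}")
print(f"AUC = {auc:.3f}")
```

The curve starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is; the area between those end points is the AUC.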