STATISTICS AND MACHINE LEARNING
Q1) What are correlation and covariance in statistics?
Answer: Covariance and correlation are two mathematical concepts widely used in statistics. Both establish the relationship between, and measure the dependency of, two random variables. Although they do similar work in mathematical terms, they differ from each other.
Correlation: Correlation is considered the best technique for measuring and estimating the quantitative relationship between two variables. Correlation measures how strongly two variables are related, on a standardized scale from -1 to 1.
Covariance: Covariance is a measure of how two random variables vary together, i.e. the extent to which they change in tandem. It is a statistical term describing the systematic relation between a pair of random variables, wherein a change in one variable is accompanied by a corresponding change in the other.
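As a minimal sketch of the two measures (assuming NumPy is available; the numbers are made up purely for illustration):

import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

# Covariance: how the two variables vary together, in the variables' own units.
cov_xy = np.cov(x, y)[0, 1]

# Correlation: covariance scaled by both standard deviations, so it lies in [-1, 1].
corr_xy = np.corrcoef(x, y)[0, 1]

print(cov_xy, corr_xy)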
Q2) What is a p-value? Explain it.
Answer: When we run a hypothesis test in statistics, the p-value helps us determine the significance of our results. Hypothesis tests are used to test the validity of a claim made about a population. The null hypothesis states that there is no significant difference between the sample and the specified population, and that any observed difference is due to sampling or experimental error.
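As a hedged illustration (assuming SciPy; the sample values are hypothetical), a one-sample t-test returns a p-value directly:

import numpy as np
from scipy import stats

# Null hypothesis: the population mean is 50.
sample = np.array([51.2, 49.8, 52.3, 50.9, 48.7, 53.1, 50.4])
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

# A small p-value (commonly below 0.05) is taken as evidence against the null.
print(t_stat, p_value)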
Q3) How is KNN different from k-means clustering?
Answer: K-Nearest Neighbours is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that for K-Nearest Neighbours to work, you need labelled data into which to classify an unlabelled point (thus the "nearest neighbours" part). K-means clustering requires only a set of unlabelled points and the number of clusters k: the algorithm takes unlabelled points and gradually learns to cluster them into groups by repeatedly assigning each point to the nearest cluster mean and recomputing the means.
The critical difference here is that KNN needs labelled points and is thus supervised learning, while k-means doesn't, and is thus unsupervised learning.
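A minimal side-by-side sketch (assuming scikit-learn; the points and labels are toy data):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [8, 8], [9, 10]])
y = np.array([0, 0, 1, 1])  # labels are needed only by KNN

# Supervised: KNN classifies a new point from its labelled neighbours.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 3]]))

# Unsupervised: k-means groups the same points with no labels, only k.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)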
Q4) Difference between Type 1 and Type 2 errors?
Type 1 - Rejecting the null hypothesis even when it is true. Also called a False Positive: a false alarm, where a condition is reported as fulfilled even when it is not.
Type 2 - Accepting the null hypothesis even when it is false. Also called a False Negative.
Q5) Is rotation necessary in PCA?
Rotation (orthogonal) is necessary because it maximizes the difference between the variance captured by the components. This makes the components easier to interpret. Not to forget, that is the motive of doing PCA, where we aim to select fewer components (than features) that can explain the maximum variance in the data set. Rotation does not change the relative locations of the components; it only changes the actual coordinates of the points.
If we don't rotate the components, the effect of PCA will diminish and we will have to select more components to explain the variance in the data set.
Q6) Explain what precision and recall are. How do they relate to the ROC curve?
Answer: Recall describes what percentage of actual positives the model labels as positive. Precision describes what percentage of the model's positive predictions were correct. The ROC curve shows the relationship between recall (the true positive rate) and the false positive rate, which is one minus specificity; specificity measures the percentage of true negatives the model labels as negative. Recall, precision, and the ROC curve are measures used to judge how useful a given classification model is.
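A small sketch of how these quantities are computed in practice (assuming scikit-learn; the labels and scores below are invented for illustration):

from sklearn.metrics import precision_score, recall_score, roc_curve

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual classes
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]    # predicted probabilities

# precision = TP / (TP + FP); recall (sensitivity) = TP / (TP + FN)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))

# The ROC curve sweeps the decision threshold across the scores.
fpr, tpr, thresholds = roc_curve(y_true, y_score)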
Q7) What is root cause analysis?
Answer: All of us dread that meeting where the boss asks, 'why is revenue down?' The only
thing worse than that question is not having any answers! There are many changes happening
in your business every day, and, often, you will want to understand exactly what is driving a
given change — especially if it is unexpected. Understanding the underlying causes of
change is known as root cause analysis.
Q8) What is selection bias?
Answer: Selection bias occurs, in an “active” sense, when the sample data gathered and prepared for modelling has characteristics that are not representative of the true, future population of cases the model will see. That is, active selection bias occurs when a subset of the data is systematically (i.e., non-randomly) excluded from analysis.
Q9) What is sampling?
Answer: Data sampling is a statistical analysis technique used to select, manipulate and
analyze a representative subset of data points to identify patterns and trends in the larger data
set being examined. It enables data scientists, predictive modelers and other data analysts to
work with a small, manageable amount of data about a statistical population to build and run
analytical models more quickly, while still producing accurate findings.
Q10) What is Linear Regression?
Answer: Linear regression is a statistical technique where the score of a variable Y is
predicted from the score of a second variable X. X is referred to as the predictor variable and
Y as the criterion variable.
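A minimal sketch of fitting such a model by least squares (assuming NumPy; the data are invented):

import numpy as np

X = np.array([1, 2, 3, 4, 5])              # predictor
Y = np.array([2.1, 4.2, 6.1, 7.9, 10.2])   # criterion

# Fit Y ≈ slope * X + intercept.
slope, intercept = np.polyfit(X, Y, deg=1)
print(slope, intercept)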
Q11) What is Interpolation and Extrapolation?
Answer: Interpolation is estimating a value that lies between two known values in a list of values. Extrapolation is approximating a value by extending a known set of values or facts beyond the range of the data.
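A quick numeric illustration (assuming NumPy; the values are invented):

import numpy as np

x_known = [1, 3]
y_known = [10, 30]

# Interpolation: estimate y at x = 2, inside the known range.
print(np.interp(2, x_known, y_known))  # 20.0

# Extrapolation: extend the same linear trend beyond the known range
# (np.interp clips at the boundary, so the extension is done by hand).
slope = (y_known[1] - y_known[0]) / (x_known[1] - x_known[0])
print(y_known[1] + slope * (4 - x_known[1]))  # estimate at x = 4 -> 40.0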
Q12) What is power analysis?
Answer: An experimental design technique for determining the sample size required to detect an effect of a given size with a given level of power and significance.
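A sketch of how this is typically done in practice (assuming statsmodels; the effect size, power, and alpha below are conventional example values):

from statsmodels.stats.power import TTestIndPower

# Sample size per group needed to detect a medium effect (d = 0.5)
# with 80% power at the 5% significance level.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(n_per_group)  # roughly 64 per group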
Q13) What is Principal Component Analysis (PCA)?
Answer: PCA is a method for transforming the features in a dataset by combining them into uncorrelated linear combinations. These new features, or principal components, sequentially maximize the variance represented (i.e. the first principal component has the most variance, the second principal component has the second most, and so on). As a result, PCA is useful for dimensionality reduction because you can set an arbitrary variance cutoff.
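As a minimal sketch with scikit-learn (the data are random and stand in for a real feature matrix):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 10)                # hypothetical: 100 rows, 10 features
X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

# A float n_components sets the variance cutoff: keep just enough
# components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)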
Q14) What is the ROC Curve and what is AUC (a.k.a. AUROC)?
Answer: The ROC (receiver operating characteristic) curve is a performance plot for binary classifiers of True Positive Rate (y-axis) vs. False Positive Rate (x-axis). AUC is the area under the ROC curve, and it is a common performance metric for evaluating binary classification models. It is equivalent to the expected probability that a uniformly drawn random positive is ranked before a uniformly drawn random negative.
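A one-line computation of AUC (assuming scikit-learn; the labels and scores are invented):

from sklearn.metrics import roc_auc_score

y_true  = [1, 0, 1, 1, 0]
y_score = [0.9, 0.3, 0.8, 0.6, 0.4]  # predicted probabilities of class 1

# AUC: probability a random positive is ranked above a random negative.
print(roc_auc_score(y_true, y_score))  # 1.0 here: every positive outranks every negative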
Q15) What is bagging?
Answer: Bagging, or Bootstrap Aggregating, is an ensemble method in which multiple subsets of the dataset are first created by bootstrap resampling (sampling with replacement). Each subset is then used to train a model, and the final predictions are made by voting or averaging across the component models. Bagging is performed in parallel.
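A minimal sketch (assuming scikit-learn; the dataset is synthetic):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)  # toy data

# 50 trees, each trained on its own bootstrap sample (in parallel);
# the final prediction is a majority vote over the component trees.
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),  # base_estimator in older scikit-learn
                        n_estimators=50, n_jobs=-1, random_state=0)
bag.fit(X, y)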
Q16) You are given a data set in which some variables have more than 30% missing values. Let's say, out of 50 variables, 8 variables have more than 30% missing values. How will you deal with them?
● Assign a unique category to the missing values; who knows, the missing values might uncover some trend.
● We can remove them outright.
● Or, we can sensibly check their distribution against the target variable and, if we find any pattern, keep those missing values and assign them a new category while removing the others.
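A quick pandas sketch of the first two options (the file name and threshold are hypothetical):

import pandas as pd

df = pd.read_csv("data.csv")         # hypothetical data set with 50 variables
missing_frac = df.isnull().mean()    # fraction of missing values per column

# Option 1: drop columns with more than 30% missing values outright.
df_dropped = df.drop(columns=missing_frac[missing_frac > 0.30].index)

# Option 2: keep them and flag missingness as its own category
# (illustrative; numeric columns would become object dtype here).
df_flagged = df.fillna("MISSING")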
Q17) How do you map nicknames (Pete, Andy, Nick, Rob, etc.) to real names?
This problem can be solved in any number of ways. Let's assume that you're given a data set containing thousands of Twitter interactions. You would begin by studying the relationship between two people by carefully analysing the words used in the tweets.
● This kind of problem statement can be solved by implementing Text Mining with Natural Language Processing techniques, wherein each word in a sentence is broken down and correlations between the various words are found.
● NLP is actively used for understanding customer feedback and performing sentiment analysis on Twitter and Facebook. Thus, one way to solve this problem is through Text Mining and Natural Language Processing techniques.
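Before reaching for NLP, a simple rule-based baseline is often worth sketching; the lookup table below is purely illustrative, not an exhaustive mapping:

NICKNAMES = {
    "pete": "Peter",
    "andy": "Andrew",
    "nick": "Nicholas",
    "rob": "Robert",
}

def to_real_name(name):
    # Fall back to the original name when no mapping is known.
    return NICKNAMES.get(name.lower(), name)

print(to_real_name("Pete"))  # Peter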
Q18) Suppose you found that your model is suffering from low bias and high variance. Which algorithm do you think could tackle this situation, and why?
Approach 1: Tackle high variance with bagging.
● Low bias occurs when the model's predicted values are near the actual values.
● In this case, we can use a bagging algorithm (e.g. Random Forest) to tackle the high variance problem.
● A bagging algorithm divides the data set into subsets through repeated randomized sampling.
● Once divided, these samples are used to generate a set of models with a single learning algorithm. The model predictions are then combined using voting (classification) or averaging (regression).
Approach 2: Tackle high variance with regularization.
● Lower the model complexity with a regularization technique, in which larger model coefficients are penalized.
● You can also use the top n features from a variable importance chart. It may be that, with all the variables in the data set, the algorithm has difficulty finding the meaningful signal. A sketch of both approaches follows.
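A minimal sketch of both remedies (assuming scikit-learn; the hyperparameter values are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# Approach 1: a random forest averages many decorrelated trees,
# which lowers variance without raising bias much.
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Approach 2 (regression flavour): ridge regression penalizes large
# coefficients, reducing model complexity and hence variance.
ridge = Ridge(alpha=1.0)  # larger alpha = stronger penalty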
Q19) What are the branches of statistics?
Answer: The two main branches of statistics are descriptive statistics and inferential
statistics.
Descriptive statistics: Descriptive statistics summarizes the data from a sample using indexes such as the mean or standard deviation. Descriptive methods include displaying, organizing, and describing the data.
Inferential statistics: Inferential statistics draws conclusions from data that are subject to random variation, such as observation errors and sample variation.
Q20) What is a linear regression in statistics?
Answer: Linear regression is one of the statistical techniques used in predictive analysis; the technique identifies the strength of the impact that the independent variables have on the dependent variable.
Q22) What is Data Science, and what is the relationship between Data Science and Statistics?
Answer: Data Science is simply data-driven science; it is an interdisciplinary field of automated scientific methods, algorithms, systems, and processes that extracts insights and knowledge from data in any form, structured or unstructured. Data Science and Data Mining are similar in that both extract useful information from data.
Data Science combines mathematical statistics with computer science and its applications. By combining aspects of statistics, visualization, applied mathematics, and computer science, Data Science turns vast amounts of data into insights and knowledge.
Statistics is one of the main components of Data Science. Statistics is the branch of mathematics dealing with the collection, analysis, interpretation, organization, and presentation of data.
Q23) What is a Sample in Statistics, and what are the sampling methods?
Answer: In a statistical study, a sample is a set or portion of data collected or processed from a statistical population by a structured, defined procedure; the elements within the sample are known as sample points.
Below are the 4 sampling methods:
● Cluster sampling: In the cluster sampling method, the population is divided into groups or clusters.
● Simple random sampling: This sampling method follows a purely random selection.
● Stratified sampling: In stratified sampling, the data is divided into groups, or strata.
● Systematic sampling: The systematic sampling method picks every kth member of the population.
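Two of the methods are easy to sketch with pandas (the population below is hypothetical):

import pandas as pd

df = pd.DataFrame({"value": range(100)})  # hypothetical population of 100 members

# Simple random sampling: pick 10 members purely at random.
simple = df.sample(n=10, random_state=0)

# Systematic sampling: pick every kth member (here k = 10).
k = 10
systematic = df.iloc[::k]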
Q24) How is a decision tree pruned?
Answer: Pruning is what happens in decision trees when branches that have weak predictive power are removed, in order to reduce the complexity of the model and increase the predictive accuracy of the decision tree. Pruning can happen bottom-up or top-down, with approaches such as reduced error pruning and cost complexity pruning.
Reduced error pruning is perhaps the simplest version: replace each node with its most common class; if that does not decrease predictive accuracy, keep it pruned. While simple, this heuristic actually comes pretty close to an approach that would optimize for maximum accuracy.
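A short sketch of cost complexity pruning (assuming scikit-learn; the ccp_alpha value is illustrative):

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# ccp_alpha > 0 enables cost complexity pruning: subtrees whose accuracy
# gain does not justify their size are cut back.
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)
full = DecisionTreeClassifier(random_state=0).fit(X, y)
print(pruned.tree_.node_count, "<", full.tree_.node_count)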
Q25) What is statistical data?
Statistical data are the facts which are collected for the purpose of investigation. There are
two types of statistical data:
(i) Primary data: Data collected by an investigator for the first time, for his own purpose, are called primary data. As primary data are collected by the user of the data, they are more reliable and relevant.
(ii) Secondary data: Data collected by a secondary source and used by the investigator for his own purpose are called secondary data. For example, the score of a cricket match noted from a newspaper is secondary data.
Thus, data which are primary in the hands of one become secondary in the hands of another.
Data collected from any source can also be divided into the following two types:
(i) Raw data: Raw data are data obtained from the original source but not arranged numerically. An 'array' is an arrangement of raw numerical data in ascending or descending order of magnitude; for example, a set of raw scores arranged in ascending order gives the array
25, 32, 35, 40, 55, 62, 75, 79, 89, 96
(ii) Grouped data: An array can be placed systematically in groups or categories; for example, the array above could be grouped into class intervals.
Q26) What is Hypothesis Testing?
The theory, methods, and practice of testing a hypothesis by comparing it with the null hypothesis. The null hypothesis is rejected only if the probability of the observed result under it (the p-value) falls below a predetermined significance level, in which case the hypothesis being tested is said to have that level of significance.
Q27) What is Bivariate Distribution?
A bivariate distribution gives the probability that a certain event will occur when there are two
independent random variables in your scenario. For example, having two bowls, each filled
with two different types of candies, and pulling one candy from each bowl gives you two
independent random variables, the two different candies. Since you are pulling one candy
from each bowl at the same time, you have a bivariate distribution when calculating your
probability of ending up with particular kinds of candies.
Q29) Define precision and recall?
Ans: Recall is also known as the true positive rate: the number of positives your model claims compared to the actual number of positives in the data. Precision is also known as the positive predictive value: a measure of the number of accurate positives your model claims compared to the total number of positives it claims.
Q30) What is the difference between supervised and unsupervised machine learning?
Ans: Supervised learning requires labeled training data. For example, in order to do classification (a supervised learning task), you'll need to first label the data you'll use to train the model to classify data into your labeled groups. Unsupervised learning, in contrast, does not require labeling the data explicitly.
Q31) What is the bias-variance trade-off?
Bias is the error introduced in the model due to simplification of the machine learning algorithm. When models are trained, simplified assumptions are made to make the target function easier to learn.
Low bias machine learning algorithms: Decision Trees, k-NN, and SVM. High bias machine learning algorithms: Linear Regression, Logistic Regression.
Variance is the error introduced in the model due to a complex machine learning algorithm: the model learns noise from the training data set and performs poorly on the test data set.
As model complexity increases, there is a reduction in error due to lower bias in the model. However, this only happens up to a particular point. As you continue to make the model more complex, you end up over-fitting the model and it will start suffering from high variance.
Any model should have both low bias and low variance.
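A useful way to formalize the trade-off is the standard decomposition of expected prediction error:
Expected error = Bias² + Variance + Irreducible error
As complexity changes, shrinking one term tends to inflate the other, which is why the goal is a model where both terms are low.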
Q32) What is Ensemble Learning?
An ensemble is the art of combining a diverse set of learners (individual models) to improve the stability and predictive power of the model. Ensemble learning has many types, but the two most popular ensemble learning techniques are bagging and boosting.
Bagging trains similar learners on small sample populations and then takes a mean of all the predictions.
Boosting is an iterative technique that adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, boosting increases the weight of that observation, and vice versa.
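A minimal sketch of boosting (assuming scikit-learn; the dataset is synthetic):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# AdaBoost fits learners sequentially; each round re-weights the
# observations the previous learner misclassified.
boost = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)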
Q33) What is selection bias?
Selection bias occurs when the data points are not randomly selected. Here the sample used for the statistic is not a good representation of the population.