Interview Preparation - Machine Learning Questions & Answers
Q2: Describe the motivation behind random forests and mention two
reasons why they are better than individual decision trees?
Q3: What are the differences and similarities between gradient boosting
and random forest? and what are the advantages and disadvantages of
each when compared to each other?
Q4: What are L1 and L2 regularization? What are the differences between
the two?
Q5: What are the Bias and Variance in a Machine Learning Model and
explain the bias-variance trade-off?
Q7: Explain briefly the logistic regression model and state an example of
when you have used it recently?
Q8: Explain briefly batch gradient descent, stochastic gradient descent, and
mini-batch gradient descent. and what are the pros and cons for each of
them?
Q9: Explain what is information gain and entropy in the context of decision
trees?
Q10: Explain the linear regression model and discuss its assumptions?
Q11: Explain briefly the K-Means clustering and how can we find the best
value of K?
Q12: Define Precision, recall, and F1 and discuss the trade-off between
them?
Q13: What are the differences between a model that minimizes squared
error and the one that minimizes the absolute error? and in which cases
each error metric would be more appropriate?
Q14: Define and compare parametric and non-parametric models and give
two examples for each of them?
Q15: Explain the kernel trick in SVM and why we use it and how to choose
what kernel to use?
Q16: Define the cross-validation process and the motivation behind using
it?
Q17: You are building a binary classifier and you found that the data is
imbalanced, what should you do to handle this situation?
Q18: You are working on a clustering problem, what are different evaluation
metrics that can be used, and how to choose between them?
Q19: What is the ROC curve and when should you use it?
Q20: What is the difference between hard and soft voting classifiers in the
context of ensemble learners?
Q23: Define the curse of dimensionality and how to solve it.
Q24: In what cases would you use vanilla PCA, Incremental PCA,
Randomized PCA, or Kernel PCA?
Q25: Discuss two clustering algorithms that can scale to large datasets
Q26: Do you need to scale your data if you will be using the SVM classifier
and discuss your answer
Q27: What are Loss Functions and Cost Functions? Explain the key
Difference Between them.
Q28: What is the importance of batches in machine learning, and can you
explain some batch-dependent gradient descent algorithms?
Q29: What are the different methods to split a tree in a decision tree
algorithm?
Q33: What are the evaluation metrics that can be used for multi-label
classification?
Q34: What is the difference between concept and data drift and how to
overcome each of them?
Q35: Can you explain the ARIMA model and its components?
how to treat them. After you recognize the nature of why the outliers occurred, you
can apply one of the several methods below:
Transform the data. For example, you can do a log transformation when the
response variable follows an exponential distribution, or when it is right-skewed
(see the sketch after this list).
Use more robust error metrics such as MAE or Huber loss instead of MSE.
Remove the outliers. However, do this if you are certain that the outliers are
true anomalies not worth adding to your model. This should be your last
consideration since dropping them means losing information.
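Below is a minimal sketch of the first two options above, assuming scikit-learn; the synthetic right-skewed data, variable names, and model choices are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.exp(X @ np.array([1.0, 0.5, -0.2]) + rng.normal(scale=0.3, size=500))  # right-skewed target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Option 1: log-transform the skewed response before fitting a linear model.
lin = LinearRegression().fit(X_tr, np.log1p(y_tr))
pred = np.expm1(lin.predict(X_te))

# Option 2: use a robust loss (Huber) and report MAE instead of MSE.
huber = HuberRegressor().fit(X_tr, y_tr)
print(mean_absolute_error(y_te, pred), mean_absolute_error(y_te, huber.predict(X_te)))
```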
The motivation behind random forest or ensemble models can be explained
easily by using the following example: Let’s say we have a question to solve.
We gather 100 people, ask each of them this question, and record their
answers. After we combine all the replies we have received, we will discover
that the aggregated collective opinion will be close to the actual solution to the
problem. This is known as the “Wisdom of the crowd” which is, in fact, the
motivation behind random forests. We take weak learners (ML models), specifically
decision trees in the case of random forest, and aggregate their
results to get good predictions by removing dependency on a particular set of
features. In regression, we take the mean and for classification, we take the
majority vote of the classifiers.
Generally, you should note that no algorithm is better than the other. It always
depends on the case and the dataset used (Check the No Free Lunch
Theorem). Still, there are reasons why random forests often allow for stronger
prediction than individual decision trees:
Generally, ensemble models like random forests perform better as they are
aggregations of various models (decision trees in the case of a random
forest), using the concept of the “Wisdom of the crowd.”
The similarities between gradient boosting and random forest can be summed
up like this:
Both these algorithms are decision-tree based.
Both are also ensemble algorithms - they are flexible models and do not
need much data preprocessing.
Random forest uses Bagging. This means that trees are built in a
parallel fashion, and the results from all of them are aggregated at the
end through averaging or a majority vote. On the other hand, gradient
boosting uses Boosting, where trees are built in a sequential
fashion, and every tree tries to minimize the error of the previous one.
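As a minimal illustration, assuming scikit-learn (the dataset and hyperparameters below are arbitrary placeholders, not from the original text), the two ensembles can be compared side by side:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)      # bagging: parallel trees, majority vote
gb = GradientBoostingClassifier(n_estimators=200, random_state=0)  # boosting: sequential trees, each correcting the previous

print("RF accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("GB accuracy:", cross_val_score(gb, X, y, cv=5).mean())
```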
When we discuss the advantages and disadvantages of the two, it is only
fair to juxtapose them with both their strengths and their weaknesses. We
need to keep in mind that each of them is more applicable than the other in
certain instances, and vice versa. It depends on the outcome we want to
reach and the task we need to solve.
On the other hand, we have the advantages of random forest over gradient
boosting as well:
Due to the focus on mistakes during training iterations and the lack of
independence in tree building, gradient boosting is indeed more susceptible
to overfitting. If the data is noisy, the boosted trees might overfit and start
modeling the noise.
In gradient boosting, training might take longer because every tree is
created sequentially.
Answer:
The goal of any supervised machine learning model is to estimate the mapping
function (f) that predicts the target variable (y) given input (x). The prediction
error can be broken down into three parts:
Bias: The bias is the simplifying assumption made by the model to make the
target function easy to learn. Low bias suggests fewer assumptions made
about the form of the target function. High bias suggests more assumptions
made about the form of the target function. The smaller the bias error, the better
the model is. If, however, it is high, this means that the model is underfitting
the training data.
Variance: Variance is the amount that the estimate of the target function
will change if different training data was used. The target function is
estimated from the training data by a machine learning algorithm, so we
should expect the algorithm to have some variance. Ideally, it should not
change too much from one training dataset to the next. This means that the
algorithm is good at picking out the hidden underlying mapping between
the inputs and the output variables. If the variance error is high this
indicates that the model overfits the training data.
Irreducible error: It is the error introduced from the chosen framing of the
problem and may be caused by factors like unknown variables that
influence the mapping of the input variables to the output variable. The
irreducible error cannot be reduced regardless of what algorithm is used.
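For squared-error loss, this three-part breakdown is the standard bias-variance decomposition (written here in LaTeX notation, with \hat{f}(x) denoting the learned estimate of the true mapping f(x)):

E[(y - \hat{f}(x))^2] = (E[\hat{f}(x)] - f(x))^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2] + \sigma^2
                      = Bias^2 + Variance + Irreducible error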
Supervised machine learning algorithms aim at achieving low bias and low
variance. In turn, the algorithm should also attain good prediction performance.
The parameterization of such ML algorithms is often a battle to balance out bias
and variance. For example, if you want to predict the housing prices given a
large set of potential predictors, a model with high bias but low variance, such
as linear regression, will be easy to implement. However, it will oversimplify the
problem, and, in this context, the predicted house prices will be frequently off
from the market value but the value of the variance of these predicted prices
will be low. On the other hand, a model with low bias and high variance, such as
a neural network, will lead to predicted house prices closer to the market value,
but with predictions varying widely based on the input features.
There are several different ways to handle missing values, but here the focus will be on
the most common ones.
The first method is to delete the rows or columns that have null values. This is
an easy and fast way and leads to a robust model. However, it will cause the
loss of a lot of information depending on the amount of missing data.
Therefore, it can only be applied if the missing data represents a small
percentage of the whole dataset.
Some machine learning algorithms are quite efficient when it comes to missing
values in the dataset. The K-NN algorithm can ignore a column from a distance
measure when there are missing values. Naive Bayes can also support missing
values when making a prediction. Another algorithm that can handle a dataset
with missing values or null values is the random forest model, as it can work on
non-linear and categorical data. The problem with this method is that these
models' implementation in the scikit-learn library does not support handling
missing values, so you will have to implement it yourself.
Data imputation implies the substitution of estimated values for missing or
inconsistent data in your dataset. There are different ways of determining these
replacement values. The simplest one is to replace the missing value with the
most frequent value in the row or the column. Another simple solution is to use
the mean, median, or mode of the rest of the row or column. The advantage here
is that this is an easy and quick fix to missing data, but it might result in data
leakage and does not factor in the covariance between features. A better option
is to use an ML model to learn the pattern between the features and predict the
missing values without any data leakage, and with the covariance between the
features factored in. The only drawback here
is the computational complexity, especially for large datasets.
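A minimal sketch of both imputation styles, assuming scikit-learn; the toy DataFrame and column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31], "income": [40_000, 52_000, np.nan, 48_000]})

# Simple statistical imputation: replace missing values with the column mean
# (median / most_frequent are the other common choices).
mean_imputed = SimpleImputer(strategy="mean").fit_transform(df)

# Model-based imputation: each feature with missing values is predicted from the
# other features, which preserves the covariance between them.
model_imputed = IterativeImputer(random_state=0).fit_transform(df)
print(mean_imputed, model_imputed, sep="\n")
```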
Batch Gradient Descent:
In batch gradient descent, the whole training dataset is used to minimize the loss
function: at each step we calculate the gradient (the direction of descent) and move
toward the nearest minimum.
Pros:
Since the whole dataset is used to calculate the gradient, the updates are stable and
the algorithm reaches the minimum of the cost function without bouncing (if the
learning rate is chosen correctly).
Cons:
Since batch gradient descent uses all the training set to compute the gradient
at every step, it will be very slow especially if the size of the training data is
large.
Stochastic Gradient Descent:
In stochastic gradient descent, a single random instance from the training set is used
to compute the gradient at each step.
Pros:
1. It makes the training much faster as it only works on one instance at a time.
Cons:
Due to the stochastic (random) nature of this algorithm, this algorithm is much
less regular than the batch gradient descent. Instead of gently decreasing until
it reaches the minimum, the cost function will bounce up and down, decreasing
only on average. Over time it will end up very close to the minimum, but once it
gets there it will continue to bounce around, not settling down there. So once
the algorithm stops, the final parameters are good but not optimal. For this
reason, it is important to use a learning schedule (gradually reducing the learning
rate) to overcome this randomness.
Mini-batch Gradient Descent:
At each step instead of computing the gradients on the whole data set as in the
Batch Gradient Descent or using one random instance as in the Stochastic
Gradient Descent, this algorithm computes the gradients on small random sets
of instances called mini-batches.
Pros:
1. The algorithm's progress in parameter space is less erratic than with Stochastic
Gradient Descent, especially with large mini-batches.
Cons:
1. It can still end up bouncing around the minimum rather than settling exactly on it,
and the mini-batch size is an additional hyperparameter that has to be tuned.
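A compact NumPy sketch of mini-batch gradient descent on a linear-regression loss; batch gradient descent is the special case batch_size = n, and stochastic gradient descent the case batch_size = 1. The synthetic data and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
X = np.c_[np.ones(n), rng.normal(size=(n, 2))]        # add a bias column
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=n)

w = np.zeros(3)
lr, batch_size, epochs = 0.05, 32, 50
for _ in range(epochs):
    idx = rng.permutation(n)                           # shuffle each epoch
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 / len(batch) * Xb.T @ (Xb @ w - yb)   # gradient of MSE on the mini-batch
        w -= lr * grad
print(w)  # should end up close to true_w
```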
Information gain measures the reduction in entropy achieved by splitting on a given
feature; the decision tree chooses the split with the highest information gain, which in
turn minimizes the entropy and best splits the dataset into groups for
effective classification.
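For reference, the standard definitions (stated here in LaTeX notation for completeness; they are not spelled out in the text above) are:

H(S) = -\sum_i p_i \log_2 p_i                                   (entropy of a set S with class proportions p_i)
IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)   (information gain of splitting S on attribute A)

The split chosen at each node is the one with the highest information gain.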
Shapiro–Wilk test: If the p-value is lower than the chosen threshold, then
the null hypothesis (Data is normally distributed) is rejected.
1. Low multicollinearity
you can calculate the VIF (Variance Inflation Factor) using your favorite
statistical tool. If the value for each covariate is lower than 10 (some say 5),
you're good to go.
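A minimal sketch of that check, assuming statsmodels; the toy DataFrame of covariates is made up for illustration.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical covariates; replace with your own design matrix.
X = pd.DataFrame({"sqft": [1400, 1600, 1700, 1875, 1100, 1550],
                  "bedrooms": [3, 3, 4, 4, 2, 3],
                  "age": [10, 15, 7, 3, 30, 12]})
Xc = sm.add_constant(X)

vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
    index=Xc.columns,
)
print(vif)  # values below roughly 5-10 for the covariates suggest low multicollinearity
```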
Q11: Explain briefly the K-Means clustering and how can we
find the best value of K?
K-Means is a well-known clustering algorithm. K-means clustering is often used
because it is easy to interpret and implement.
The elbow method is well-known when finding the best value of K in K-means
clustering. The intuition behind this technique is that the first few clusters will
explain a lot of the variation in the data. However, past a certain point, the
amount of information added is diminishing. Looking at the graph below (figure
1) of the explained variation (on the y-axis) versus the number of clusters K (on
the x-axis), there should be a sharp change in the y-axis at some level of K. In
this particular case, the drop-off is at K=3.
Figure 1. The elbow diagram to find the best value of K in K-Means clustering
The explained variation is quantified by the within-cluster sum of squared
errors. To calculate this error, we compute, for each cluster, the sum of squared
Euclidean distances between each point and its cluster centroid, and add these up
across clusters.
Another popular alternative method to find the value of K is the
silhouette method, which aims to measure how similar a point is to its own cluster
compared to other clusters. It can be calculated with this equation: (x-
y)/max(x,y), where x is the mean distance to the examples of the nearest
cluster, and y is the mean distance to other examples in the same cluster. The
coefficient varies between -1 and 1 for any given point. A value of 1 implies that
the point is in the right cluster and the value of -1 implies that it is in the wrong
cluster. By plotting the silhouette coefficient on the y-axis versus each K we
can get an idea of the optimal number of clusters. However, it is worth noting
that this method is more computationally expensive than the previous one.
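A minimal sketch of both techniques, assuming scikit-learn; the synthetic blobs are just for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances (the elbow curve);
    # silhouette_score is the mean silhouette coefficient over all samples.
    print(k, km.inertia_, silhouette_score(X, km.labels_))
# Look for the "elbow" in inertia and for the K with the highest silhouette score.
```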
Q12: Define Precision, recall, and F1 and discuss the trade-off
between them?
Precision and recall are two classification evaluation metrics that are used
beyond accuracy.
Consider a classification task with many classes. Both metrics are defined for a
particular class, not the model in general. Precision of class, let’s say, A,
indicates the ratio of correct predictions of class A to the total predictions
classified as class A. It is similar to accuracy but applied to a single class.
Therefore, precision may help you judge how likely a given prediction is to be
correct. Recall is the percentage of correctly classified predictions of class A
out of all class A samples present in the test set. It indicates how well our
model can detect the class in question.
In the real world, there is always a trade-off between optimizing for precision
and recall. Consider you are working on a task for classifying cancer patients
from healthy people. Optimizing the model for high recall alone will mean
that the model catches most of the people with cancer, but at the same time
the number of healthy people misdiagnosed with cancer will increase. This will
subject healthy people to dangerous and costly treatments. On the other hand,
optimizing the model to have high precision will make the model confident
about the diagnosis, in favor of missing some people who truly have the
disease. This will lead to fatal outcomes as they will not be treated. Therefore, it
is important to optimize both precision and recall and the percentage of
importance of each of them will depend on the application you are working on.
This leads us to the last point of the question. F1 score is the harmonic mean of
precision and recall, and it is calculated using the following formula: F1 = 2*
(precision*recall) / (precision + recall). The F1 score is used when the recall and
the precision are equally important.
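In terms of the confusion-matrix counts for the class in question (TP = true positives, FP = false positives, FN = false negatives):

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)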
Both mean squared error (MSE) and mean absolute error (MAE) measure the
average distance between the predicted and actual values. MAE is expressed in
the units of the target variable, while MSE is in squared units (its square root,
RMSE, is back in the original units). Both can range from 0 to infinity; the lower
they are, the better the model.
The main difference between them is that in MSE the errors are squared before
being averaged while in MAE they are not. This means that a large weight will
be given to large errors. MSE is useful when large errors in the model are trying
to be avoided. This means that outliers affect MSE more than MAE, that is why
MAE is more robust to outliers.
Computation-wise, MSE is easier to use as its gradient calculation is more
straightforward than that of MAE, which requires techniques such as linear
programming to minimize.
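Written out for n predictions \hat{y}_i of targets y_i:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

The squaring in MSE is what gives large errors, and therefore outliers, their disproportionate weight.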
Q15: Explain the kernel trick in SVM. Why do we use it and how
to choose what kernel to use?
Answer:
Kernels are used in SVM to map the original input data into a particular higher
dimensional space where it will be easier to find patterns in the data and train
the model with better performance.
For example: If we have binary class data which forms a ring-like pattern (inner and
outer rings representing two different class instances) when plotted in 2D
space, a linear SVM kernel will not be able to differentiate the two classes well
when compared to an RBF (radial basis function) kernel, mapping the data into
a particular higher dimensional space where the two classes are clearly
separable.
Typically without the kernel trick, in order to calculate support vectors and
support vector classifiers, we need first to transform data points one by one to
the higher dimensional space, do the calculations based on SVM equations in
the higher dimensional space, and then return the results. The ‘trick’ in the
kernel trick is that we design the kernels based on some conditions as
mathematical functions that are equivalent to a dot product in the higher
dimensional space without even having to transform data points to the higher
dimensional space. i.e. we can calculate support vectors and support vector
classifiers in the same space where the data is provided which saves a lot of
time and calculations.
Having domain knowledge can be very helpful in choosing the optimal kernel
for your problem, however, in the absence of such knowledge following this
default rule can be helpful:
For linear problems, we can try a linear kernel (or a linear model such as logistic
regression), and for nonlinear problems, we can use the RBF (Gaussian) kernel.
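A minimal sketch of the ring-shaped example above, assuming scikit-learn; the dataset parameters are arbitrary.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    print(kernel, clf.score(X_te, y_te))
# Expect roughly chance-level accuracy for "linear" and near-perfect accuracy for "rbf".
```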
Usually, we split the data into train and test sets, where we use the training data to
train our model and the test data to evaluate the performance of the model on unseen
data, plus a validation set for choosing the best hyperparameters. Now, a random split
is fine in most cases (for large datasets). However, for smaller datasets it is
susceptible to losing important information present in the data on which the model
was not trained. Hence cross-validation, though computationally expensive,
combats this issue.
The process of cross-validation is as follows:
1. Shuffle the dataset randomly.
2. Split the dataset into K folds.
3. For each i from 1 to K, train the model using all the folds except fold i and
test on fold i.
4. Average the K validation/test errors from the previous step to get an estimate
of the generalization error.
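A minimal sketch of these steps, assuming scikit-learn; the model and dataset are arbitrary placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)  # K = 5 folds
print(scores, scores.mean())  # per-fold scores (step 3) and their average (step 4)
```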
Q17: You are building a binary classifier and you found that the
data is imbalanced, what should you do to handle this situation?
Answer:
If there is a data imbalance there are several measures we can take to train a
fairer binary classifier:
1. Pre-Processing: resample the training data, either by oversampling the minority
class (for example, generating synthetic minority examples as SMOTE does) or by
undersampling the majority class, so that the classes are better balanced (see the
sketch after this list).
2. In-Processing:
Regularization: We can add terms that measure the data imbalance to the
loss function, so that minimizing the loss function will also minimize the
degree of imbalance with respect to the chosen score, which indirectly
reduces other metrics that measure the degree of data imbalance as well.
Adversarial Debiasing: Here we use the adversarial notion to train the model
where the discriminator tries to detect if there are signs of data imbalance
in the predicted data by the generator and hence the generator learns to
generate data that is less prone to imbalance.
3. Post-Processing:
Odds Equalization: Here we try to equalize the odds across the classes with
respect to which the data is imbalanced, in order to correct for the imbalance in
the trained model. Whichever approach you use, also pick an evaluation metric
that reflects the imbalance; usually, the F1 score is a good choice if both
precision and recall scores are important.
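A minimal sketch of two of the remedies above, assuming scikit-learn: class weighting in the loss (in-processing) and random oversampling of the minority class (pre-processing). The dataset and the 95/5 class ratio are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# In-processing: penalize mistakes on the rare class more heavily.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Pre-processing: randomly oversample the minority class before training.
X_min, y_min = X[y == 1], y[y == 1]
X_over, y_over = resample(X_min, y_min, n_samples=(y == 0).sum(), random_state=0)
X_bal = np.vstack([X[y == 0], X_over])
y_bal = np.concatenate([y[y == 0], y_over])
plain = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
```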
For a single sample, the silhouette coefficient is S = (b - a) / max(a, b), where a is the
mean distance to the other points in the same cluster and b is the mean distance to
the points in the nearest other cluster.
The Silhouette coefficient for a set of samples is given as the mean of the
Silhouette Coefficient for each sample. The score is bounded between -1 for
incorrect clustering and +1 for highly dense clustering. Scores around zero
indicate overlapping clusters. The score is higher when clusters are dense and
well separated, which relates to a standard concept of a cluster.
Dunn’s Index
Dunn’s Index (DI) is another metric for evaluating a clustering algorithm. Dunn’s
Index is equal to the minimum inter-cluster distance divided by the maximum
cluster size. Note that large inter-cluster distances (better separation) and
smaller cluster sizes (more compact clusters) lead to a higher DI value. A
higher DI implies better clustering. It assumes that better clustering means that
clusters are compact and well-separated from other clusters.
Q19: What is the ROC curve and when should you use it?
Answer:
The ROC curve, or Receiver Operating Characteristic curve, is a graphical
representation of the model's performance where we plot the True Positive
Rate (TPR) against the False Positive Rate (FPR) for different threshold values
(between 0 and 1) applied to the model's output to obtain hard classifications.
The ROC curve is mainly used to compare two or more models, as shown in the
figure below. It is easy to see that a reasonable model will always give an FPR
(since it is an error rate) lower than its TPR, so the curve hugs the upper-left
corner of the unit square spanned by the TPR and FPR axes.
The larger the AUC (area under the curve) of a model's ROC curve, the better
the model is at trading off TPR against FPR.
Here are some benefits of using the ROC Curve :
Can help prioritize either true positives or true negatives depending on your
case study (Helps you visually choose the best hyperparameters for your
case)
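A minimal sketch of computing and plotting a ROC curve, assuming scikit-learn and matplotlib; the model and dataset are arbitrary placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]            # continuous scores, not hard labels

fpr, tpr, thresholds = roc_curve(y_te, scores)    # TPR vs FPR over all thresholds
print("AUC:", roc_auc_score(y_te, scores))
plt.plot(fpr, tpr); plt.xlabel("FPR"); plt.ylabel("TPR"); plt.show()
```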
Hard Voting: We take into account the class predictions for each classifier
and then classify an input based on the maximum votes to a particular
class.
Soft Voting: We take into account the probability predictions for each class
by each classifier and then assign an input to the class with the maximum
average probability (averaged over the classifiers' predicted probabilities)
for that class.
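A minimal sketch of both voting schemes, assuming scikit-learn; the base estimators and dataset are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),  # soft voting needs predict_proba
]

hard = VotingClassifier(estimators, voting="hard")   # majority of the class votes
soft = VotingClassifier(estimators, voting="soft")   # argmax of the averaged probabilities
print(cross_val_score(hard, X, y, cv=5).mean(), cross_val_score(soft, X, y, cv=5).mean())
```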
Boosting refers to any Ensemble method that can combine several weak
learners into a strong learner. The general idea of most boosting methods is to
train predictors sequentially, each trying to correct its predecessor.
There are many boosting methods available, but by far the most popular are:
Adaptive Boosting: One way for a new predictor to correct its predecessor
is to pay a bit more attention to the training instances that the predecessor
under-fitted. This results in new predictors focusing more and more on the
hard cases.
Q22: How can you evaluate the performance of a
dimensionality reduction algorithm on your dataset?
Answer:
Intuitively, a dimensionality reduction algorithm performs well if it eliminates a
lot of dimensions from the dataset without losing too much information. One
way to measure this is to apply the reverse transformation and measure the
reconstruction error. However, not all dimensionality reduction algorithms
provide a reverse transformation.
Alternatively, if you are using dimensionality reduction as a preprocessing step
before another Machine Learning algorithm (e.g., a Random Forest classifier),
then you can simply measure the performance of that second algorithm; if
dimensionality reduction did not lose too much information, then the algorithm
should perform just as well as when using the original dataset.
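A minimal sketch of the reconstruction-error idea, assuming scikit-learn and PCA (which does support an inverse transform); the dataset and the number of components are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA(n_components=30).fit(X)
X_reduced = pca.transform(X)                 # project down to 30 dimensions
X_recovered = pca.inverse_transform(X_reduced)  # map back to the original space

reconstruction_error = np.mean((X - X_recovered) ** 2)
print(reconstruction_error, pca.explained_variance_ratio_.sum())
```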
Possible solutions are:
Remove irrelevant features (features that do not discriminate between classes, are
highly correlated with others, or do not bring much improvement). To do this, we can use:
Mini-Batch K-Means: instead of using the full dataset at each iteration, this variant
of K-Means uses small random mini-batches to move the centroids just slightly at
each step. This speeds up the algorithm typically by a factor of 3 or 4 and makes it
possible to cluster huge datasets that do not fit in memory. Scikit-Learn implements
this algorithm in the MiniBatchKMeans class.
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)
is a clustering algorithm that can cluster large datasets by first generating a
small and compact summary of the large dataset that retains as much
information as possible. This smaller summary is then clustered instead of
clustering the larger dataset.
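A minimal sketch of both algorithms, assuming scikit-learn; the synthetic blobs and cluster counts are arbitrary.

```python
from sklearn.cluster import Birch, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)

mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0).fit(X)  # mini-batch updates
birch = Birch(n_clusters=5).fit(X)   # builds a compact CF-tree summary first, then clusters it
print(mbk.cluster_centers_.shape, len(set(birch.labels_)))
```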
Q26: Do you need to scale your data if you will be using the
SVM classifier and discuss your answer
Answer:
Yes, feature scaling is required for SVM and all margin-based classifiers since
the optimal hyperplane (the decision boundary) is dependent on the scale of
the input features. In other words, the distance between two observations will
differ for scaled and non-scaled cases, leading to different models being
generated.
This can be seen in the figure below: when the features have different scales,
the decision boundary and the support vectors end up classifying using only the
X1 feature, without taking the X0 feature into consideration. However, after
scaling the data to the same scale, the decision boundary and support vectors
look much better and the model takes both features into account.
To scale the data, normalization, and standardization are the most popular
approaches.
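A minimal sketch, assuming scikit-learn: put the scaler and the SVM in a single pipeline so the scaling fitted on the training data is reused at prediction time (the dataset and kernel choice are arbitrary placeholders).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # features with very different scales
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)
raw_svm = SVC(kernel="rbf").fit(X_tr, y_tr)
print(scaled_svm.score(X_te, y_te), raw_svm.score(X_te, y_te))  # scaling usually improves the score
```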
Q27: What are Loss Functions and Cost Functions? Explain the
key Difference Between them.
Answer:
The loss function is the measure of the performance of the model on a single
training example, whereas the cost function is the average loss function over all
training examples or across the batch in the case of mini-batch gradient
descent.
Some examples of loss functions are Mean Squared Error, Binary Cross
Entropy, etc.
The cost function, in turn, is the average of such loss functions over the
training examples.
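Written out for a training set of m examples and model parameters \theta:

Loss on a single example: L(\hat{y}^{(i)}, y^{(i)}), for example the squared error (\hat{y}^{(i)} - y^{(i)})^2
Cost over the training set (or mini-batch): J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})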
For classification, the following loss functions are used:
Gini's Index
Gini impurity estimates the likelihood of a randomly chosen example being
incorrectly classified by a particular node. It's referred to as an
"impurity" measure because it shows how far the node is from being a
pure split (one that contains a single class).
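For a node whose examples have class proportions p_1, ..., p_K, the Gini impurity is:

G = 1 - \sum_{k=1}^{K} p_k^2

It equals 0 for a perfectly pure node and grows as the classes become more mixed; the split chosen is the one that minimizes the weighted Gini impurity of the resulting child nodes.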
For regression, the good old mean squared error serves as a good loss function
which is minimized by splits of the input features and predicting the mean value
of the target feature on the subspaces resulting from the split. But finding the
partition of the feature space that results in the minimum possible residual sum
of squares is computationally infeasible, so a greedy top-down approach is taken,
i.e. at each level the split that results in the maximum reduction of RSS is chosen.
We continue this until some maximum depth or number of leaves is attained.
1. Stream-based sampling
In stream-based selective sampling, unlabelled data is continuously fed to
an active learning system, where the learner decides whether or not to send it
to a human oracle based on a predefined learning strategy.
This method is apt in scenarios where the model is in production and the
data sources/distributions vary over time.
2. Pool-based sampling
In this case, the data samples are chosen from a pool of unlabelled data
based on the informative value scores and sent for manual labeling. Unlike
stream-based sampling, oftentimes, the entire unlabelled dataset is
scrutinized for the selection of the best instances.
Content-based filtering helps avoid a cold start for new products: as it doesn't rely
on other users' feedback, it can recommend products based on item-similarity
factors. However, content-based filtering needs a lot of domain knowledge for the
recommendations to be accurate.
So, instead of focusing on just one user, the collaborative filtering system
focuses on all the users and clusters them according to their interests.
Basically, it recommends a product 'x' to user 'a' based on the interest of user
'b'; users 'a' and 'b' must have had similar interests in the past, which is why
they are clustered together.
Collaborative filtering requires less domain knowledge, its recommendations tend
to be more accurate, and it can adapt to users' changing tastes over time. However,
collaborative filtering faces the problem of a cold start, as it heavily relies on
feedback or activity from other users.
More modern approaches typically fall into the hybrid filtering category and
tend to work in two stages:
1. A candidate generation phase, where we coarsely generate candidates, narrowing
a corpus of hundreds of thousands, millions, or billions of items down to a few
hundred or thousand.
2. A ranking phase, where we re-rank the candidates into a final top-n set to
be shown to the user. Some systems employ multiple candidate generation
methods and rankers.
Q33: What are the evaluation metrics that can be used for
multi-label classification?
Answer:
1. Hamming Loss: Hamming Loss is the fraction of labels that are incorrectly
predicted. It is defined as the average number of labels that are predicted
incorrectly per instance (see the formula after this list).
Precision at k (P@k)
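For reference, over N instances and L labels the Hamming loss can be written as:

Hamming loss = \frac{1}{N L} \sum_{i=1}^{N} \sum_{j=1}^{L} 1[\hat{y}_{ij} \neq y_{ij}]

i.e. the fraction of the N * L individual label predictions that are wrong; lower is better.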
Q34: What is the difference between concept and data drift and
how to overcome each of them?
Answer:
Concept drift and data drift are two different types of problems that can occur
in machine learning systems. Concept drift refers to a change over time in the
relationship between the input features and the target variable: the patterns the
model learned no longer hold, even if the distribution of the inputs themselves
stays the same.
Data drift, on the other hand, refers to changes in the input data itself over time.
This means that the values of the input feature that the model was trained on
no longer match the values of the input features in the data it is being tested
on. For example, a model that was trained on data from a particular
geographical region may not be as effective at predicting outcomes for data
from a different region.
To overcome concept drift, one approach is to use online learning methods that
allow the model to adapt to new data as it arrives. This involves continually
training the model on the most recent data while using historical data to
maintain context. Another approach is to periodically retrain the model using a
representative sample of the most recent data.
To overcome data drift, one approach is to monitor the input data for changes
and retrain the model when significant changes are detected. This may involve
setting up a monitoring system that alerts the user when the data distribution
changes beyond a certain threshold.
Another approach is to preprocess the input data to remove or mitigate the
effects of the features changing over time so that the model can continue
learning from the remaining features.
Q35: Can you explain the ARIMA model and its components?
Answer:
The ARIMA model, which stands for Autoregressive Integrated Moving Average,
is a widely used time series forecasting model. It combines three key
components: Autoregression (AR), Differencing (I), and Moving Average (MA).
Autoregression (AR):
The autoregressive component captures the relationship between an
observation in a time series and a certain number of lagged observations. It
assumes that the value at a given time depends linearly on its own previous
values. The "p" parameter in ARIMA(p, d, q) represents the order of
autoregressive terms. For example, ARIMA(1, 0, 0) refers to a model with
one autoregressive term.
Differencing (I):
Differencing is used to make a time series stationary by removing trends or
seasonality. It calculates the difference between consecutive observations
to eliminate any non-stationary behavior. The "d" parameter in ARIMA(p, d,
q) represents the order of differencing. For instance, ARIMA(0, 1, 0)
indicates that differencing is applied once.
Moving Average (MA):
The moving average component models the relationship between an observation
and the residual errors of past predictions, i.e. it uses past forecast errors to
correct the current prediction. The "q" parameter in ARIMA(p, d, q) represents the
order of moving average terms; for example, ARIMA(0, 0, 1) has one moving-average term.
By combining these three components, the ARIMA model can capture
autoregressive patterns and temporal dependencies while handling non-stationary
behavior in a time series. The parameters p, d, and q are typically determined through
techniques like the Akaike Information Criterion (AIC) or Bayesian Information
Criterion (BIC).
It's worth noting that there are variations of the ARIMA model, such as SARIMA
(Seasonal ARIMA), which incorporates additional seasonal components for
modeling seasonal patterns in the data.
ARIMA models are widely used in forecasting applications, but they do make
certain assumptions about the underlying data, such as linearity and
stationarity. It's important to validate these assumptions and adjust the model
accordingly if they are not met.
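A minimal sketch of fitting an ARIMA model, assuming statsmodels; the synthetic series and the (1, 1, 1) order are illustrative, and in practice the order is chosen by comparing AIC/BIC across candidates.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 200)))   # trending (non-stationary) series

model = ARIMA(y, order=(1, 1, 1)).fit()   # p=1 AR term, d=1 difference, q=1 MA term
print(model.aic)                          # compare AIC across candidate orders
print(model.forecast(steps=5))            # next five predicted values
```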
Q36: What are the assumptions made by the ARIMA model?
Answer:
The ARIMA model makes several assumptions about the underlying time series
data. These assumptions are important to ensure the validity and accuracy of
the model's results. Here are the key assumptions:
Stationarity: The ARIMA model assumes that the time series is stationary.
Stationarity means that the statistical properties of the data, such as the mean
and variance, remain constant over time. This assumption is crucial for the
autoregressive and moving average components to hold. If the time series is
non-stationary, differencing (the "I" component) is applied to transform it into a
stationary series.
Linearity: The ARIMA model assumes that the relationship between the
observations and the lagged values is linear. It assumes that the future values
of the time series can be modeled as a linear combination of past values and
error terms.
No Autocorrelation in Residuals: The ARIMA model assumes that the residuals
(the differences between the predicted values and the actual values) do not
exhibit any autocorrelation. In other words, the errors are not correlated with
each other.
Normally Distributed Residuals: The ARIMA model assumes that the residuals
follow a normal distribution with a mean of zero. This assumption is necessary
for statistical inference, parameter estimation, and hypothesis testing.
It's important to note that while these assumptions are commonly made in
ARIMA modeling, they may not always hold in real-world scenarios. It's
essential to assess the data and, if needed, apply transformations or consider
alternative models that relax some of these assumptions. Additionally,
diagnostics tools, such as residual analysis and statistical tests, can help
evaluate the adequacy of the assumptions and the model's fit to the data.
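A minimal sketch of checking two of these assumptions, assuming statsmodels: the augmented Dickey-Fuller test for stationarity and the Ljung-Box test for autocorrelation in the residuals (the series and the ARIMA order are illustrative assumptions).

```python
import numpy as np
import pandas as pd
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(0.2, 1.0, 300)))

adf_stat, adf_pvalue, *_ = adfuller(y)        # H0: unit root (non-stationary)
print("ADF p-value:", adf_pvalue)             # a large p-value suggests differencing is needed

res = ARIMA(y, order=(1, 1, 1)).fit()
print(acorr_ljungbox(res.resid, lags=[10]))   # H0: no autocorrelation in the residuals
```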