Data Science Interview Questions
2. How would you describe linear regression to someone who isn't familiar with the
subject?
4. How will you verify if the items present in list A are present in series B?
We will use the isin() function. For this, we create two series s1 and s2 –

import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([4, 5, 6, 7, 8])
s1[s1.isin(s2)]   # returns the elements of s1 that also appear in s2 (4 and 5)
5. How to find the positions of numbers that are multiples of 4 from a series?
For finding the multiples of 4, we will use NumPy's argwhere() function. First, we
create a series of 10 numbers –

import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
np.argwhere(s1 % 4 == 0)
Output > array([[3], [7]])   # positions of 4 and 8 (zero-based indexing)
Deep Learning is the branch of machine learning in Data Science that uses layered
neural networks modelled loosely on the functioning of the human brain. It is a
machine learning paradigm in its own right.
Answer: To become a certified Data Scientist, you'll need the following skills:
● Knowledge of the built-in data types: lists, tuples, sets, and related types.
● Expertise in N-dimensional NumPy arrays.
● Ability to work with Pandas DataFrames.
● A solid grasp of element-wise vectorised operations and their performance benefits.
● Understanding of matrix operations on NumPy arrays.
Cost functions are a technique for determining how good a model's performance is.
They account for the losses and errors that occur at the output layer during the
backpropagation process: the errors in the neural network are pushed backwards,
and various training functions are applied.
14. What does it mean to iterate in Data Science? Could you give an example?
In Data Science, an iteration is one pass over a single batch of data, so the data is
effectively processed in several groups. For example, if there are 50,000 photos and
the batch size is 100, one epoch will run 500 iterations.
An epoch, in turn, is one complete pass of the learning model over the entire dataset.
A CNN is built from four different types of layers. They are listed below, with a code
sketch after the list.
● Convolutional Layer: This layer creates a series of little picture windows to go
through the data.
● ReLU Layer: This layer aids in the creation of non-linearity in the network by
converting negative pixels to zero, resulting in a corrected feature map.
● Pooling Layer: This layer decreases the feature map's dimensionality.
● Fully Connected Layer: This layer recognises and categorises the image's
objects.
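As an illustration, here is a minimal sketch of these four layer types using the Keras API from TensorFlow (one of the frameworks listed later in this document); the 28x28 grayscale input shape and the 10 output classes are assumptions made only for this example.

# Minimal CNN sketch: convolution -> ReLU -> pooling -> fully connected.
# Assumes 28x28 grayscale inputs and 10 target classes.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), input_shape=(28, 28, 1)),  # convolutional layer: slides small windows over the image
    layers.ReLU(),                                       # ReLU layer: sets negative activations to zero
    layers.MaxPooling2D((2, 2)),                         # pooling layer: reduces the feature map's dimensionality
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),              # fully connected layer: classifies the image
])
model.summary()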
RNN stands for Recurrent Neural Network. RNNs are a type of artificial neural
network designed for sequential data, such as financial markets, time series, and a
variety of other things. They extend feedforward networks with recurrent
connections, so that information from earlier steps in a sequence can influence
later predictions.
Pooling is a technique that is used to lower the spatial dimensions of a CNN. It
performs downsampling for dimensionality reduction and produces pooled feature
maps. In a CNN, pooling is applied by sliding a filter window over the input feature
map.
21. What does LSTM stand for in full? What is its purpose?
LSTM stands for Long Short-Term Memory. It is a type of recurrent neural network
that, by design, is capable of learning long-term dependencies and recalling
information for a longer length of time.
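A minimal Keras sketch of an LSTM layer follows; the sequence length of 50, the single input feature, and the binary target are assumptions made only for illustration.

# Minimal LSTM sketch for sequence data (e.g. a time series).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.LSTM(32, input_shape=(50, 1)),   # LSTM cell retains information across long sequences
    layers.Dense(1, activation="sigmoid"),  # binary prediction for each sequence
])
model.compile(optimizer="adam", loss="binary_crossentropy")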
Vanishing gradients is a condition that occurs when the gradients used to update an
RNN's weights during training become extremely small. Poor performance, low
accuracy, and very long training times are all consequences of vanishing gradients.
The following are some of the different types of Deep Learning Frameworks:
● Caffe
● Keras
● TensorFlow
● Pytorch
● Chainer
● Microsoft Cognitive Toolkit
25. What is the purpose of an Activation function?
An activation function introduces non-linearity into a neural network so that it can
learn complex relationships; without one, the network would reduce to a linear
transformation of its inputs.
26. What are the different Machine Learning Libraries and what are their
advantages?
The following are the various machine learning libraries and their advantages.
● Numpy is a scientific computing library.
● Statsmodels is a time-series analysis tool.
● Pandas is a tool for analysing tabular data.
● Scikit-learn is a data modelling and pre-processing toolkit.
● Tensorflow: Tensorflow is a deep learning framework.
● Regular Expressions are used to process text.
● Pytorch: Pytorch is a deep learning framework.
● NLTK is a text processing library.
Auto-encoders are learning networks designed to reproduce their inputs at the
output with the fewest possible errors, i.e. to keep the output as close to the input as
possible. The process requires constructing layers between the input and the output;
for speedier processing, these layers are kept as small as possible.
Dropout is a technique in Data Science used to randomly drop the hidden and
visible units of a network during training. By removing a fraction of the nodes
(commonly around 20%), it prevents the network from overfitting the data while still
allowing it to converge within the required iterations.
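For illustration, a Dropout layer with a 20% drop rate can be placed between layers in Keras; the surrounding layer sizes and the 20 input features below are assumptions for the sketch.

# Dropout sketch: randomly drops ~20% of the units during training to reduce overfitting.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(20,)),  # assumed 20 input features
    layers.Dropout(0.2),                                      # drops about 20% of nodes each training step
    layers.Dense(1, activation="sigmoid"),
])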
30. Why is Tensorflow a top priority when it comes to learning Data Science?
Tensorflow has a high priority in learning Data Science because it supports
programming languages such as C++ and Python. This enables various data science
processes to compile and complete faster, within the specified time range, compared
with the traditional Keras and Torch libraries. TensorFlow also supports computation
on both CPU and GPU for faster input, editing, and analysis of data.
GAN is made up of two essential components. The following are some of them:
● Generator: The Generator functions as a forger, creating forged copies.
● Discriminator: The Discriminator acts as a distinguisher between fraudulent
and genuine copies.
34. Please describe what a Boltzmann Machine is and how it works.
A Boltzmann Machine has a simple learning method that allows it to find exciting
features in the training data that signify complicated regularities. It's mostly used to
find the best quantity and weight for a given problem.
In networks with numerous layers of feature detectors, the Boltzmann Machine's
simple learning process is quite sluggish.
Autoencoders are simple learning networks that convert inputs into outputs with the
least amount of error feasible. It indicates that the outputs are fairly similar to the
inputs.
Between the input and the output, a few layers are added, with each layer's size
being smaller than the input layer. An autoencoder takes unlabeled input and
encodes it in order to reconstruct the output.
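A minimal dense autoencoder sketch in Keras follows; the 784-dimensional input (for example, a flattened 28x28 image) and the 32-dimensional bottleneck are assumptions made only for the example.

# Autoencoder sketch: hidden layers are smaller than the input,
# and the network is trained to reconstruct its own input.
from tensorflow.keras import layers, models

autoencoder = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(784,)),  # encoder
    layers.Dense(32, activation="relu"),                      # bottleneck (compressed representation)
    layers.Dense(64, activation="relu"),                      # decoder
    layers.Dense(784, activation="sigmoid"),                  # reconstruction of the input
])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=10)   # note: the inputs are also the targets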
37. What are outlier values, and how should they be handled?
Outlier values, or simply outliers, are statistical data points that do not belong to a
certain population. An outlier value is an unusual observation that differs
significantly from the other values in the collection.
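One common way to flag such observations (an assumption here, since the text does not prescribe a method) is the interquartile-range rule, sketched below with pandas on made-up data.

# IQR rule sketch: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])   # toy data; 95 is an unusual observation
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)   # flags 95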
39. What are the differences between linear and logistic regression?
Linear regression is a statistical approach in which the score of one variable Y is
predicted based on the score of another variable X, known as the predictor variable.
The criterion variable is represented by the Y variable.
Logistic regression, often known as the logit model, is a statistical technique for
predicting a binary result using a linear combination of predictor variables.
A validation set is a subset of the training set that is used to select parameters and
avoid overfitting in the machine learning model being built. A test set, on the other
hand, is used to evaluate or test the performance of a trained machine learning
model.
42. Which of Python and R would you use for text analytics, and why?
A/B testing is a statistical hypothesis test used in a randomised experiment with two
variants, A and B. The purpose of A/B testing is to identify which changes to a
webpage maximise the likelihood of an outcome of interest.
A/B Testing is a highly reliable method for determining the best online marketing
and promotional strategies for a company. It can be used to test anything from sales
emails to search ads and website copy.
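As a sketch, conversion counts for two page variants can be compared with a chi-squared test from SciPy; the counts below are made up purely for illustration.

# A/B test sketch: do variants A and B convert at different rates?
from scipy.stats import chi2_contingency

#        converted  not converted
table = [[120, 880],    # variant A
         [150, 850]]    # variant B
chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)   # a small p-value suggests the variants differ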
44. What are the differences between univariate, bivariate, and multivariate analysis?
Univariate, bivariate, and multivariate analyses are the three types of analysis,
distinguished by the number of variables involved.
● Univariate analysis refers to descriptive statistical analysis of a single variable;
a pie chart of one categorical variable is an example.
● Bivariate analysis examines the relationship between two variables; a
scatterplot of sales against spending is an example.
● Multivariate analysis uses more than two variables and explains the combined
effect of the variables on the responses.
45. How do you choose the best K value for K-means clustering?
The elbow technique and the kernel approach are two ways of determining the
number of centroids for a given clustering. We can also rapidly approximate the
number of centroids by taking the square root of half the number of data points
(k ≈ √(n/2)). While this rule of thumb isn't completely accurate, it is faster than the
other methods. A sketch of the elbow technique follows.
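The sketch below uses scikit-learn; the synthetic data and the candidate range of k values are assumptions for the example.

# Elbow technique sketch: track inertia (within-cluster sum of squares) for each k
# and look for the "elbow" where the improvement levels off.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))          # toy data; replace with the real feature matrix

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
print(inertias)                        # the k at the bend of this curve is a good choice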
Hadoop allows data scientists to work with vast amounts of unstructured data.
Furthermore, new Hadoop extensions like Mahout and PIG offer a variety of
functionalities for analysing and implementing machine learning algorithms on
massive data sets. As a result, Hadoop is a complete system capable of processing a
wide range of data types, making it an ideal tool for data scientists.
47. Could you provide me with some examples of NoSQL databases?
Redis, MongoDB, Cassandra, HBase, Neo4j, and other NoSQL databases are some of
the most popular.
SQL is used with Relational Database Management Systems (RDBMS). This sort of
database holds structured data in the form of tables, organised into rows and
columns. NoSQL, on the other hand, is a query language for non-relational database
management systems, where the data is unstructured. Structured data most
commonly comes from services, devices, and software systems, whereas unstructured
data is generated directly by consumers and is growing by the day.
49. What exactly do you mean when you say "data integrity"?
Data integrity allows us to define the data's accuracy as well as consistency. This
integrity must be maintained throughout the life of the product.
A foreign key is a column (or set of columns) in one table that references the primary
key of another table. We link the foreign key to the primary key of the other table to
create a relationship between the two tables.
To delete specific rows from a table, use the DELETE command in conjunction with a
WHERE clause; this action can be rolled back.
TRUNCATE, on the other hand, is used to delete all the rows in a table, and this
action cannot be rolled back.
The pickle module is used to serialise and deserialise objects in Python. Pickling
converts an object structure into a byte stream so that the object can be saved to the
hard drive; unpickling reverses the process.
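A minimal sketch of pickling and unpickling a Python object follows; the file name and the object itself are arbitrary examples.

# Pickle sketch: serialise an object to disk, then load it back.
import pickle

model_params = {"alpha": 0.5, "features": ["age", "income"]}   # any Python object

with open("params.pkl", "wb") as f:
    pickle.dump(model_params, f)       # object structure -> byte stream on disk

with open("params.pkl", "rb") as f:
    restored = pickle.load(f)          # byte stream -> object structure
print(restored == model_params)        # True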
Recall is the fraction of actual positive instances that the model correctly identifies,
i.e. TP / (TP + FN). Precision, on the other hand, is the fraction of instances predicted
positive that are genuinely positive, i.e. TP / (TP + FP). In other words, precision
measures how trustworthy the positive predictions are, whereas recall measures how
many of the true positives are retrieved.
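For illustration, here is a minimal sketch of these formulas with scikit-learn; the labels are made up.

# Precision = TP / (TP + FP), Recall = TP / (TP + FN); the labels below are illustrative.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(precision_score(y_true, y_pred))  # 3 TP / (3 TP + 1 FP) = 0.75
print(recall_score(y_true, y_pred))     # 3 TP / (3 TP + 1 FN) = 0.75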
55. What does it mean to have the "curse of dimensionality"? What are our options
for resolving it?
Sometimes a dataset has an excessive number of variables or columns, yet only a few
of them carry significant information. Suppose there are a thousand features, but we
only need to extract a handful of key characteristics. The 'curse of dimensionality'
refers to the difficulties that arise from having many features when only a few are
required.
There are a variety of dimensionality reduction algorithms available to resolve it,
such as PCA (Principal Component Analysis).
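A minimal PCA sketch with scikit-learn follows, reducing an assumed 1,000-feature matrix to 10 components; the sizes are purely illustrative.

# PCA sketch: project many correlated features onto a few principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))       # toy matrix: 200 samples, 1,000 features

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (200, 10)
print(pca.explained_variance_ratio_)   # share of variance kept by each component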
The Box Cox Transformation is used to convert the response variable so that the data
matches the required assumptions. This technique can be used to convert
non-normal dependent variables into normal shapes. With the aid of this
transformation, we can run a larger number of tests.
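A minimal sketch using SciPy's Box Cox transform follows; the skewed, strictly positive toy data is an assumption (Box Cox requires positive values).

# Box Cox sketch: transform a skewed, positive variable towards normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=500)      # skewed, strictly positive toy data

y_transformed, best_lambda = stats.boxcox(y)  # lambda chosen by maximum likelihood
print(best_lambda)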
57. Why don't gradient descent algorithms always lead to the same result?
This is because they sometimes converge to a local minimum or local optimum point;
the methods aren't always successful in reaching the global minimum. The outcome
also depends on the data, the learning rate, and the starting point of the descent.
We compute the significance of the results after performing a hypothesis test. The
p-value lies between 0 and 1. If the p-value is less than 0.05, the null hypothesis is
rejected; if it is larger than 0.05, the null hypothesis cannot be rejected.
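As an illustration, a two-sample t-test from SciPy returns a p-value that can be compared with the 0.05 threshold; the two samples below are made up.

# Hypothesis test sketch: compare two sample means and inspect the p-value.
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.1, 5.9, 6.0, 6.2]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
if p_value < 0.05:
    print("Reject the null hypothesis")          # the means likely differ
else:
    print("Fail to reject the null hypothesis")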
60. What exactly is SVM? Can you name some of the SVM kernels?
SVM stands for Support Vector Machine. SVMs are used for tasks such as
classification and prediction. An SVM fits a separating plane, called a hyperplane,
that divides the two classes of data. The following are some of the kernels used in
SVM (a code sketch follows the list):
● Polynomial Kernel
● Gaussian Kernel
● Laplace RBF Kernel
● Sigmoid Kernel
● Hyperbolic Kernel
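Here is a minimal sketch of an SVM classifier in scikit-learn; the iris data and the choice of an RBF (Gaussian) kernel are assumptions for the example.

# SVM sketch: fit a classifier with a Gaussian (RBF) kernel on toy data.
from sklearn import datasets
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
clf = SVC(kernel="rbf")            # other options include "poly" and "sigmoid"
clf.fit(X, y)
print(clf.predict(X[:5]))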
A confusion matrix is a table that shows how well a supervised learning system
performs. It gives a summary of categorization problem prediction outcomes. You
can use the confusion matrix to not only determine the predictor's errors, but also the
types of errors.
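For illustration, a minimal confusion matrix sketch with scikit-learn on made-up labels:

# Confusion matrix sketch: rows are actual classes, columns are predicted classes.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[3 1]    3 true negatives, 1 false positive
#  [1 3]]   1 false negative, 3 true positives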
The assumptions and probabilities generated for the characteristics in Naive Bayes
are independent of each other. The premise of feature independence is what
distinguishes Naive Bayes.
Precision and recall compare different aspects of a classifier: precision is calculated
as TP / (TP + FP) and recall as TP / (TP + FN). The ROC curve, on the other hand,
plots the True Positive Rate against the False Positive Rate, and AUC measures the
area under that curve.
68. What are some of the most common data quality issues encountered when
working with Big Data?
When dealing with big data, some of the major quality issues are duplicate data,
incomplete data, inconsistent data format, incorrect data, volume of data (big data),
lack of a proper storage mechanism, and so on.
70. What is the distinction between a tree map and a heat map?
A heat map is a type of visualisation tool that uses colours and size to compare
different categories. It is useful for comparing two different measures. The ‘tree map'
is a type of chart that depicts hierarchical data or part-to-whole relationships.
71. What is the difference between data disaggregation and data aggregation?
Scatterplot matrices are the most commonly used method for visualising
multidimensional data. It is used to depict bivariate relationships between a set of
variables.
groupby is a pandas function used to group rows according to specific columns that
contain the categories used for data grouping.
Data aggregation is the process of applying aggregate functions to obtain the
required results following a groupby. Sum, count, avg, max, and min are examples of
common aggregation functions.
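A small pandas sketch of groupby followed by aggregation; the toy sales data is made up for illustration.

# groupby + aggregation sketch: total, average, and count of sales per region.
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "sales":  [100, 150, 80, 120, 90],
})
summary = df.groupby("region")["sales"].agg(["sum", "mean", "count"])
print(summary)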
SAS: one of the most widely used analytics tools, adopted by some of the biggest
companies on earth. It offers some of the best statistical functions and a graphical
user interface, but it comes with a price tag, so it cannot be readily adopted by
smaller enterprises.
R: the best part about R is that it is an open-source tool, and hence it is used
generously by academia and the research community. It is a robust tool for statistical
computation, graphical representation, and reporting. Due to its open-source nature,
it is constantly updated with the latest features, which are readily available to
everybody.
The R programming language includes a set of software suites that is used for
graphical representation, statistical computing, data manipulation and calculation.
Statistics helps Data Scientists to look into the data for patterns, hidden insights and
convert Big Data into Big insights. It helps to get a better idea of what the customers
are expecting. Data Scientists can learn about consumer behavior, interest,
engagement, retention and finally conversion all through the power of insightful
statistics. It helps them to build powerful data models in order to validate certain
inferences and predictions. All this can be converted into a powerful business
proposition by giving users what they want precisely when they want it.
Once the data is cleaned, it conforms to the rules of the data sets in the system.
Data cleansing is an essential part of data science because the data can be prone to
error due to human negligence, corruption during transmission or storage among
other things. Data cleansing takes a huge chunk of time and effort of a Data Scientist
because of the multiple sources from which data emanates and the speed at which it
comes.
As the name suggests these are analysis methodologies having a single, double or
multiple variables.
A univariate analysis involves a single variable, so there are no relationships or
causes to examine. The major aim of univariate analysis is to summarise the data
and find patterns within it in order to make actionable decisions.
A bivariate analysis deals with the relationship between two sets of data. These sets
of paired data come from related sources or samples. There are various tools to
analyse such data, including chi-squared tests and t-tests when the data show a
correlation.
If the data can be quantified then it can be analyzed using a graph plot or a scatter
plot. The strength of the correlation between the two data sets will be tested in a
Bivariate analysis.
Root cause analysis was initially developed to analyse industrial accidents, but it is
now widely used in other areas. It is basically a problem-solving technique used for
isolating the root causes of faults or problems. A factor is called a root cause if its
removal from the problem-fault sequence prevents the final undesirable event from
recurring.
89. Explain Cross-validation.?
The goal of cross-validation is to set aside part of the data to test the model during
the training phase (i.e. a validation data set) in order to limit problems like
overfitting and to gain insight into how the model will generalise to an independent
data set.
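A minimal scikit-learn sketch of k-fold cross-validation follows; the dataset and the choice of 5 folds are assumptions for the example.

# Cross-validation sketch: score a model on 5 different train/validation splits.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())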
Collaborative filtering is the filtering process used by most recommender systems to
find patterns or information by combining multiple perspectives, numerous data
sources, and several agents.
No, they do not, because in some cases they reach a local minimum or local optimum
point rather than the global optimum. This is governed by the data and the starting
conditions.
It is a theorem that describes the result of performing the same experiment a large
number of times. This theorem forms the basis of frequency-style thinking. It says
that the sample mean, the sample variance and the sample standard deviation
converge to what they are trying to estimate.
It is a traditional database schema with a central table. Satellite tables map ID’s to
physical name or description and can be connected to the central fact table using the
ID fields; these tables are known as lookup tables, and are principally useful in
real-time applications, as they save a lot of memory. Sometimes star schemas involve
several layers of summarization to recover information faster.
101. What Are The Types Of Biases That Can Occur During Sampling?
● Selection bias
● Under coverage bias
● Survivorship bias
“Suppose that we are interested in estimating the average height among all people.
Collecting data for every person in the world is impossible. While we can’t obtain a
height measurement from everyone in the population, we can still sample some
people. The question now becomes, what can we say about the average height of the
entire population given a single sample. The Central Limit Theorem addresses this
question exactly.”
104. What is sampling? How many sampling methods do you know?
“A type I error occurs when the null hypothesis is true, but is rejected. A type II error
occurs when the null hypothesis is false, but erroneously fails to be rejected.”
106. What is linear regression? What do the terms p-value, coefficient, and r-squared
value mean? What is the significance of each of these components?
A linear regression is a good tool for quick predictive analysis: for example, the price
of a house depends on a myriad of factors, such as its size or its location. In order to
see the relationship between these variables, we need to build a linear regression,
which predicts the line of best fit between them and can help conclude whether or
not these two factors have a positive or negative relationship.
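A minimal sketch with statsmodels (already listed among the libraries above) shows where the coefficients, p-values, and R-squared appear; the house-price style data below is made up purely for illustration.

# Linear regression sketch: inspect coefficients, p-values, and R-squared.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
size = rng.normal(150, 30, 100)                 # toy "house size" feature
price = 50 + 2 * size + rng.normal(0, 20, 100)  # toy "price" with noise

X = sm.add_constant(size)       # adds the intercept term
model = sm.OLS(price, X).fit()
print(model.params)             # coefficients (intercept and slope)
print(model.pvalues)            # p-values: is each coefficient significantly non-zero?
print(model.rsquared)           # R-squared: share of variance explained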
”Basically, an interaction is when the effect of one factor (input variable) on the
dependent variable (output variable) differs among levels of another factor.”
109. What is selection bias?
“Selection (or ‘sampling’) bias occurs in an ‘active,’ sense when the sample data that
is gathered and prepared for modeling has characteristics that are not representative
of the true, future population of cases the model will see. That is, active selection bias
occurs when a subset of the data are systematically (i.e., non-randomly) excluded
from analysis.”
Logistic Regression is also called the logit model. It is a method to forecast the binary
outcome from a linear combination of predictor variables.
113. Name three types of biases that can occur during sampling
In the sampling process, there are three types of biases, which are:
● Selection bias
● Under coverage bias
● Survivorship bias
Prior probability is the proportion of the dependent variable in the data set, while
the likelihood is the probability of classifying a given observation in the presence of
some other variable.
119. List out the libraries in Python used for Data Analysis and Scientific
Computations.
● SciPy
● Pandas
● Matplotlib
● NumPy
● Scikit-learn
● Seaborn
Power analysis is an integral part of experimental design. It helps you determine the
sample size required to detect an effect of a given size from a cause with a specific
level of assurance. It also lets you estimate the probability of detecting that effect
under a given sample-size constraint.
The Naive Bayes Algorithm model is based on the Bayes Theorem. It describes the
probability of an event. It is based on prior knowledge of conditions which might be
related to that specific event.
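A minimal scikit-learn sketch of a Gaussian Naive Bayes classifier follows; the iris data is used purely for illustration.

# Naive Bayes sketch: each feature is assumed independent given the class.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
clf = GaussianNB().fit(X, y)
print(clf.predict(X[:5]))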
125. State the difference between the expected value and mean value
There are not many differences, but both of these terms are used in different
contexts. Mean value is generally referred to when you are discussing a probability
distribution whereas expected value is referred to in the context of a random
variable.
A/B testing is used to conduct random experiments with two variants, A and B. The
goal of this testing method is to find out which changes to a web page maximise or
increase the outcome of a strategy.
Bagging
The bagging method trains similar learners on small sample populations drawn from
the data and combines their predictions, which helps produce more accurate results.
Boosting
Boosting is an iterative method that adjusts the weight of an observation depending
on the last classification. Boosting decreases the bias error and helps you build
strong predictive models. A sketch of both methods follows this answer.
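The sketch below uses scikit-learn; the dataset and the specific model choices are assumptions for illustration.

# Bagging vs. boosting sketch on toy data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(n_estimators=50, random_state=0)   # many learners on bootstrap samples
boosting = GradientBoostingClassifier(random_state=0)          # learners added iteratively, reweighting errors

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())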
Artificial Neural Networks (ANN) are a special set of algorithms that have
revolutionised machine learning. They adapt to changing input, so the network
generates the best possible result without the output criteria having to be
redesigned.
Back-propagation is the essence of neural net training. It is the method of tuning the
weights of a neural net based on the error rate obtained in the previous epoch.
Proper tuning of the weights helps you reduce error rates and makes the model
reliable by improving its generalisation.
Random forest is a machine learning method which helps you to perform all types of
regression and classification tasks. It is also used for treating missing values and
outlier values.
136. Explain the difference between Data Science and Data Analytics
Data scientists need to slice data to extract valuable insights that a data analyst can
apply to real-world business scenarios. The main difference between the two is that
data scientists have deeper technical knowledge than business analysts and rely less
on the business-level understanding used for data visualisation and reporting.
When you conduct a hypothesis test in statistics, a p-value allows you to determine
the strength of your results. It is a number between 0 and 1; based on this value, you
can judge the strength of the specific result.
139. Explain the method to collect and analyze data to use social media to predict the
weather condition.
You can collect social media data using the Facebook, Twitter, or Instagram APIs. For
example, for Twitter, we can construct features from each tweet, such as the tweet
date, number of retweets, list of followers, etc. Then you can use a multivariate time
series model to predict the weather condition.
140. When do you need to update the algorithm in Data science?
Python will be more suitable for text analytics as it consists of a rich library known
as pandas. It allows you to use high-level data analysis tools and data structures,
while R doesn't offer this feature.
Statistics help Data scientists to get a better idea of customer's expectation. Using the
statistical method Data Scientists can get knowledge regarding consumer interest,
behavior, engagement, retention, etc. It also helps you to build powerful data models
to validate certain inferences and predictions.
● Pytorch
● Microsoft Cognitive Toolkit
● TensorFlow
● Caffe
● Chainer
● Keras
145.Explain Auto-Encoder
Autoencoders are learning networks. They help you transform inputs into outputs
with the fewest possible errors, which means the output will be as close to the input
as possible.
Boltzmann machines are a simple learning algorithm. It helps you to discover those
features that represent complex regularities in the training data. This algorithm
allows you to optimize the weights and the quantity for the given problem.
147. Explain why Data Cleansing is essential and which method you use to maintain
clean data
Dirty data often leads to incorrect insights, which can damage the prospects of any
organisation. For example, suppose you want to run a targeted marketing campaign,
but your data incorrectly tells you that a specific product will be in demand with
your target audience; the campaign will fail.
A skewed distribution occurs when the data is concentrated on one side of the plot,
whereas a uniform distribution is identified when the data is spread equally across
the range.
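A small NumPy sketch contrasting the two follows; the specific distributions chosen are examples only.

# Skewed vs. uniform sketch: compare where the mass of each sample falls.
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=10_000)   # most values near 0, long right tail
uniform = rng.uniform(0, 1, size=10_000)           # values spread evenly across the range

print(np.mean(skewed), np.median(skewed))    # mean > median indicates right skew
print(np.mean(uniform), np.median(uniform))  # mean is close to the median for a uniform spread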