
Data Science Interview Questions

1. What do you mean when you say "normal distribution"?

Gaussian Distribution is another name for the Normal Distribution. It is a probability distribution that is symmetric about the mean: values close to the mean occur most frequently, and the frequency of occurrence falls off the further a value lies from the mean, producing the familiar bell-shaped curve.

2. How would you describe linear regression to someone who isn't familiar with the
subject?

Linear regression is a statistical approach for determining whether two variables have a linear relationship. By linear relationship, we mean that a change in one variable is associated with a proportional change in the other: with a positive slope, an increase in one variable goes with an increase in the other, and with a negative slope it goes with a decrease. Based on this linear relationship, we build a model that predicts the value of one variable from the other.

3. How will you deal with missing data values?

There are numerous approaches to dealing with missing values in a dataset.


● Dropping the missing values.
● Deleting the affected observation (not always recommended).
● Replacing the value with the observation's mean, median, or mode.
● Using regression to predict the value.
● Using clustering to find a suitable replacement value.
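
As a rough illustration of the simplest of these options in pandas (the dataframe and column names below are made up):

import pandas as pd
import numpy as np

# hypothetical dataframe with missing values
df = pd.DataFrame({"age": [25, np.nan, 31, 40], "salary": [50000, 62000, np.nan, 58000]})

df_dropped = df.dropna()                            # delete observations with missing values
df_imputed = df.fillna(df.mean(numeric_only=True))  # replace with the column mean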

4. How will you verify if the items present in list A are present in series B?

We will use the isin() function. For this, we create two series s1 and s2 –

import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([4, 5, 6, 7, 8])
s1[s1.isin(s2)]

5. How to find the positions of numbers that are multiples of 4 from a series?
For finding the multiples of 4, we will use NumPy's argwhere() function. First, we will
create a series of 10 numbers –

import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
np.argwhere(s1.values % 4 == 0)
Output > [[3], [7]]

6. What is the difference between KNN and K-means clustering?

KNN is a supervised learning algorithm: labeled data is required in order to train it. K-means is an unsupervised learning method that searches for inherent patterns (clusters) in unlabeled data. The K in KNN is the number of nearest neighbours considered when classifying a new data point, whereas the K in K-means specifies the number of centroids (clusters).

7. What different types of Ensemble learning are there?

The following are examples of several types of Ensemble learning.


● Bagging: It trains simple learners on bootstrapped samples of the data and combines their predictions, typically by averaging or majority vote.
● Boosting: It iteratively adjusts the weight of each observation based on the errors of the previous learner before making the final outcome prediction, thereby focusing later learners on the observations that were previously misclassified.
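
A rough scikit-learn sketch of both ideas, using a synthetic dataset and illustrative settings only:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)    # decision trees on bootstrap samples
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # iteratively reweights observations

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())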

8. What is ensemble learning, and how does it work?

Ensemble learning is a method of combining a diverse group of learners (individual models) with one another. It helps to improve the model's stability and predictive power.

9. What is Data Science's Deep Learning?

Deep Learning is the branch of machine learning in Data Science whose models are loosely inspired by the functioning of the human brain. It is a machine learning paradigm built on multi-layered neural networks.

10. In the field of data science, what is an artificial neural network?


In data science, an artificial neural network is a set of algorithms inspired by
biological neural networks and designed to adapt to changes in input to produce the
best possible output. It aids in the generation of the best possible results without
requiring the output methods to be redesigned.

11. What qualifications are required to become a certified Data Scientist?

Answer: To become a certified Data Scientist, you'll need the following skills:
● Knowledge of built-in data types such as lists, tuples, sets, and dictionaries.
● Expertise with N-dimensional NumPy arrays.
● Ability to work with Pandas dataframes.
● Comfort with high-performance element-wise vector operations.
● Understanding of matrix operations on NumPy arrays.

12. What are hyperparameters, and how do you use them?

A hyperparameter is a type of parameter whose value is set before the learning process begins; it defines the network's training requirements and structure. Examples include the number of hidden units, the learning rate, and the number of epochs.

13. What is the purpose of the cost function?

Cost functions are a technique for measuring how well a model performs. The cost function accounts for the losses or errors at the output layer; during backpropagation these errors are propagated backwards through the neural network and used to update the weights.
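
As a minimal illustration, mean squared error is one common cost function; the NumPy sketch below scores a few made-up predictions:

import numpy as np

def mse(y_true, y_pred):
    # average of squared differences between targets and predictions
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 8.0])
print(mse(y_true, y_pred))  # 0.5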

14. What does it mean to iterate in Data Science? Could you give an example?

In Data Science, an iteration is one pass over a single batch of data within an epoch. For example, if there are 50,000 images and the batch size is 100, one epoch will take around 500 iterations.

15. In data science, what is a batch?


A batch is a subset of the dataset; the full dataset is split into several batches to feed information into the system. Batches are used when the developer cannot feed the complete dataset into the neural network at once.

16. In data science, what is an epoch?

In Data Science, an epoch is one complete pass of the learning algorithm over the entire training dataset.

17. What are CNN's different layers?

A CNN has four main types of layers (a minimal Keras sketch follows the list):
● Convolutional Layer: This layer slides a set of small filter windows over the input image to produce feature maps.
● ReLU Layer: This layer aids in the creation of non-linearity in the network by
converting negative pixels to zero, resulting in a corrected feature map.
● Pooling Layer: This layer decreases the feature map's dimensionality.
● Fully Connected Layer: This layer recognises and categorises the image's
objects.
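
As a rough illustration only, a minimal Keras model stacking these four layer types (the input shape and layer sizes are arbitrary assumptions):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # convolution + ReLU
    tf.keras.layers.MaxPooling2D((2, 2)),                                            # pooling layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),                                 # fully connected layer
])
model.summary()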

18. What exactly is an RNN?

RNN stands for Recurrent Neural Network. RNNs are a type of artificial neural network designed to work on sequences of data, such as financial market data, time series, and many other things. They extend feedforward networks with recurrent connections so that information from earlier steps in the sequence can influence the current output.

19. What is CNN's Pooling?

Pooling is a technique used to reduce the spatial dimensions of feature maps in a CNN. It performs downsampling for dimensionality reduction and creates pooled feature maps. In a CNN, pooling works by sliding a small window over the input feature map and summarising each region.

20. What are the various LSTM steps?


The following are the several steps in the LSTM process.
● Step 1: The network assists in determining what should be remembered and
what should be forgotten.
● Step 2: The cell state values that can be updated are selected.
● Step 3: The network determines what can be produced with the current
output.

21. What does LSTM stand for in full? What is its purpose?

LSTM stands for Long Short-Term Memory. It is a type of recurrent neural network that, by design, is capable of learning long-term dependencies and recalling information over long periods of time.

22. What are exploding gradients?

Exploding gradients occur during the training of an RNN when the error gradients grow at an exponential rate. The accumulated gradient causes the network to apply huge weight updates, leading to numerical overflow and NaN values.

23. What are vanishing gradients, and what do they mean?

Vanishing gradients occur when the gradients computed during an RNN's training become too small. Vanishing gradients lead to poor performance, low accuracy, and very long training times.

24. What are the various Deep Learning Framework types?

The following are some of the different types of Deep Learning Frameworks:
● Caffe
● Keras
● TensorFlow
● Pytorch
● Chainer
● Microsoft Cognitive Toolkit
25. What is the purpose of an Activation function?

An activation function introduces nonlinearity into the neural network, which is what allows it to learn complex functions. Without an activation function, the network could only compute linear combinations of its inputs, no matter how many layers it has. The activation function therefore lets artificial neurons produce outputs that are complex, nonlinear functions of their inputs.
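
A minimal NumPy sketch of one common activation function (ReLU) applied to a layer's output; the weights and inputs below are made up:

import numpy as np

def relu(z):
    return np.maximum(0, z)

x = np.array([1.0, -2.0, 3.0])
W = np.array([[0.2, -0.5, 0.1],
              [0.4, 0.3, -0.2]])
b = np.array([0.1, -0.1])

linear_output = W @ x + b               # a purely linear combination
activated_output = relu(linear_output)  # nonlinearity applied element-wise
print(linear_output, activated_output)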

26. What are the different Machine Learning Libraries and what are their
advantages?

The following are the various machine learning libraries and their advantages.
● NumPy: a scientific computing library.
● Statsmodels: a time-series analysis tool.
● Pandas: a tool for analysing tabular data.
● Scikit-learn: a data modelling and pre-processing toolkit.
● TensorFlow: a deep learning framework.
● Regular Expressions (re): used for processing text.
● PyTorch: a deep learning framework.
● NLTK: a text processing library.

27. What are Auto-Encoders and How Do They Work?

Auto-encoders are learning networks designed to reproduce their inputs at the output with the fewest possible errors; they aim to keep the output as close to the input as possible. The autoencoder process requires constructing layers between the input and output, and these hidden layers are kept as small as possible for faster processing.

28. In Data Science, what is batch normalisation?

In Data Science, batch normalisation is a technique for improving the performance and stability of a neural network. It normalises each layer's inputs so that the mean activation stays close to 0 and the standard deviation close to 1.
29. What is Data Science Dropout?

Dropout is a tool in Data Science used to randomly drop hidden and visible units of a network during training. Randomly removing a fraction of nodes (commonly around 20%) helps prevent overfitting of the data.

30. Why is Tensorflow a top priority when it comes to learning Data Science?

TensorFlow has a high priority in learning Data Science because it supports programming languages such as C++ and Python. As a result, it enables various data science processes to compile and complete faster, and within the required time, compared with traditional libraries such as Keras and Torch. For faster input, editing, and analysis of data, TensorFlow supports computational units such as the CPU and GPU.

31. What exactly are tensors?

Tensors are mathematical objects that represent higher-dimensional arrays of data inputs, such as numbers and encoded characters, that are fed into a neural network; a tensor's rank describes its number of dimensions.

32. What is a Computational Graph, and how does it work?

A computational graph is the graphical representation used by TensorFlow. It is a large network of nodes, each of which symbolises a different mathematical operation, and the edges between these nodes are tensors. Because data flows through the graph in this way, the computational graph is also known as a DataFlow Graph.

33. What are the essential elements of GAN?

GAN is made up of two essential components. The following are some of them:
● Generator: The Generator functions as a forger, creating forged copies.
● Discriminator: The Discriminator acts as a distinguisher between fraudulent
and genuine copies.
34. Please describe what a Boltzmann Machine is and how it works.

A Boltzmann Machine has a simple learning algorithm that allows it to discover interesting features in the training data that represent complicated regularities. It is mostly used to optimise the weights and quantities for a given problem.
In networks with numerous layers of feature detectors, the Boltzmann Machine's simple learning process is quite slow.

35. What are your thoughts on autoencoders?

Autoencoders are simple learning networks that convert inputs into outputs with the
least amount of error feasible. It indicates that the outputs are fairly similar to the
inputs.
Between the input and the output, a few layers are added, with each layer's size
being smaller than the input layer. An autoencoder takes unlabeled input and
encodes it in order to reconstruct the output.

36. Please explain the concept of Gradient Descent.

A gradient measures how much the output of a function changes in response to changes made to its inputs; in a neural network it measures the change in the error with respect to each weight. A gradient can also be thought of as the slope of a function.
The term "gradient descent" refers to descending to the bottom of a valley, the opposite of climbing a hill. It is a minimisation algorithm used to reduce a cost (loss) function by repeatedly stepping in the direction of the negative gradient.
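
A minimal sketch of gradient descent on the simple function f(w) = (w - 3)^2, whose gradient is 2(w - 3); the starting point and learning rate are arbitrary choices:

w = 10.0            # arbitrary starting point
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient   # step against the slope

print(round(w, 4))   # converges towards the minimum at w = 3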

37. What are outlier values, and how should they be handled?

Outlier values, or simply outliers, are statistical data points that do not belong to a
certain population. An outlier value is an unusual observation that differs
significantly from the other values in the collection.

Outlier values are commonly treated in one of two ways:


● To adjust the value so that it falls inside a certain range
● To simply take the value away

38. Please provide an explanation of Recommender Systems as well as an application.

Recommender Systems are a type of information filtering system that predicts a
user's preferences or ratings for a certain product.
The Amazon product recommendations section is an example of a recommender
system in action. Items in this section are based on the user's search history and
previous orders.

39. What are the differences between linear and logistic regression?
Linear regression is a statistical approach in which the score of one variable Y is
predicted based on the score of another variable X, known as the predictor variable.
The criterion variable is represented by the Y variable.
Logistic regression, often known as the logit model, is a statistical technique for
predicting a binary result using a linear combination of predictor variables.

40. Is it possible to compare the validation and test sets?

A validation set is a subset of the training set that is used to select parameters and
avoid overfitting in the machine learning model being built. A test set, on the other
hand, is used to evaluate or test the performance of a trained machine learning
model.

41. Explain the concepts of Eigenvectors and Eigenvalues.

Eigenvectors aid in the comprehension of linear transformations. In data analysis, they are often calculated for a correlation or covariance matrix.
In other words, eigenvectors are the directions along which a linear transformation
compresses, flips, or stretches the data.
Eigenvalues can be thought of as either the strength of the transformation in the
eigenvectors' direction or the components that cause compression.

42. Which of Python and R would you use for text analytics, and why?

Python will outperform R in text analytics for the following reasons:


● Pandas is a Python package that provides simple data structures as well as
high-performance data analysis tools.
● Python is more efficient in all sorts of text analytics.
● R is more suited to machine learning than text analysis.
43. Please describe the purpose of A/B testing.

A/B testing is a statistical hypothesis test used in a randomised experiment with two variants, A and B. The purpose of A/B testing is to detect changes to a webpage that maximise the likelihood of an outcome of interest.

A/B Testing is a highly reliable method for determining the best online marketing
and promotional strategies for a company. It can be used to test anything from sales
emails to search ads and website copy.

44. What are the differences between univariate, bivariate, and multivariate analysis?

Univariate, bivariate, and multivariate analyses are the three types of analyses.
● Univariate analysis refers to descriptive statistical analysis involving a single variable; for example, a pie chart of a single variable.
● Bivariate analysis examines the relationship between two variables; for example, a scatterplot of sales against spending.
● Multivariate analysis uses more than two variables and explains the effect of several variables on the responses.

45. How do you choose the best K value for K-means clustering?

The elbow technique and the kernel approach are two ways of determining the number of centroids for a given clustering problem. A quick approximation is to take K to be roughly the square root of half the number of data points, i.e. K ≈ √(n/2). While this rule of thumb isn't very accurate, it is faster than the other methods.
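
A rough scikit-learn sketch of the elbow technique on a synthetic dataset: the inertia (within-cluster sum of squares) is printed for several values of K, and the "elbow" where it stops improving sharply suggests a reasonable K.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))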

46. What role does Hadoop play in Data Science?

Hadoop allows data scientists to work with vast amounts of unstructured data.
Furthermore, new Hadoop extensions like Mahout and PIG offer a variety of
functionalities for analysing and implementing machine learning algorithms on
massive data sets. As a result, Hadoop is a complete system capable of processing a
wide range of data types, making it an ideal tool for data scientists.
47. Could you provide me with some examples of NoSQL databases?

Redis, MongoDB, Cassandra, HBase, Neo4j, and other NoSQL databases are some of
the most popular.

48. What distinguishes SQL from NoSQL?

Relational Database Management Systems, or RDBMS, are the subject of SQL. This
sort of database holds structured data in the form of a table, which is organised in
rows and columns. NoSQL, on the other hand, is used with non-relational database management systems, where the data is unstructured. Services, devices, and software systems are the most common sources of structured data, whereas unstructured data is generated directly by consumers and is growing by the day.

49. What exactly do you mean when you say "data integrity"?

Data integrity allows us to define the data's accuracy as well as consistency. This
integrity must be maintained throughout the life of the product.

50. What is the definition of a foreign key?

A foreign key is a column in one table that refers to the primary key of another table. By referencing the other table's primary key, the foreign key creates a relationship between the two tables.

51. Compare and contrast the DELETE and TRUNCATE commands.

To delete specific rows from a table, use the DELETE command in conjunction with the WHERE clause. This action can be rolled back.

TRUNCATE, on the other hand, is used to delete all the rows in a table, and this
action cannot be reversed.

52. What are the different forms of joins in a table?


Some of the different joins in a table are
● Inner Join
● Left Join
● Outer Join
● Full Join
● Self Join
● Cartesian Join

53. What is the Python pickle module?

The pickle module is used to serialise and deserialise objects in Python. Pickle is used to save an object to disk: it converts an object structure into a byte stream, and can convert it back again.
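
A minimal sketch of pickling and unpickling a hypothetical Python object:

import pickle

model_params = {"learning_rate": 0.01, "epochs": 10}   # hypothetical object

with open("params.pkl", "wb") as f:
    pickle.dump(model_params, f)    # object structure -> byte stream on disk

with open("params.pkl", "rb") as f:
    restored = pickle.load(f)       # byte stream -> object structure

print(restored == model_params)     # True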

54. When it comes to recall and precision, what's the difference?

Recall is the fraction of actual positive instances that the model correctly identifies as positive, TP / (TP + FN). Precision, on the other hand, is the fraction of instances predicted positive that are genuinely positive, TP / (TP + FP). In other words, precision measures how trustworthy the positive predictions are, while recall measures how many of the true positives were found.

55. What does it mean to have the "curse of dimensionality"? What are our options
for resolving it?

There are times when the amount of variables or columns in the dataset is excessive
while evaluating it. We are, however, only allowed to extract significant variables
from the group. Consider the fact that there are a thousand features. However, we
only need to extract a few key characteristics. The ‘curse of dimensionality' refers to
the dilemma of having many features when only a few are required.
There are a variety of dimensionality reduction algorithms available, such as PCA
(Principal Component Analysis).
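
A rough scikit-learn sketch of PCA reducing a made-up 50-feature dataset to 5 components:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=200, n_features=50, random_state=0)

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (200, 5)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained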

56. What exactly is the box-cox transformation?

The Box Cox Transformation is used to convert the response variable so that the data
matches the required assumptions. This technique can be used to convert
non-normal dependent variables into normal shapes. With the aid of this
transformation, we can run a larger number of tests.
57. Why don't gradient descent algorithms always lead to the same result?

This is because they sometimes reach a local minimum or local optimum rather than the global minimum. Whether they do depends on the data, the learning rate (the speed of descent), and the starting point of the descent.

58. How do you use the p-value to calculate significance?

We compute the significance of the results after doing a hypothesis test. The p-value lies between 0 and 1. If the p-value is less than 0.05, the null hypothesis is rejected; if it is larger than 0.05, the null hypothesis cannot be rejected.

59. What distinguishes Deep Learning from Machine Learning?

Deep Learning is a subset of Machine Learning. It is a subfield of machine learning that focuses on creating algorithms that mimic the structure of the human nervous system. Deep Learning entails the use of neural networks that have been trained on massive datasets to learn patterns and then perform classification and prediction.

60. What exactly is SVM? Can you name some of the SVM kernels?

SVM stands for Support Vector Machine. SVMs are used for tasks such as classification and prediction. An SVM finds a separating plane, called a hyperplane, that separates the classes of data. The following are some of the kernels used in SVM (a minimal scikit-learn sketch follows the list):
● Polynomial Kernel
● Gaussian Kernel
● Laplace RBF Kernel
● Sigmoid Kernel
● Hyperbolic Kernel
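
A minimal scikit-learn sketch of an SVM classifier with the Gaussian (RBF) kernel on a synthetic dataset; the parameters are illustrative only:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)      # kernel could also be "poly" or "sigmoid"
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))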

61. What is a confusion matrix, and how does it work?

A confusion matrix is a table that shows how well a supervised learning system
performs. It gives a summary of categorization problem prediction outcomes. You
can use the confusion matrix to not only determine the predictor's errors, but also the
types of errors.
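
A minimal scikit-learn sketch, with made-up labels, showing how a confusion matrix is produced:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# rows correspond to the actual class, columns to the predicted class
print(confusion_matrix(y_true, y_pred))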

62. Explain the tradeoff between bias and variance.

Underfitting is a phenomenon caused by bias. This is due to the introduction of mistakes as a result of the model's oversimplification. Variance, on the other hand, is
caused by the complexity of the machine learning algorithm. In addition, the model
learns noise and other aberrations that have an impact on its overall performance.
The error will decrease as the complexity of your model increases due to the
reduction in bias. However, if the system becomes more complicated and noise is
introduced, the inaccuracy will increase. The tradeoff between bias and variance is
known as bias-variance tradeoff. Low bias and variance are important characteristics
of a good machine learning algorithm.

63. Why is Naive Bayes called Naive Bayes?

Naive Bayes assumes that the features, and the probabilities computed for them, are independent of each other. This assumption of feature independence is what makes Naive Bayes "naive".

64. What is the difference between AUC and ROC?

ROC is a curve that plots the True Positive Rate, TP/(TP + FN), against the False Positive Rate, FP/(FP + TN), at various thresholds. AUC is the area under the ROC curve: a single number that summarises how well the classifier separates the classes.

65. Explain ROC curve.

The Receiver Operating Characteristics (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR). The true positive rate is calculated as TPR = TP / (TP + FN), and the false positive rate as FPR = FP / (FP + TN), where TP = true positive, TN = true negative, FP = false positive, and FN = false negative.

66. What exactly is clustering?


Clustering is the process of categorising data points into groups. The division is
made in such a way that all of the data points in the same group are more similar to
one another than data points in other groups. Hierarchical clustering, K means
clustering, density-based clustering, fuzzy clustering, and other types of clustering
exist.

67. What data mining packages are available in R?

Popular R data mining packages include dplyr (data manipulation), ggplot2 (data visualisation), purrr (data wrangling), Hmisc (data analysis), and datapasta (data import), among others.

68. What are some of the most common data quality issues encountered when
working with Big Data?

When dealing with big data, some of the major quality issues are duplicate data,
incomplete data, inconsistent data format, incorrect data, volume of data (big data),
lack of a proper storage mechanism, and so on.

69. What exactly is a confusion matrix?

A confusion matrix is a table that visualises the performance of a classification algorithm on a set of test data with known true values.

70. What is the distinction between a tree map and a heat map?

A heat map is a type of visualisation tool that uses colour intensity to compare values across different categories. It is useful for comparing two different measures. A "tree map" is a type of chart that depicts hierarchical data or part-to-whole relationships using nested rectangles.

71. What is the difference between data disaggregation and data aggregation?

Aggregation is the process of combining multiple rows of data in a single location, from a low level to a higher level. Disaggregation, on the other hand, is the process of breaking down aggregate data to a more granular level.
72. What exactly is scientific visualisation? What distinguishes it from other
visualisation techniques?

Scientific visualisation is the graphic representation of data in order to gain insight from it. It is also referred to as visual data analysis. This aids in understanding the system, which can then be studied in previously unimaginable ways.

73. What are some of the drawbacks of visualisation?

A few disadvantages of visualisation are as follows: it provides estimates rather than exact results, different groups of people may interpret it differently, and poor design can lead to confusion.

74. Scatterplot matrices represent what kind of data?

Scatterplot matrices are the most commonly used method for visualising
multidimensional data. It is used to depict bivariate relationships between a set of
variables.

75. What exactly is a hyperbolic tree?

A hyperbolic tree, also known as a hypertree, is a method of information visualisation and graph drawing inspired by hyperbolic geometry.

76. How would you describe data operations in Pandas?

Data cleaning, data preprocessing, data transformation, data standardisation, data normalisation, and data aggregation are all common data operations in Pandas.

77. How do you define GroupBy in Pandas?

groupby is a pandas function used to group the rows of a dataframe based on the values in one or more categorical columns, so that each group can then be summarised or transformed.

78. How to convert the index of a series into a column of a dataframe?


df = df.reset_index() will convert the index to a column in a pandas dataframe.

79. What exactly is data aggregation?

Data aggregation is a process in which aggregate functions are used to obtain the
required results following a groupby. Sum, count, avg, max, and min are examples of
common aggregation functions.
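
A minimal pandas sketch combining groupby with aggregation; the dataframe and column names are made up:

import pandas as pd

df = pd.DataFrame({
    "department": ["sales", "sales", "hr", "hr"],
    "salary": [50000, 60000, 45000, 47000],
})

# mean, max and count of salary within each department
summary = df.groupby("department")["salary"].agg(["mean", "max", "count"])
print(summary)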

80. What exactly is the Pandas Index?

An index is the set of labels used to identify the rows of a pandas dataframe; by default it is a sequence of integers.

81. What Is A Recommender System?

A recommender system is today widely deployed in multiple fields like movie recommendations, music preferences, social tags, research articles, search queries
and so on. The recommender systems work as per collaborative and content-based
filtering or by deploying a personality-based approach. This type of system works
based on a person’s past behavior in order to build a model for the future. This will
predict the future product buying, movie viewing or book reading by people. It also
creates a filtering approach using the discrete characteristics of items while
recommending additional items.

82. Compare Sas, R And Python Programming?

SAS: It is one of the most widely used analytics tools, used by some of the biggest companies in the world. It has some of the best statistical functions and a graphical user interface, but it comes with a price tag and hence cannot be readily adopted by smaller enterprises.

R: The best part about R is that it is an Open Source tool and hence used generously
by academia and the research community. It is a robust tool for statistical
computation, graphical representation and reporting. Due to its open source nature
it is always being updated with the latest features and then readily available to
everybody.

Python: Python is a powerful open source programming language that is easy to learn and works well with most other tools and technologies. The best part about Python is that it has innumerable libraries and community-created modules, making it very robust. It has functions for statistical operations, model building and more.

83. Explain The Various Benefits Of R Language?

The R programming language includes a set of software suites that is used for
graphical representation, statistical computing, data manipulation and calculation.

Some of the highlights of R programming environment include the following:

● An extensive collection of tools for data analysis


● Operators for performing calculations on matrix and array
● Data analysis technique for graphical representation
● A highly developed yet simple and effective programming language
● It extensively supports machine learning applications
● It acts as a connecting link between various software, tools and datasets
● Create high quality reproducible analysis that is flexible and powerful
● Provides a robust package ecosystem for diverse needs
● It is useful when you have to solve a data-oriented problem

84. How Do Data Scientists Use Statistics?

Statistics helps Data Scientists to look into the data for patterns, hidden insights and
convert Big Data into Big insights. It helps to get a better idea of what the customers
are expecting. Data Scientists can learn about consumer behavior, interest,
engagement, retention and finally conversion all through the power of insightful
statistics. It helps them to build powerful data models in order to validate certain
inferences and predictions. All this can be converted into a powerful business
proposition by giving users what they want at precisely when they want it.

85. What Is Logistic Regression?

It is a statistical technique or model used to analyze a dataset and predict a binary outcome. The outcome has to be binary, that is, either zero or one, or a yes or no.

86. Why Is Data Cleansing Important In Data Analysis?


With data coming in from multiple sources it is important to ensure that data is good
enough for analysis. This is where data cleansing becomes extremely vital. Data
cleansing extensively deals with the process of detecting and correcting data records,
ensuring that data is complete and accurate and the components of data that are
irrelevant are deleted or modified as per the needs. This process can be deployed in
concurrence with data wrangling or batch processing.

Once the data is cleaned, it conforms to the rules of the data sets in the system.
Data cleansing is an essential part of data science because the data can be prone to
error due to human negligence, corruption during transmission or storage among
other things. Data cleansing takes a huge chunk of time and effort of a Data Scientist
because of the multiple sources from which data emanates and the speed at which it
comes.

87. Describe Univariate, Bivariate And Multivariate Analysis.?

As the name suggests these are analysis methodologies having a single, double or
multiple variables.

So a univariate analysis involves one variable and, because of this, does not deal with relationships or causes. The major aspect of univariate analysis is to summarize the data and find patterns within it to make actionable decisions.

A Bivariate analysis deals with the relationship between two sets of data. These sets
of paired data come from related sources, or samples. There are various tools to
analyze such data including the chi-squared tests and t-tests when the data are
having a correlation.

If the data can be quantified then it can be analyzed using a graph plot or a scatter
plot. The strength of the correlation between the two data sets will be tested in a
Bivariate analysis.

88. What Is Root Cause Analysis?

Root cause analysis was initially developed to analyze industrial accidents, but is
now widely used in other areas. It is basically a technique of problem solving used
for isolating the root causes of faults or problems. A factor is called a root cause if its
deduction from the problem-fault-sequence averts the final undesirable event from
recurring.
89. Explain Cross-validation.?

It is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to term a data set to test the model in the training
phase (i.e. validation data set) in order to limit problems like over-fitting, and get an
insight on how the model will generalize to an independent data set.

90. What Is Collaborative Filtering?

Collaborative filtering is the process of filtering used by most recommender systems to find patterns or information by combining perspectives, numerous data sources, and several agents.

91. Do Gradient Descent Methods At All Times Converge To Similar Points?

No, they do not, because in some cases they reach a local minimum or a local optimum point rather than the global optimum. This is governed by the data and the starting conditions.

92. What Is The Goal Of A/b Testing?

It is a statistical hypothesis test for a randomized experiment with two variables, A and B. The objective of A/B testing is to detect any changes to the web page that maximize or increase the outcome of interest.

93. What Are The Drawbacks Of Linear Models?

Some drawbacks of the linear model are:


● The assumption of linearity of the errors
● It can’t be used for count outcomes, binary outcomes
● There are overfitting problems that it can’t solve
94. What Is The Law Of Large Numbers?

It is a theorem that describes the result of performing the same experiment a large
number of times. This theorem forms the basis of frequency-style thinking. It says
that the sample mean, the sample variance and the sample standard deviation
converge to what they are trying to estimate.

95. What Are Confounding Variables?

These are extraneous variables in a statistical model that correlate directly or inversely with both the dependent and the independent variable. The estimate fails to account for the confounding factor.

96. Explain Star Schema.?

It is a traditional database schema with a central table. Satellite tables map ID’s to
physical name or description and can be connected to the central fact table using the
ID fields; these tables are known as lookup tables, and are principally useful in
real-time applications, as they save a lot of memory. Sometimes star schemas involve
several layers of summarization to recover information faster.

97. How Regularly An Algorithm Must Be Updated?

You want to update an algorithm when:


● You want the model to evolve as data streams through infrastructure
● The underlying data source is changing
● There is a case of non-stationarity

98. What Are Eigenvalues And Eigenvectors?

Eigenvectors are for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching.

99. Why Is Resampling Done?


Resampling is done in one of these cases:
● Estimating the accuracy of sample statistics by using subsets of accessible data
or drawing randomly with replacement from a set of data points
● Substituting labels on data points when performing significance tests
● Validating models by using random subsets (bootstrapping, cross-validation)

100. Explain Selective Bias.?

Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample.

101. What Are The Types Of Biases That Can Occur During Sampling?

● Selection bias
● Under coverage bias
● Survivorship bias

102. How To Work Towards A Random Forest?

● The underlying principle of this technique is that several weak learners combined provide a strong learner. The steps involved are:
● Build several decision trees on bootstrapped training samples of the data
● Each time a split is considered in a tree, choose a random sample of m predictors as split candidates out of all p predictors
● Rule of thumb: at each split, m ≈ √p
● Predictions: take the majority vote (for classification) or the average (for regression), as in the sketch below.
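
The sketch below is a rough illustration with scikit-learn on a synthetic dataset; max_features="sqrt" mirrors the m ≈ √p rule of thumb:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))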

103. What is the Central Limit Theorem and why is it important?

“Suppose that we are interested in estimating the average height among all people. Collecting data for every person in the world is impossible. While we can’t obtain a height measurement from everyone in the population, we can still sample some people. The question now becomes, what can we say about the average height of the entire population given a single sample. The Central Limit Theorem addresses this question exactly.” In essence, the theorem states that, as the sample size grows, the distribution of the sample mean approaches a normal distribution centred on the population mean, regardless of the shape of the underlying population distribution.
104. What is sampling? How many sampling methods do you know?

“Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.” Common sampling methods include simple random sampling, systematic sampling, stratified sampling, and cluster sampling.

105. What is the difference between type I vs type II error?

“A type I error occurs when the null hypothesis is true, but is rejected. A type II error
occurs when the null hypothesis is false, but erroneously fails to be rejected.”

106. What is linear regression? What do the terms p-value, coefficient, and r-squared
value mean? What is the significance of each of these components?

A linear regression is a good tool for quick predictive analysis: for example, the price of a house depends on a myriad of factors, such as its size or its location. In order to see the relationship between these variables, we need to build a linear regression, which predicts the line of best fit between them and can help conclude whether or not these two factors have a positive or negative relationship. In this context, a coefficient is the estimated slope of that line, i.e. how much the outcome changes for a one-unit change in a predictor; the p-value indicates whether that coefficient is statistically significantly different from zero; and the r-squared value measures the proportion of the variance in the outcome that the model explains.

107. What are the assumptions required for linear regression?

There are four major assumptions:


1. There is a linear relationship between the dependent variables and the
regressors, meaning the model you are creating actually fits the data,
2. The errors or residuals of the data are normally distributed and independent
from each other,
3. There is minimal multicollinearity between explanatory variables, and
4. Homoscedasticity. This means the variance around the regression line is the
same for all values of the predictor variable.

108. What is a statistical interaction?

”Basically, an interaction is when the effect of one factor (input variable) on the
dependent variable (output variable) differs among levels of another factor.”
109. What is selection bias?

“Selection (or ‘sampling’) bias occurs in an ‘active,’ sense when the sample data that
is gathered and prepared for modeling has characteristics that are not representative
of the true, future population of cases the model will see. That is, active selection bias
occurs when a subset of the data are systematically (i.e., non-randomly) excluded
from analysis.”

110. What is an example of a data set with a non-Gaussian distribution?

“The Gaussian distribution is part of the Exponential family of distributions, but there are a lot more of them, with the same sort of ease of use, in many cases, and if the person doing the machine learning has a solid grounding in statistics, they can be utilized where appropriate.” Common examples of non-Gaussian data sets include count data following a Poisson distribution, waiting times following an exponential distribution, and binary outcomes following a Bernoulli distribution.

111. What is the Binomial Probability Formula?

“The binomial distribution consists of the probabilities of each of the possible numbers of successes in N trials for independent events that each have a probability of π (the Greek letter pi) of occurring.” The probability of exactly x successes is P(x) = N! / (x!(N − x)!) · π^x · (1 − π)^(N − x).

112. What is logistic regression in Data Science?

Logistic Regression is also called the logit model. It is a method to forecast the binary
outcome from a linear combination of predictor variables.
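
A minimal scikit-learn sketch of logistic regression on a synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression().fit(X, y)
print(model.predict(X[:5]))        # predicted classes (0 or 1)
print(model.predict_proba(X[:5]))  # predicted probabilities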

113. Name three types of biases that can occur during sampling

In the sampling process, there are three types of biases, which are:
● Selection bias
● Under coverage bias
● Survivorship bias

114. Discuss Decision Tree algorithm


A decision tree is a popular supervised machine learning algorithm. It is mainly
used for Regression and Classification. It breaks down a dataset into smaller subsets.
The decision tree is able to handle both categorical and numerical data.

115. What is Prior probability and likelihood?

Prior probability is the proportion of the dependent variable in the data set, while the likelihood is the probability of classifying a given observation in the presence of some other variable.

116. Explain Recommender Systems?

It is a subclass of information filtering techniques. It helps you to predict the preferences or ratings which users are likely to give to a product.

117. Name three disadvantages of using a linear model

Three disadvantages of the linear model are:


● The assumption of linearity of the errors.
● You can't use this model for binary or count outcomes
● There are plenty of overfitting problems that it can't solve

118. Why do you need to perform resampling?

Resampling is done in below-given cases:


● Estimating the accuracy of sample statistics by drawing randomly with replacement from a set of data points, or by using subsets of accessible data
● Substituting labels on data points when performing necessary tests
● Validating models by using random subsets

119. List out the libraries in Python used for Data Analysis and Scientific
Computations.
● SciPy
● Pandas
● Matplotlib
● NumPy
● Scikit-learn
● Seaborn

120. What is Power Analysis?

The power analysis is an integral part of the experimental design. It helps you to
determine the sample size required to find out the effect of a given size from a cause
with a specific level of assurance. It also allows you to work at a particular probability level within a sample size constraint.

121. Explain Collaborative filtering

Collaborative filtering is used to search for patterns by combining viewpoints, multiple data sources, and various agents.

122. What is bias?

Bias is an error introduced into your model because of the oversimplification of a machine learning algorithm. It can lead to underfitting.

123. Discuss 'Naive' in a Naive Bayes algorithm?

The Naive Bayes algorithm is based on the Bayes Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to that event. It is called "naive" because it assumes that all features are independent of each other, which is rarely true in real data.

124. What is a Linear Regression?

Linear regression is a statistical method where the score of a variable 'A' is predicted from the score of a second variable 'B'. B is referred to as the predictor variable and A as the criterion variable.

125. State the difference between the expected value and mean value
There are not many differences, but both of these terms are used in different
contexts. Mean value is generally referred to when you are discussing a probability
distribution whereas expected value is referred to in the context of a random
variable.

126. What is the aim of conducting A/B Testing?

A/B testing is used to conduct random experiments with two variants, A and B. The goal of this testing method is to find changes to a web page that maximize or improve the outcome of a strategy.

127. What is Ensemble Learning?

Ensemble learning is a method of combining a diverse set of learners together to improve the stability and predictive power of the model.

Two types of Ensemble learning methods are:

Bagging
The bagging method helps you to train similar learners on small sample populations and combine their outputs, which produces more robust predictions.

Boosting
Boosting is an iterative method which allows you to adjust the weight of an
observation depending upon the last classification. Boosting decreases the bias error
and helps you to build strong predictive models.

128. Explain Eigenvalue and Eigenvector

Eigenvectors are for understanding linear transformations. Data scientists typically calculate the eigenvectors for a covariance or correlation matrix. Eigenvectors are the directions along which a linear transformation acts by compressing, flipping, or stretching; eigenvalues measure the amount of stretching or compression along each of those directions.
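
A minimal NumPy sketch computing the eigenvalues and eigenvectors of a small, made-up symmetric matrix:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # how much the transformation stretches along each direction
print(eigenvectors)   # columns are the eigenvectors (the directions themselves)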

129. Define the term cross-validation

Cross-validation is a validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent dataset. This method is used in settings where the objective is forecasting, and one needs to estimate how accurately a model will perform in practice.
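
A minimal scikit-learn sketch of 5-fold cross-validation on a synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())   # estimate of how the model generalizes to unseen data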

130. Explain the steps for a Data analytics project

The following are important steps involved in an analytics project:


● Understand the Business problem
● Explore the data and study it carefully.
● Prepare the data for modeling by finding missing values and transforming
variables.
● Start running the model and analyze the Big data result.
● Validate the model with a new data set.
● Implement the model and track the result to analyze the performance of the
model for a specific period.

131. Discuss Artificial Neural Networks

Artificial Neural networks (ANN) are a special set of algorithms that have
revolutionized machine learning. It helps you to adapt according to changing input.
So the network generates the best possible result without redesigning the output
criteria.

132. What is Back Propagation?

Back-propagation is the essence of neural net training. It is the method of tuning the
weights of a neural net depending upon the error rate obtained in the previous
epoch. Proper tuning of the weights helps you to reduce error rates and to make the model reliable by increasing its generalization.

133. What is a Random Forest?

Random forest is a machine learning method which helps you to perform all types of
regression and classification tasks. It is also used for treating missing values and
outlier values.

134. What is the importance of having a selection bias?


Selection Bias occurs when there is no specific randomization achieved while picking
individuals or groups or data to be analyzed. It suggests that the given sample does
not exactly represent the population which was intended to be analyzed.

135. What is the K-means clustering method?

K-means clustering is an important unsupervised learning method. It is the technique of classifying data into a certain number of clusters, called K clusters. It is deployed for grouping data to find similarity within the data.

136. Explain the difference between Data Science and Data Analytics

Data Scientists need to slice data to extract valuable insights that a data analyst can
apply to real-world business scenarios. The main difference between the two is that
the data scientists have more technical knowledge than business analysts. Moreover,
they don't need an understanding of the business required for data visualization.

137. Explain p-value?

When you conduct a hypothesis test in statistics, a p-value allows you to determine
the strength of your results. It is a number between 0 and 1; based on its value, you can judge the strength of the specific result.

138. Define the term deep learning

Deep Learning is a subtype of machine learning. It is concerned with algorithms inspired by the structure and function of the brain, known as artificial neural networks (ANNs).

139. Explain the method to collect and analyze data to use social media to predict the
weather condition.

You can collect social media data using the Facebook, Twitter, or Instagram APIs. For example, for Twitter, we can construct features from each tweet such as the tweet date, number of retweets, and list of followers. Then you can use a multivariate time series model to predict the weather condition.
140. When do you need to update the algorithm in Data science?

You need to update an algorithm in the following situations:

● You want your data model to evolve as data streams through the infrastructure
● The underlying data source is changing
● There is non-stationarity

141. What is Normal Distribution

A normal distribution is the spread of a continuous variable across a normal curve, i.e. in the shape of a bell curve. You can consider it a continuous probability distribution that is widely used in statistics. The normal distribution curve is useful for analysing variables and their relationships.

142. Which language is best for text analytics? R or Python?

Python will be more suitable for text analytics as it consists of a rich library known
as pandas. It allows you to use high-level data analysis tools and data structures,
while R doesn't offer this feature.

143. Explain the benefits of using statistics by Data Scientists

Statistics help Data Scientists to get a better idea of customers' expectations. Using the
statistical method Data Scientists can get knowledge regarding consumer interest,
behavior, engagement, retention, etc. It also helps you to build powerful data models
to validate certain inferences and predictions.

144. Name various types of Deep Learning Frameworks

● Pytorch
● Microsoft Cognitive Toolkit
● TensorFlow
● Caffe
● Chainer
● Keras
145. Explain Auto-Encoder

Autoencoders are learning networks. They help you to transform inputs into outputs with as few errors as possible. This means that the output will be as close to the input as possible.

146. Define Boltzmann Machine

A Boltzmann machine uses a simple learning algorithm. It helps you to discover features that represent complex regularities in the training data. This algorithm allows you to optimise the weights and the quantities for the given problem.

147. Explain why Data Cleansing is essential and which method you use to maintain
clean data

Dirty data often leads to incorrect insights, which can damage the prospects of any organization. For example, suppose you want to run a targeted marketing campaign, but your data incorrectly tells you that a specific product will be in demand with your target audience; the campaign will fail. Common methods for maintaining clean data include removing duplicates, handling missing values, standardising formats, and validating records against defined rules.

148. What is skewed Distribution & uniform distribution?

Skewed distribution occurs when data is concentrated on one side of the plot, whereas a uniform distribution is identified when the data is spread equally across the range.

149. When does underfitting occur in a statistical model?

Underfitting occurs when a statistical model or machine learning algorithm is not able to capture the underlying trend of the data.

150. What is reinforcement learning?

Reinforcement Learning is a learning mechanism for mapping situations to actions, with the goal of maximising a numerical reward signal. In this method, the learner is not told which action to take but instead must discover which actions yield the maximum reward. The method is based on a reward/penalty mechanism.
