
Data Analyst Question-Answers

Data analysis is the structured process of collecting, cleaning, transforming, and assessing data to derive insights for business decision-making. Key responsibilities of a data analyst include data collection, interpretation, cleaning, and reporting, while they may encounter challenges such as poor data quality and representation issues. Essential skills for data analysts encompass analytical abilities, proficiency in programming and databases, and knowledge of statistical methods.

1. What is Data Analysis, in brief?

1) Data analysis is the structured process of ingesting, cleaning, transforming, modeling,
and assessing data to discover useful information and provide insights, which can be
used to drive revenue. Its purpose is to extract useful information from data and to
make business decisions based on that analysis.
2) Data is first collected from varied sources. Because the data is raw, it has to be
cleaned and processed to handle missing values and to remove anything that is out of
scope.
3) After pre-processing, the data can be analyzed with the help of models.
4) The last step is reporting: the output is converted to a format that serves a
non-technical audience as well as the analysts.

Process of Data Analysis

 Collect Data: The data is collected from various sources and stored so that it can
be cleaned and prepared. In this step, missing values and outliers are handled.
 Analyse Data: Once the data is ready, the next step is to analyze it. A model is
run repeatedly for improvements. Then, the model is validated to check whether it
meets the business requirements.
 Create Reports: Finally, the model is implemented and the resulting reports
are passed on to the stakeholders.

2.What are the important responsibilities of a data analyst?


 Collect and interpret data from multiple sources and analyze results.
 Filter and “clean” data gathered from multiple sources.
 Offer support to every aspect of data analysis.
 Analyze complex datasets and identify the hidden patterns in them.
 Keep databases secured.
 Use data visualization skills to deliver comprehensive results.
 Data preparation
 Quality Assurance
 Report generation and preparation
 Troubleshooting
 Data extraction
 Trends interpretation

3. What are the problems that a Data Analyst can encounter while performing data
analysis?

This is a critical data analyst interview question to be aware of. A Data Analyst can
confront the following issues while performing data analysis:

 Presence of duplicate entries and spelling mistakes. These errors can hamper data
quality.
 Poor quality data acquired from unreliable sources. In such a case, a Data Analyst will
have to spend a significant amount of time in cleansing the data.
 Data extracted from multiple sources may vary in representation. Once the collected
data is combined after being cleansed and organized, the variations in data
representation may cause a delay in the analysis process.
 Incomplete data is another major challenge in the data analysis process. It would
inevitably lead to erroneous or faulty results.

4. What are the key requirements for becoming a Data Analyst

 Being well-versed in languages and tools such as XML, JavaScript, and ETL frameworks
 Proficiency in SQL and in databases such as MongoDB, and more
 Ability to effectively collect and analyze data
 Knowledge of database designing and data mining
 Having the ability/experience of working with large datasets

5.Write some key skills usually required for a data analyst.

Here are some key skills usually required for a data analyst:

 Analytical skills: Ability to collect, organize, and dissect data to make it meaningful.
 Mathematical and Statistical skills: Proficiency in applying the right statistical
methods or algorithms on data to get the insights needed.
 Problem-solving skills: Ability to identify issues, obstacles, and opportunities in data
and come up with effective solutions.
 Attention to Detail: Ensuring precision in data collection, analysis, and interpretation.
 Knowledge of Machine Learning: In some roles, a basic understanding of machine
learning concepts can be beneficial.

6.What are the different types of sampling techniques used by data analysts?

Sampling is a statistical method to select a subset of data from an entire dataset (population)
to estimate the characteristics of the whole population.

There are five major types of sampling methods:

 Simple random sampling
 Systematic sampling
 Cluster sampling
 Stratified sampling
 Judgmental or purposive sampling
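The first four methods above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the `customers` data are hypothetical, chosen for this example. Judgmental sampling is left out because it depends on an expert's choice rather than on a procedure.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Every item has an equal chance of being selected."""
    return random.Random(seed).sample(population, n)

def systematic_sample(population, step, start=0):
    """Take every `step`-th item, beginning at index `start`."""
    return population[start::step]

def stratified_sample(population, key, per_stratum, seed=0):
    """Draw `per_stratum` items at random from each group defined by `key`."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

def cluster_sample(clusters, n_clusters, seed=0):
    """Pick whole clusters at random and keep every item inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

# Hypothetical population: 20 customers split across two regions.
customers = [{"id": i, "region": "north" if i % 2 else "south"} for i in range(20)]
print(systematic_sample(customers, step=5))
```

Stratified sampling guarantees that every region is represented, while cluster sampling keeps or drops whole groups at once.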

7. Describe univariate, bivariate, and multivariate analysis.

1) Univariate analysis is the simplest form of data analysis, where the data being
analyzed contains only one variable.

Example – Studying the heights of players in the NBA, or the growth in the population of a
specific city in the last 50 years.

Univariate analysis can be described using Central Tendency, Dispersion, Quartiles, Bar
charts, Histograms, Pie charts, and Frequency distribution tables.

2) Bivariate analysis involves the analysis of two variables to explore relationships
and correlations between them.

Example – Analyzing the sale of ice creams based on the temperature outside, or a
gender-wise analysis of growth in the population of a specific city.

Bivariate analysis can be explained using Correlation coefficients, Linear regression,
Logistic regression, Scatter plots, and Box plots.

3) Multivariate analysis involves the analysis of three or more variables to understand the
relationship of each variable with the other variables.

Example – Analysing revenue based on several kinds of expenditure, or breaking up
population growth in a specific city by gender, income, employment type, etc.

Multivariate analysis can be performed using Multiple regression, Factor analysis,
Classification & regression trees, Cluster analysis, Principal component analysis, and
Dual-axis charts.

8.What are your strengths and weaknesses as a data analyst?

The answer to this question varies from case to case. However, some general
strengths of a data analyst may include strong analytical skills, attention to detail, proficiency
in data manipulation and visualization, and the ability to derive insights from complex
datasets. Weaknesses could include limited domain knowledge, lack of experience with
certain data analysis tools or techniques, or challenges in effectively communicating technical
findings to non-technical stakeholders.

9.What are the common problems that data analysts encounter during analysis?

The common problems that data analysts encounter during analysis are:

 Handling duplicates
 Collecting the right, meaningful data at the right time
 Handling data purging and storage problems
 Making data secure and dealing with compliance issues

10. What are the various steps involved in any analytics project?

Understanding the Problem
Understand the business problem, define the organizational goals, and plan for a lucrative
solution.

Collecting Data
Gather the right data from various sources and other information based on your priorities.

Cleaning Data
Clean the data to remove unwanted, redundant, and missing values, and make it ready for
analysis.

Exploring and Analyzing Data
Use data visualization and business intelligence tools, data mining techniques, and predictive
modeling to analyze data.

Interpreting the Results
Interpret the results to find hidden patterns and future trends, and to gain insights.

11. Explain what the criteria for a good data model are.

The criteria for a good data model include:

 It can be easily consumed
 It should scale well when the data changes significantly
 It should provide predictable performance
 It can adapt to changes in requirements

12.Which are the technical tools that you have used for analysis and presentation purposes?

As a data analyst, you are expected to know the tools mentioned below for analysis and
presentation purposes. Some of the popular tools you should know are:

MS SQL Server, MySQL
For working with data stored in relational databases

MS Excel, Tableau
For creating reports and dashboards

Python, R, SPSS
For statistical analysis, data modeling, and exploratory analysis

MS PowerPoint
For presentation, displaying the final results and important conclusions

13.Explain Data Cleaning in brief.


1) Data cleaning is also called data cleansing, and it is a core part of data wrangling.
2) Data cleansing primarily refers to the process of detecting and removing errors and
inconsistencies from the data to improve data quality. Although it contains valuable
information, an unorganized dataset is hard to navigate. Data cleansing simplifies
this by correcting unorganized data so that it stays intact, precise, and useful.
3) Ways of cleaning data:

 Create a data cleaning plan by understanding where the common errors take place and keep
all the communications open.
 Before working with the data, identify and remove the duplicates. This will lead to an easy
and effective data analysis process.
 Focus on the accuracy of the data. Set cross-field validation, maintain the value types of
data, and provide mandatory constraints.
 Normalize the data at the entry point so that it is less chaotic. You will be able to ensure
that all information is standardized, leading to fewer errors on entry.
 Removing a data block entirely
 Finding ways to fill blank data in, without causing redundancies
 Replacing missing data with mean or median values
 Making use of placeholders for empty spaces
 Segregating data according to their respective attributes.
 Breaking large chunks of data into small datasets and then cleaning them.
 Analyzing the statistics of each data column.
 Creating a set of utility functions or scripts for dealing with common cleaning tasks.
 Keeping track of all the data cleansing operations to facilitate easy addition or
removal from the datasets, if required.
 Removing unwanted observations that are not relevant to the field of
study.
 Quality Check
 Data standardisation
 Data normalisation
 Deduplication
 Data Analysis
 Exporting of data
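Three of the steps above (standardisation, deduplication, and filling missing values with the median) might look like this in practice. This is only a sketch: the `raw` records and the `clean` function are made up for illustration, and a real pipeline would usually rely on a data-frame library.

```python
from statistics import median

# Hypothetical raw records: inconsistent casing, a duplicate, and a missing age.
raw = [
    {"name": " Alice ", "city": "london", "age": 34},
    {"name": "alice", "city": "London", "age": 34},
    {"name": "Bob", "city": "PARIS", "age": None},
    {"name": "Cara", "city": "Paris", "age": 28},
]

def clean(records):
    # 1. Standardise text fields so that equal values compare equal.
    rows = [
        {"name": r["name"].strip().title(),
         "city": r["city"].strip().title(),
         "age": r["age"]}
        for r in records
    ]
    # 2. Remove duplicates while keeping the first occurrence.
    seen, unique = set(), []
    for r in rows:
        key = (r["name"], r["city"], r["age"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # 3. Fill missing ages with the median of the known ages.
    fill = median(r["age"] for r in unique if r["age"] is not None)
    for r in unique:
        if r["age"] is None:
            r["age"] = fill
    return unique

cleaned = clean(raw)
print(cleaned)
```

Standardising before deduplicating matters here: the two "Alice" rows only become identical once casing and whitespace are normalised.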

14.Explain what is collaborative filtering?


Collaborative filtering is a technique for building recommendation systems based on the
behavioral data of users. For instance, online shopping sites usually compile a list of items
under “recommended for you” based on your browsing history and previous purchases. The
crucial components of this technique are users, objects, and their interests. It is used to
broaden the options the users could have. Online entertainment applications are another
example of collaborative filtering. For example, Netflix shows recommendations based on the
user’s behavior. It follows various techniques, such as:

i) Memory-based approach

ii) Model-based approach

A good example of collaborative filtering is the “recommended for you” section on online
shopping sites, which pops up based on your browsing history.
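A memory-based approach can be sketched as follows: find users with similar ratings, then score the items the target user has not seen by the similarity-weighted ratings of those users. The `ratings` data and function names are hypothetical, and cosine similarity is just one of several possible similarity measures.

```python
from math import sqrt

# Hypothetical user -> {item: rating} data.
ratings = {
    "ann": {"book": 5, "lamp": 3, "desk": 4},
    "ben": {"book": 5, "lamp": 3, "chair": 5},
    "cat": {"book": 1, "lamp": 5, "chair": 1},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Score unseen items by the similarity-weighted ratings of other users."""
    totals, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], their)
        for item, rating in their.items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    scores = {i: totals[i] / weights[i] for i in totals if weights[i] > 0}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, scores

ranked, scores = recommend("ann", ratings)
print(ranked, scores)
```

Because "ann" rates items much like "ben" does, the predicted score for the chair is pulled toward ben's high rating rather than cat's low one.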

15. Explain “Normal Distribution.”

Normal distribution, also known as the bell curve or Gaussian distribution, is a
probability distribution that describes how the values of a variable are distributed. It is
fully described by its mean and its standard deviation. The distribution is symmetric:
most of the observations cluster around the central peak, and the probabilities for
values further away from the mean taper off equally in both directions.
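The clustering around the mean can be checked numerically with Python's built-in `statistics.NormalDist`. The mean of 170 and standard deviation of 10 below are assumed values for illustration; the proportions within one, two, and three standard deviations are the same for any normal distribution.

```python
from statistics import NormalDist

# Hypothetical example: heights with mean 170 cm and standard deviation 10 cm.
heights = NormalDist(mu=170, sigma=10)

def share_within(dist, k):
    """Probability that a value falls within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(heights, k):.4f}")
```

The printed shares are roughly 68%, 95%, and 99.7%, which is why this pattern is often called the 68-95-99.7 rule.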

16.Name the different data validation methods used by data analysts.


There are many ways to validate datasets. Some of the most commonly used data validation
methods by Data Analysts include:

 Field Level Validation – In this method, data validation is done in each field as and
when a user enters the data. It helps to correct the errors as you go.
 Form Level Validation – In this method, the data is validated after the user
completes the form and submits it. It checks the entire data entry form at once,
validates all the fields in it, and highlights the errors (if any) so that the user can
correct them.
 Data Saving Validation – This data validation technique is used during the process
of saving an actual file or database record. Usually, it is done when multiple data
entry forms must be validated.
 Search Criteria Validation – This validation technique is used to offer the user
accurate and related matches for their searched keywords or phrases. The main
purpose of this validation method is to ensure that the user’s search queries can return
the most relevant results.

17.Explain what is logistic regression?

Logistic regression is a statistical method for examining a dataset in which one or
more independent variables determine an outcome. The outcome is typically binary, that is,
it has only two possible values.
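As a rough sketch, a one-variable logistic regression can be fitted with plain gradient descent. The `hours` and `passed` data are invented for illustration, and the learning rate and epoch count are arbitrary choices; in practice a statistics or machine learning library would be used.

```python
from math import exp

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: hours studied vs. pass (1) or fail (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(hours, passed)

def predict(x):
    """Estimated probability of passing after x hours of study."""
    return sigmoid(w * x + b)

print(round(predict(2), 2), round(predict(7), 2))
```

The model outputs a probability, so a threshold (commonly 0.5) is applied to turn it into a yes/no prediction.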

18.Mention what are the missing patterns that are generally observed?

The missing patterns that are generally observed are

 Missing completely at random
 Missing at random
 Missing that depends on the missing value itself
 Missing that depends on unobserved input variable

19.Explain what is KNN imputation method?

In KNN imputation, a missing attribute value is imputed using the values of that attribute
from the k records that are most similar to the record with the missing value. The
similarity of two records is determined by using a distance function.
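A minimal sketch of the idea, assuming numeric columns, Euclidean distance, and the mean of the neighbours as the imputed value. The `people` records and the `knn_impute` function are hypothetical.

```python
from math import sqrt

def knn_impute(rows, target, k=2):
    """Fill None in column `target` with the mean of that column over the
    k nearest complete rows (Euclidean distance on the other columns)."""
    features = [c for c in rows[0] if c != target]
    complete = [r for r in rows if r[target] is not None]

    def distance(a, b):
        return sqrt(sum((a[c] - b[c]) ** 2 for c in features))

    filled = []
    for r in rows:
        if r[target] is None:
            nearest = sorted(complete, key=lambda other: distance(r, other))[:k]
            value = sum(n[target] for n in nearest) / len(nearest)
            filled.append({**r, target: value})
        else:
            filled.append(dict(r))
    return filled

# Hypothetical records; the third row is missing its income.
people = [
    {"age": 25, "years_employed": 2, "income": 30},
    {"age": 27, "years_employed": 3, "income": 34},
    {"age": 26, "years_employed": 2, "income": None},
    {"age": 55, "years_employed": 30, "income": 90},
]
result = knn_impute(people, "income", k=2)
print(result[2])
```

In real data the feature columns are usually scaled first, so that a column with large values does not dominate the distance.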

20.Mention what are the data validation methods used by data analyst?

Usually, the methods used by data analysts for data validation are

 Data screening
 Data verification

21.Explain what should be done with suspected or missing data?

 Prepare a validation report that gives information on all suspected data. It should include
details such as the validation criteria that the data failed and the date and time of occurrence
 Experienced personnel should examine the suspicious data to determine its
acceptability
 Invalid data should be flagged and replaced with a validation code
 To work on missing data, use the best analysis strategy, such as the deletion method, single
imputation methods, or model-based methods

22. Mention how to deal with multi-source problems.

To deal with multi-source problems:

 Restructure the schemas to accomplish a schema integration
 Identify similar records and merge them into a single record containing all relevant
attributes without redundancy

23.Explain what is an Outlier?

Outlier is a term commonly used by analysts for a value that appears far away
and diverges from the overall pattern in a sample. There are two types of outliers:

 Univariate
 Multivariate

24.Explain what is Hierarchical Clustering Algorithm?

A hierarchical clustering algorithm combines and divides existing groups, creating a
hierarchical structure that showcases the order in which groups are divided or merged.

25. Explain what is the K-means Algorithm?

K-means is a well-known partitioning method. Objects are classified as belonging to one of K
groups, with K chosen a priori.

In the K-means algorithm,

 The clusters are assumed to be spherical: the data points in a cluster are centered around
that cluster's centroid
 The variance/spread of the clusters is assumed to be similar: each data point is assigned
to the closest cluster
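The assign-then-update loop behind K-means can be sketched with the standard library alone. The points, the fixed iteration count, and the random choice of starting centroids are assumptions made for this illustration; production implementations add smarter initialisation and a convergence check.

```python
import random
from math import dist

def k_means(points, k, iterations=20, seed=0):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

Each point ends up in the cluster whose centroid is closest, which is exactly the "closest cluster" rule described above.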

26.Explain what is KPI, design of experiments and 80/20 rule?

KPI: It stands for Key Performance Indicator. It is a metric about a business process,
usually reported through any combination of spreadsheets, reports, or charts.

Design of experiments: It is the initial process used to split your data, sample it, and set
it up for statistical analysis.

80/20 rule: It means that 80 percent of your income comes from 20 percent of your clients.

27. Mention what is the difference between data mining and data profiling?

| Data Profiling | Data Mining |
|---|---|
| Data profiling is a process of evaluating data from an existing source and analyzing and summarizing useful information about that data. | Data mining refers to a process of analyzing the gathered information and collecting insights and statistics about the data. |
| It is also called data archaeology. | It is also known as KDD (Knowledge Discovery in Databases). |
| It is executed on structured as well as unstructured data. | Generally, it is executed on structured data. |
| It extracts the data from the existing raw data. | The data extraction process involves some computer-based methodologies and some algorithms. |
| It involves discovery and analytical techniques to collect useful information related to the data. | It involves various techniques to perform tasks such as classification, clustering, regression, association rules, and neural networks. |
| The tools used for data profiling are Microsoft Docs, IBM Information Analyzer, Melissa Data Profiler, etc. | The tools used for data mining are Orange, RapidMiner, SPSS, Rattle, Sisense, Weka, etc. |

28. Explain what is Map Reduce?

MapReduce is a framework for processing large data sets by splitting them into subsets, processing
each subset on a different server, and then combining the results obtained from each.
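The pattern can be imitated on a single machine with a word count, the classic MapReduce example. This sketch only mirrors the map, shuffle, and reduce phases; it does not use Hadoop, and the function names and input chunks are made up for illustration.

```python
from collections import defaultdict
from functools import reduce

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values of each key into a single result."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}

# Each chunk stands in for a subset handled by a different server.
chunks = ["big data big insight", "data drives insight"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

In a real cluster, the map calls run in parallel on different machines and only the grouped intermediate pairs travel over the network.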

29. What Is Linear Regression?

Linear regression is a statistical method used to find out how two variables are related to each
other. One of the variables is the dependent variable and the other one is the explanatory
variable. The process used to establish this relationship involves fitting a linear equation to
the dataset.
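For one explanatory variable, fitting the linear equation reduces to the ordinary least squares formulas for slope and intercept. The `temps` and `sales` figures below are hypothetical and deliberately lie on an exact line so that the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: outside temperature (explanatory) vs. ice cream sales (dependent).
temps = [20, 22, 25, 27, 30]
sales = [110, 120, 135, 145, 160]

slope, intercept = fit_line(temps, sales)
print(slope, intercept)
```

Here the fitted slope says how much the dependent variable changes for each one-unit change in the explanatory variable.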

30. Explain what is Clustering? What are the properties of clustering algorithms?

Clustering is the technique of identifying groups or categories within a dataset and placing
data values into those groups, thus creating clusters.

Clustering algorithms have the following properties:

 Iterative
 Hard or soft
 Disjunctive
 Flat or hierarchical

31.Explain Data Warehousing.

A data warehouse is a data storage system that collects data from various disparate sources
and stores it in a way that makes it easy to produce important business insights. Data
warehousing is the process of identifying heterogeneous data sources, sourcing data, cleaning
it, and transforming it into a manageable form for storage in a data warehouse.

32.How Do You Tackle Missing Data in a Dataset?

There are two main ways to deal with missing data in data analysis.

Imputation is a technique of creating an informed guess about what the missing data point
could be. It is used when the amount of missing data is low and there appears to be natural
variation within the available data.

The other option is to remove the data. This is usually done if data is missing at random and
there is no way to make reasonable conclusions about what those missing values might be.

33.What are some of the statistical methods that are useful for data-analyst?

Statistical methods that are useful for data analysts are

 Bayesian method
 Markov process
 Spatial and cluster processes
 Rank statistics, percentile, outliers detection
 Imputation techniques, etc.
 Simplex algorithm
 Mathematical optimization

34. What is time series analysis?

Time series analysis can be done in two domains: the frequency domain and the time domain. In
time series analysis, the output of a particular process can be forecast by analyzing
previous data with methods such as exponential smoothing and log-linear
regression.
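Simple exponential smoothing, one of the methods mentioned above, can be written in a few lines. The `demand` series and the smoothing factor `alpha=0.5` are assumptions chosen for illustration.

```python
def exponential_smoothing(series, alpha):
    """Each smoothed value blends the newest observation with the previous
    smoothed value; a higher alpha reacts faster to recent data."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical demand observed over five periods.
demand = [10, 12, 14, 13, 15]
levels = exponential_smoothing(demand, alpha=0.5)

# With this simple method, the last smoothed level is the next-period forecast.
forecast = levels[-1]
print(levels, forecast)
```

Older observations still influence the forecast, but their weight shrinks exponentially with age.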

35.Explain what is correlogram analysis?

Correlogram analysis is a common form of spatial analysis in geography. It consists of a
series of estimated autocorrelation coefficients calculated for different spatial relationships.
It can be used to construct a correlogram for distance-based data, when the raw data is
expressed as distance rather than values at individual points.

36.Explain Outlier.

An outlier is a data point that significantly differs from other similar points. It’s an
observation that lies an abnormal distance from other values in a random sample from a
population. In other words, an outlier is very much different from the “usual” data.

Depending on the context, outliers can have a significant impact on your data analysis. In
statistical analysis, outliers can distort the interpretation of the data by skewing averages and
inflating the standard deviation.

37.What are the ways to detect outliers? Explain different ways to deal with it.

Outliers can be detected in several ways, including visual methods and statistical techniques:

 Box Plots: A box plot (or box-and-whisker plot) can help you visually identify
outliers. Points that are located outside the whiskers of the box plot are often
considered outliers.
 Scatter Plots: These can be useful for spotting outliers in multivariate data.
 Z-Scores: Z-scores measure how many standard deviations a data point is from the
mean. A common rule of thumb is that a data point is considered an outlier if its z-
score is greater than 3 or less than -3.
 IQR Method: The interquartile range (IQR) method identifies as outliers any points
that fall below the first quartile minus 1.5 times the IQR or above the third quartile
plus 1.5 times the IQR.
 DBSCAN Clustering: Density-Based Spatial Clustering of Applications with Noise
(DBSCAN) is a density-based clustering algorithm, which can be used to detect
outliers in the data.
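The z-score and IQR rules above can be sketched with Python's `statistics` module. The `data` list is made up, and the quartiles come from `statistics.quantiles` with its default method, so other tools may give slightly different cut-offs.

```python
from statistics import mean, pstdev, quantiles

def z_score_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

def iqr_outliers(values):
    """Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print(z_score_outliers(data), iqr_outliers(data))
```

The z-score rule depends on the mean and standard deviation, which the outlier itself inflates, so the IQR rule is often the more robust choice for small samples.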

38.Write characteristics of a good data model.

Here are some characteristics of a good data model:


 Simplicity: A good data model should be simple and easy to interpret. It should have a
logical, clear structure that can be easily understood by both the developers and the
end-users.
 Robustness: A robust data model can handle a variety of data types and volumes. It
should be able to support new business requirements and changes without
necessitating major modifications.
 Scalability: The model should be designed in a way that it can efficiently handle
increases in data volume and user load. It should be able to accommodate growth over
time.
 Consistency: Consistency in a data model refers to the need for the model to be free
from contradiction and ambiguity. This ensures that the same piece of data does not
have multiple interpretations.
 Flexibility: A good data model can adapt to changes in requirements. It should allow
for easy modifications in structure when business requirements change.

39.How Would You Define a Good Data Model?

A good data model exhibits the following:

 Predictability: The data model should work in ways that are predictable so that its
performance outcomes are always dependable.
 Scalability: The data model’s performance shouldn’t become hampered when it is fed
increasingly large datasets.
 Adaptability: It should be easy for the data model to respond to changing business
scenarios and goals.
 Results-oriented: The organization that you work for or its clients should be able to
derive profitable insights using the model.

40.What is the difference between Data Mining and Data Analysis?

| Data Mining | Data Analysis |
|---|---|
| Used to recognize patterns in stored data. | Used to order and organize raw data in a meaningful manner. |
| Mining is performed on clean and well-documented data. | The analysis of data involves data cleaning, so the data is not present in a well-documented format. |
| Results extracted from data mining are not easy to interpret. | Results extracted from data analysis are easy to interpret. |

41. What are the important steps in the data validation process?

As the name suggests Data Validation is the process of validating data. This step mainly has
two processes involved in it. These are Data Screening and Data Verification.

 Data Screening: Different kinds of algorithms are used in this step to screen the
entire data to find out any inaccurate values.
 Data Verification: Each and every suspected value is evaluated on various use-cases,
and then a final decision is taken on whether the value has to be included in the data
or not.

42. What do you think are the criteria to say whether a developed data model is good or
not?

 A model developed for the dataset should have predictable performance. This is
required to predict the future.
 A model is said to be a good model if it can easily adapt to changes according to
business requirements.
 If the data gets changed, the model should be able to scale according to the data.
 The model developed should also be easy for clients to consume for
actionable and profitable results.

43. When do you think you should retrain a model? Is it dependent on the data?

Business data keeps changing on a day-to-day basis, but the format doesn’t change. When
a business enters a new market, sees a sudden rise in competition, or sees its
own position rising or falling, it is recommended to retrain the model. In short, whenever
the business dynamics change, the model should be retrained on the changing
behaviors of customers.

44. Can you mention a few problems that data analysts usually encounter while
performing the analysis?

The following are a few problems that are usually encountered while performing data
analysis.

 Duplicate entries and spelling mistakes reduce data quality.
 If you are extracting data from a poor source, you
will have to spend a lot of time cleaning the data.
 When you extract data from several sources, the data may vary in representation. When
you combine data from these sources, the variation in
representation can result in a delay.
 Lastly, incomplete data can make it difficult to perform the analysis.

45. What is the KNN imputation method?

This method imputes a missing attribute value using the values of that attribute from the
records that are most similar to the record with the missing value. The similarity of
two records is determined by using a distance function.

46. Mention the name of the framework developed by Apache for processing large
dataset for an application in a distributed computing environment?

The framework is Apache Hadoop. The complete Hadoop Ecosystem was developed for processing
large datasets for applications in a distributed computing environment. The Hadoop Ecosystem
consists of the following components.
 HDFS -> Hadoop Distributed File System
 YARN -> Yet Another Resource Negotiator
 MapReduce -> Data processing using programming
 Spark -> In-memory Data Processing
 PIG, HIVE-> Data Processing Services using Query (SQL-like)
 HBase -> NoSQL Database
 Mahout, Spark MLlib -> Machine Learning
 Apache Drill -> SQL on Hadoop
 Zookeeper -> Managing Cluster
 Oozie -> Job Scheduling
 Flume, Sqoop -> Data Ingesting Services
 Solr & Lucene -> Searching & Indexing
 Ambari -> Provision, Monitor and Maintain cluster

47.What is a hash table?


A hash table, also known as a hash map, is a data structure that implements an associative
array abstract data type, a structure that can map keys to values. Hash tables use a hash
function to compute an index into an array of buckets or slots, from which the desired value
can be found.

48.What are hash table collisions? How is it avoided?

A hash table collision happens when two different keys hash to the same slot. Two items
cannot be stored directly in the same slot of the array.

There are many techniques for handling hash table collisions; two of them are listed here:

 Separate Chaining:

Each slot holds a secondary data structure (such as a linked list) that stores all the items
that hash to that slot.

 Open addressing:

It searches for other slots using a probing sequence (for example, a second hash function)
and stores the item in the first empty slot that is found.
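Separate chaining can be illustrated with a small hash table whose buckets are Python lists. The `ChainedHashTable` class is a teaching sketch invented for this example, not how Python's built-in `dict` works, and the table size of 4 is chosen only to force a collision.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by keeping a list per bucket."""

    def __init__(self, size=4):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # The hash function maps a key to one of the bucket slots.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # Overwrite an existing key.
                return
        bucket.append((key, value))  # Colliding keys share the same bucket.

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

table = ChainedHashTable(size=4)
table.put(1, "one")
table.put(5, "five")  # 1 and 5 both map to slot 1 when the size is 4.
print(table.get(1), table.get(5))
```

Both keys remain retrievable even though they share a slot, because lookups compare keys inside the bucket.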

49.Explain what is imputation? List out different types of imputation techniques?

During imputation, we replace missing data with substituted values. The types of imputation
techniques are:

 Single Imputation

 Hot-deck imputation: A missing value is imputed from a randomly selected similar
record (the name dates back to data stored on punch cards)
 Cold-deck imputation: It works the same way as hot-deck imputation, but it
selects donors from another dataset
 Mean imputation: It involves replacing a missing value with the mean of that variable
for all other cases
 Regression imputation: It involves replacing a missing value with the predicted value
of a variable based on other variables
 Stochastic regression imputation: It is the same as regression imputation, but it adds a
random residual term to each predicted value so that the imputed values keep their natural variability

 Multiple Imputation

 Unlike single imputation, multiple imputation estimates the values multiple times
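Two of the single-imputation techniques can be sketched briefly. The `scores` list is hypothetical, and the hot-deck version here is simplified: it draws a donor at random from all observed values, whereas a fuller implementation would first restrict donors to similar records.

```python
import random
from statistics import mean

def mean_impute(values):
    """Replace every missing value with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def hot_deck_impute(values, seed=0):
    """Replace every missing value with a randomly chosen observed value
    (simplified: all observed records are treated as possible donors)."""
    rng = random.Random(seed)
    donors = [v for v in values if v is not None]
    return [rng.choice(donors) if v is None else v for v in values]

# Hypothetical survey scores with two missing answers.
scores = [4, None, 6, 8, None]
print(mean_impute(scores))
print(hot_deck_impute(scores))
```

Mean imputation gives every gap the same value, which understates the spread of the data; hot-deck imputation keeps the imputed values within the set of values that were actually observed.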

50.Which imputation method is more favorable?

Although single imputation is widely used, it does not reflect the uncertainty created by
data that is missing at random. So, multiple imputation is more favorable than single
imputation when data is missing at random.

51.Explain what is n-gram?

N-gram:

An n-gram is a contiguous sequence of n items from a given sequence of text or speech. An
n-gram model is a type of probabilistic language model for predicting the next item in such
a sequence in the form of an (n-1)-order Markov model.
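Extracting n-grams from a tokenised sentence takes one line of Python. The `ngrams` helper and the sample sentence are illustrative only.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the quick brown fox".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(bigrams)
print(trigrams)
```

A bigram model, for example, would count how often each second word follows each first word and use those counts to predict the next word.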

52.What do data analysts do?

Example:
“In general, data analysts collect, run and crunch data for insight that helps their company
make good decisions. They look for correlations and must communicate their results well.
Part of the job is also using the data to spot opportunities for preventative measures. That
requires critical thinking and creativity.”

53.Which data analysis software are you well-versed in?

Example:
“I have a breadth of software experience. For example, at my current employer, I do a lot of
ELKI data management and data mining algorithms. I can also create databases in Access
and make tables in Excel.”

54.What was your most difficult data analyst project?

Example:
“My most difficult project was on endangered animals. I had to predict how many of [animal]
would survive to 2020, 2050 and 2100. Before this, I’d dealt with data that was already there,
with events that had already happened. So, I researched the various habitats, the animal’s
predators and other factors, and did my predictions. I have high confidence in the results.”

55.What is your process when you start a new project?


Example:
“My first step is to take some time to look over the project so that I can define the objective
or problem. If I’m having a hard time figuring that part out, I reach out to the client. Next, I
feel out the data to see what’s there, how reliable it is and where it comes from. I think about
what could be the best way to model it and whether the project deadline seems to work.”

56. Why did you go into data analysis?

Example:
“A data analyst’s job is to take data and use it to help companies make better business
decisions. I’m good with numbers, collecting data, and market research. I chose this
role because it encompasses the skills I’m good at, and I find data and marketing
research interesting.”

How Do You Differentiate Between a Data Lake and a Data Warehouse?

A data lake is a large store of raw data kept in its native format, whether structured,
semi-structured, or unstructured. A data warehouse is a data storage structure that contains
data that has been cleaned and processed into a form where it can be used to easily generate
valuable insights.

57.How Do You Differentiate Between Overfitting and Underfitting?

Underfitting and overfitting are both modeling errors.

Overfitting occurs when a model begins to describe the noise or errors in a dataset instead of
the important relationships between data points. Underfitting occurs when a model is too
simple to capture the trends in a given dataset, typically because an inappropriate model has
been applied to it.
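A toy comparison can make the difference concrete. In the sketch below, the data, the three models, and the noise-free test points are all assumptions made for illustration: a constant model stands in for underfitting, a straight line for a reasonable fit, and a model that memorizes the training points for overfitting.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# True relationship is y = 2x; training points carry alternating +1/-1 noise.
train_x = list(range(10))
train_y = [2 * x + (1 if x % 2 == 0 else -1) for x in train_x]
# Held-out points are noise-free, so test error measures distance from the truth.
test_x = [x + 0.25 for x in train_x]
test_y = [2 * x for x in test_x]

mean_y = sum(train_y) / len(train_y)
slope, intercept = fit_line(train_x, train_y)
memory = dict(zip(train_x, train_y))

def underfit(x):
    """Too simple: always predicts the training mean."""
    return mean_y

def good_fit(x):
    """Straight line, matching the shape of the true relationship."""
    return intercept + slope * x

def overfit(x):
    """Memorizes training points, noise included (nearest-neighbour lookup)."""
    nearest = min(memory, key=lambda seen: abs(seen - x))
    return memory[nearest]

for name, model in [("underfit", underfit), ("good fit", good_fit), ("overfit", overfit)]:
    print(name, mse(model, train_x, train_y), mse(model, test_x, test_y))
```

The memorizing model has zero training error yet does worse than the line on held-out points, while the constant model does poorly on both sets. Comparing training error against held-out error in this way is a common first check for either problem.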

58.What Is the Difference Between Variance, Covariance, and Correlation?

1) Variance measures how far each value in a dataset lies from the mean. The
higher the variance, the more spread out the dataset. This measures magnitude.
2) Covariance measures how two random variables in a dataset change together. If
the covariance of two variables is positive, they move in the same direction; if it
is negative, they move in opposite directions. This measures direction.
3) Correlation is the degree to which two random variables in a dataset change
together. This measures magnitude and direction. The covariance tells you
whether the two variables move together, and the correlation coefficient tells you
how strongly they do so.
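The three measures can be computed directly from their definitions. The sketch below uses sample (n - 1) formulas and made-up numbers; the function names are chosen for this example only.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Sample variance: average squared distance from the mean (n - 1 divisor)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def covariance(xs, ys):
    """Sample covariance: the sign shows whether xs and ys move together."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the range [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]      # moves with x
falling = [10, 8, 6, 4, 2]     # moves against x
rescaled = [v * 10 for v in rising]

print(variance(x))                                     # spread of x
print(covariance(x, rising), covariance(x, falling))   # positive vs negative
print(covariance(x, rescaled), correlation(x, rescaled))
```

Rescaling one variable changes the covariance but leaves the correlation unchanged, which is why correlation is the measure to use when comparing the strength of relationships across different units.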

59.What Is a Normal Distribution?

A normal distribution, also called a Gaussian distribution, is one that is symmetric about the
mean. This means that half the data is on one side of the mean and half the data on the other.
Normal distributions occur in many natural situations, such as the heights of a population,
which is why they have gained prominence in the world of data analysis.
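The symmetry described above can be checked with Python's standard library. The mean of 170 and standard deviation of 10 below are hypothetical height figures picked for the example, not values from any dataset.

```python
from statistics import NormalDist

# Hypothetical heights in cm: mean 170, standard deviation 10.
heights = NormalDist(mu=170, sigma=10)

# Half of the probability mass lies below the mean.
below_mean = heights.cdf(170)

# Symmetry: the share below 160 equals the share above 180.
left_tail = heights.cdf(160)
right_tail = 1 - heights.cdf(180)

# Share of values within one standard deviation of the mean.
within_one_sigma = heights.cdf(180) - heights.cdf(160)

print(below_mean, left_tail, right_tail, within_one_sigma)
```

Roughly 68% of values fall within one standard deviation of the mean, which is the first part of the familiar 68-95-99.7 rule.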

60.Do Analysts Need Version Control?


Yes, data analysts should use version control when working with any dataset. This ensures
that you retain original datasets and can revert to a previous version even if a new operation
corrupts the data in some way. Tools like Pachyderm and Dolt can be used for creating
versions of datasets.

61.Can a Data Analyst Highlight Cells Containing Negative Values in an Excel Sheet?

Yes, it is possible to highlight cells with negative values in Excel. Here’s how to do that:

1. Go to the Home option in the Excel menu and click on Conditional Formatting.
2. Within the Highlight Cells Rules option, click on Less Than.
3. In the dialog box that opens, enter the value below which cells should be highlighted.
For negative values, enter 0. You can choose the highlight color in the dropdown menu.
4. Hit OK.

62. Which Skills and Qualities Make a Good Data Analyst?


"I believe having a strong technical background and knowledge of database tools is a good
foundation. But data analysts should also have an eye for detail, be curious and analytical,
and be able to interpret data in original ways."

63.Which Step in the Data Analysis Process Is Your Strongest?


"I'm particularly good at sorting and filtering data sets by defined variables. Breaking data
down into categories that are relevant to our KPIs makes it easier for me to track and
retrieve data, look for trends, and create visual reports."

Read through these once:

https://www.simplilearn.com/tutorials/data-analytics-tutorial/data-analyst-interview-questions

https://www.edureka.co/blog/interview-questions/data-analyst-interview-questions/
