Data Analyst Essentials Guide
Provide support for all data analysis work and coordinate with customers and staff
Resolve business-related issues for clients and perform audits on data
Analyze results, interpret data using statistical techniques, and provide ongoing reports
Prioritize business and information needs, working closely with management
Acquire data from primary or secondary data sources and maintain databases/data systems
Robust knowledge of reporting packages (Business Objects), programming languages (XML, JavaScript, or
ETL frameworks), and databases (SQL, SQLite, etc.)
Strong ability to collect, organize, analyze, and disseminate big data with accuracy
Technical knowledge of database design, data models, data mining, and segmentation techniques
Strong knowledge of statistical packages for analyzing large datasets (SAS, Excel, SPSS, etc.)
Problem definition
Data exploration
Data preparation
Modelling
Validation of data
Data cleaning, also referred to as data cleansing, deals with identifying and removing errors and
inconsistencies from data in order to improve data quality.
For large datasets, cleanse the data stepwise and improve it with each step until you achieve good data
quality
For large datasets, break them into smaller chunks. Working with less data will increase your iteration speed
To handle common cleansing tasks, create a set of utility functions/tools/scripts. These might include
remapping values based on a CSV file or SQL database, regex search-and-replace, or blanking out all
values that don’t match a regex
If you have issues with data cleanliness, sort them by estimated frequency and attack the most
common problems first
Analyze the summary statistics for each column (standard deviation, mean, number of missing values)
Keep track of every data cleaning operation, so you can alter or remove operations if required
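The cleansing utilities described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (blank_non_matching, remap, column_summary) and the sample values are hypothetical, and None is assumed to mark a missing value.

```python
import re
import statistics

def blank_non_matching(values, pattern):
    """Replace every value that does not fully match `pattern` with None."""
    regex = re.compile(pattern)
    return [v if v is not None and regex.fullmatch(v) else None for v in values]

def remap(values, mapping):
    """Remap values through a lookup table (which could be loaded from a CSV file)."""
    return [mapping.get(v, v) for v in values]

def column_summary(values):
    """Summary statistics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),
        "missing": len(values) - len(present),
    }

# Hypothetical sample data for illustration only.
zip_codes = ["12345", "1234x", "98765", None]
cleaned = blank_non_matching(zip_codes, r"\d{5}")
states = remap(["calif.", "NY", "N.Y."], {"calif.": "CA", "N.Y.": "NY"})
summary = column_summary([10.0, 12.0, None, 14.0])
```

Keeping each operation as a small named function also makes it easier to log which cleaning steps were applied and to undo one later.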
Logistic regression is a statistical method for examining a dataset in which one or more
independent variables determine an outcome. The outcome is typically binary (for example, yes/no).
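As a rough illustration of the idea, the sketch below fits a one-variable logistic regression with plain gradient descent. The data (hours studied versus pass/fail), the learning rate, and the number of epochs are all assumptions chosen for the example; in practice a statistical package would normally do the fitting.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and bias b by gradient descent on the average log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: hours studied (independent variable) and pass/fail (outcome).
hours = [0, 1, 2, 3, 4, 5]
passed = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 0 + b)   # predicted pass probability for 0 hours
p_high = sigmoid(w * 5 + b)  # predicted pass probability for 5 hours
```

The fitted model returns a probability, which is then compared against a threshold (commonly 0.5) to decide the predicted class.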
RapidMiner
OpenRefine
KNIME
Solver
NodeXL
io
Wolfram Alpha’s
8) Mention what is the difference between data mining and data profiling?
Data profiling: It focuses on the instance-level analysis of individual attributes. It gives information on various
attributes, such as value range, discrete values and their frequency, occurrence of null values, data type,
length, etc.
Data mining: It focuses on cluster analysis, detection of unusual records, dependencies, sequence
discovery, relationships holding between several attributes, etc.
Common misspelling
Duplicate entries
Missing values
Illegal values
Hadoop, with its MapReduce engine, is the programming framework developed by Apache for processing large data
sets for an application in a distributed computing environment.
11) Mention what are the missing patterns that are generally observed?
In KNN imputation, missing attribute values are imputed using the attribute values of the records that are most
similar to the record whose values are missing. A distance function determines the similarity of two
records.
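A minimal sketch of KNN imputation follows, assuming numeric columns, Euclidean distance, and that only one column has gaps. The function name knn_impute and the sample rows are hypothetical.

```python
import math

def knn_impute(rows, target, col, k=2):
    """Estimate rows[target][col] as the mean of that column over the k nearest donor rows.

    Assumes every column other than `col` is complete in all rows.
    """
    other_cols = [c for c in range(len(rows[target])) if c != col]

    def distance(a, b):
        # Euclidean distance over the columns that are not being imputed.
        return math.sqrt(sum((a[c] - b[c]) ** 2 for c in other_cols))

    donors = [r for i, r in enumerate(rows) if i != target and r[col] is not None]
    donors.sort(key=lambda r: distance(r, rows[target]))
    nearest = donors[:k]
    return sum(r[col] for r in nearest) / len(nearest)

# Hypothetical records; the last one is missing its third attribute.
rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [8.0, 9.0, 50.0],
    [1.05, 2.05, None],
]
imputed = knn_impute(rows, target=3, col=2, k=2)
```

Here the two closest records supply the value, so the distant third record has no influence on the result.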
13) Mention what are the data validation methods used by data analyst?
Data screening
Data verification
Prepare a validation report that gives information on all suspect data. It should include information such as
the validation criteria that each item failed and the date and time of occurrence
Experienced personnel should examine the suspicious data to determine its acceptability
Invalid data should be assigned a validation code and replaced
To work on missing data, use the best analysis strategy, such as the deletion method, single imputation methods,
model-based methods, etc.
15) Mention how to deal with multi-source problems?
Database knowledge
Database management
Data blending
Querying
Data manipulation
Predictive Analytics
Basic descriptive statistics
Predictive modeling
Advanced analytics
Big Data Knowledge
Big data analytics
Unstructured data analysis
Machine learning
Presentation skill
Data visualization
Insight presentation
Report design
20) Explain what is collaborative filtering?
Map-reduce is a framework to process large data sets, splitting them into subsets, processing each
subset on a different server and then blending results obtained on each.
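The split, process, and blend steps can be illustrated with a single-machine word count. This is only a sketch of the pattern: in a real cluster each chunk would be handled by a different server, and the function names here are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one subset of the data."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Blend the per-key values into one result per key."""
    return {key: sum(values) for key, values in groups.items()}

# Each string stands in for a subset that would live on a separate server.
chunks = ["big data big", "data analysis", "big analysis"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
```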
A good example of collaborative filtering is the “recommended for you” statement on
online shopping sites that pops up based on your browsing history.
Hadoop
Hive
Pig
Flume
Mahout
Sqoop
22) Explain what is KPI, design of experiments and 80/20 rule?
KPI: It stands for Key Performance Indicator. It is a metric about a business process, typically presented
through any combination of spreadsheets, reports, or charts
Design of experiments: It is the initial process used to split your data, sample it, and set it up for
statistical analysis
80/20 rule: It means that 80 percent of your income comes from 20 percent of your clients
24) Explain what is Clustering? What are the properties for clustering algorithms?
Clustering is a classification method that is applied to data. A clustering algorithm divides a data set into
natural groups or clusters.
Hierarchical or flat
Iterative
Hard and soft
Disjunctive
25) What are some of the statistical methods that are useful for a data analyst?
Bayesian method
Markov process
Spatial and cluster processes
Rank statistics, percentile, outliers detection
Imputation techniques, etc.
Simplex algorithm
Mathematical optimization
26) What is time series analysis?
Time series analysis can be done in two domains: the frequency domain and the time domain. In time
series analysis, the output of a particular process can be forecast by analyzing previous data with the
help of various methods such as exponential smoothing, the log-linear regression method, etc.
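Simple exponential smoothing, one of the methods named above, can be sketched as follows. The smoothing factor alpha and the sales figures are assumptions chosen for illustration.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed series: s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = [series[0]]  # start from the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical sales history.
sales = [10.0, 20.0, 30.0]
smoothed = exponential_smoothing(sales, alpha=0.5)
forecast = smoothed[-1]  # the last smoothed value serves as the next-period forecast
```

A larger alpha makes the forecast react faster to recent observations; a smaller alpha gives more weight to history.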
In computing, a hash table is a map of keys to values. It is a data structure used to implement an
associative array. It uses a hash function to compute an index into an array of slots, from which the desired
value can be fetched.
A hash table collision happens when two different keys hash to the same value. Two items cannot be
stored in the same slot of the array.
There are many techniques for handling hash table collisions; here we list two:
Separate chaining:
It uses a secondary data structure, such as a linked list, to store multiple items that hash to the same slot.
Open addressing:
It searches for other slots using a second function and stores the item in the first empty slot that is found
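Separate chaining can be sketched with a list of buckets, where each bucket holds every key/value pair that hashes to that slot. The class name ChainedHashTable is hypothetical, and Python lists stand in for linked lists.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by separate chaining."""

    def __init__(self, slots=4):
        self.buckets = [[] for _ in range(slots)]

    def _index(self, key):
        # The hash function maps a key to a slot in the array.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # a collision simply extends the chain

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

# With only two slots, five integer keys are guaranteed to collide.
table = ChainedHashTable(slots=2)
for k in range(5):
    table.put(k, k * k)
```

Lookups stay correct despite collisions because the search walks the whole chain in the matching slot.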
29) Explain what is imputation? List out different types of imputation techniques?
During imputation we replace missing data with substituted values. The types of imputation techniques
involved are
Single Imputation
Hot-deck imputation: A missing value is imputed from a randomly selected similar record (the term dates
back to the days of punch cards)
Cold-deck imputation: It works the same as hot-deck imputation, but it is more advanced and selects donors
from another dataset
Mean imputation: It involves replacing a missing value with the mean of that variable for all other cases
Regression imputation: It involves replacing a missing value with the predicted value of a variable based
on other variables
Stochastic regression: It is the same as regression imputation, but it adds a random error term, reflecting
the average regression variance, to the regression imputation
Multiple Imputation
Unlike single imputation, multiple imputation estimates the values multiple times
Although single imputation is widely used, it does not reflect the uncertainty created by data missing at
random. So, multiple imputation is more favorable than single imputation in the case of data missing at
random.
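Two of the single imputation techniques above, mean imputation and regression imputation, can be compared in a short sketch. The function names and the age/income figures are hypothetical, and the regression is a simple least-squares line on one predictor.

```python
import statistics

def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    mean = statistics.mean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]

def regression_impute(xs, ys):
    """Replace missing ys with predictions from a least-squares line fitted on complete pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mean_x = statistics.mean(x for x, _ in pairs)
    mean_y = statistics.mean(y for _, y in pairs)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
             / sum((x - mean_x) ** 2 for x, _ in pairs))
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]

# Hypothetical data: the third income is missing.
ages = [1.0, 2.0, 3.0, 4.0]
incomes = [10.0, 20.0, None, 40.0]
by_mean = mean_impute(incomes)
by_regression = regression_impute(ages, incomes)
```

Mean imputation ignores the other variable entirely, while regression imputation uses it; stochastic regression would additionally add random noise to the predicted value.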
To fulfill the responsibilities a data analyst must possess a vast and rich skillset:
o Structure discovery
o Content discovery
o Relationship discovery
Structure discovery
Content discovery
Relationship discovery
Cross Profiling
It counts how many times every value appears within each column in a table. It helps to
discover the trends and patterns within the data.
Cross column
The primary purpose of this method is to look across columns to perform key and
dependency analysis. Key analysis scans the values in a table to identify a potential
primary key. Dependency analysis finds the relationships within the sets of data. Both
of these analyses find the relationships and dependencies within a table.
Cross table profiling looks across tables to identify the potential foreign keys. It helps
find the differences and similarities in syntax and data types between tables to
determine which data might be redundant and which could be mapped together.
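The column counting and key analysis described above can be sketched as follows. The function names and the customer records are hypothetical, and a "potential primary key" is assumed here to mean a column whose values are all present and unique.

```python
from collections import Counter

def column_profile(rows, column):
    """Count how many times each value appears in one column."""
    return Counter(row[column] for row in rows)

def candidate_keys(rows):
    """Return columns whose values are all non-null and unique (potential primary keys)."""
    columns = rows[0].keys()
    return [c for c in columns
            if all(r[c] is not None for r in rows)
            and len({r[c] for r in rows}) == len(rows)]

# Hypothetical table for illustration only.
customers = [
    {"id": 1, "city": "Pune", "tier": "gold"},
    {"id": 2, "city": "Delhi", "tier": "silver"},
    {"id": 3, "city": "Pune", "tier": "gold"},
]
city_counts = column_profile(customers, "city")
keys = candidate_keys(customers)
```

Cross table profiling would apply the same idea between tables, for example by checking whether every value in one table's column appears in a candidate key of another.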
Data modeling involves various methods and techniques. Here are two
approaches to building data models (a third option is a combination of both):
The key feature of GoodData’s LDM (logical data model) is that users can interact with the LDM
without needing to know how to query the PDM (physical data model); analytics no longer
requires knowledge of database technologies and SQL. Only the
business intelligence engineers have to interact directly with the physical data
model. The LDM helps users answer analytical questions
without preparing datasets specific to each use case. The LDM contains separate entities
based on real-life objects, as well as the relationships among them
and their specified attributes. It ensures that different analysis results make
sense.
The writer goes on to define the four criteria of a good data model: “(1) Data in a good model
can be easily consumed. (2) Large data changes in a good model are scalable. (3) A good
model provides predictable performance. (4) A good model can adapt to changes in
requirements, but not at the expense of 1-3.”
Time series analysis is a specific way of analyzing a sequence of data points collected
over an interval of time. In time series analysis, analysts record data points at consistent
intervals over a set period of time rather than just recording the data points intermittently or
randomly.
A PivotTable is an interactive way to quickly summarize large amounts of data. You can
use a PivotTable to analyze numerical data in detail, and answer unanticipated questions about
your data. A PivotTable is especially designed for: Querying large amounts of data in many
user-friendly ways.
32) Explain what is the criteria for a good data model?
Cash, stocks, bonds, mutual funds, and bank deposits are all examples of financial
assets. Unlike land, property, commodities, or other tangible physical assets, financial assets do
not necessarily have inherent physical worth or even a physical form.
Data modeling is the process of diagramming data flows. When creating a new or alternate database
structure, the designer starts with a diagram of how data will flow into and out of the database. This
flow diagram is used to define the characteristics of the data formats, structures, and database handling
functions to efficiently support the data flow requirements. After the database has been built and
deployed, the data model lives on to become the documentation and justification for why the database
exists and how the data flows were designed.
This is because the three data models address the use of data assets from different degrees of
abstraction. The models increase in complexity – starting with conceptual, through to physical data
models. The models are used in different stages of the development process to foster the alignment
of business goals and requirements with how data resources are used.
Conceptual data models are used to communicate business structures and concepts at a high level of
abstraction. These models are constructed without taking system constraints into account and are usually
developed by business stakeholders and data architects to define and organize the information that is
needed to develop a system.
Logical data models are concerned with the types, attributes, and relationships of the entities that will
inhabit the system. A logical model is often created by a data architect and used by business analysts. The
goal is to develop a platform-independent representation of the entities and their relationships. This stage
of data modeling provides organizations with insight pertaining to the limitations of their current
technologies.
Physical data models are used to define the implementation of logical data models employing a particular
database management system (DBMS). They are built around current, or expected, technological
capabilities. Database developers and analysts work with physical data models to enact the ideas and
processes refined by conceptual and logical models.
Advantages of Big Data
1. Better Decision Making
Companies use big data in different ways to improve their B2B operations, advertising,
and communication. Many businesses including travel, real estate, finance, and
insurance are mainly using big data to improve their decision-making capabilities. Since
big data reveals more information in a usable format, businesses can utilize that data to
make accurate decisions on what consumers want or not and their behavioral
tendencies.
Big data facilitates the decision-making process by providing business intelligence and
advanced analytical insights. The more customer data a business has, the more
detailed overview it can gain about its target audience.
Data-driven insights reveal business trends and behaviors and allow companies to
expand and compete by optimizing their decision-making. Furthermore, these insights
enable businesses to create more tailored products and services, strategies, and well-
informed campaigns to compete within their industry.
2. Reduce costs of business processes
Surveys conducted by New Vantage and Syncsort (now Precisely) reveal that big
data analytics has helped businesses reduce their expenses significantly. 66.7% of
survey respondents from New Vantage claimed that they have started using big data to
reduce expenses. Furthermore, 59.4% of survey respondents from Syncsort claimed
that big data tools helped them reduce costs and increase operational efficiency.
Big data analytics tools such as cloud-based analytics and Hadoop can also
help reduce the cost of storing big data.
3. Fraud Detection
Financial companies, in particular, use big data to detect fraud. Data analysts
use machine learning algorithms and artificial intelligence to detect anomalies in
transaction patterns. These anomalies indicate that something is out
of order or mismatched, giving clues about possible fraud.
Fraud detection is especially important for credit unions, banks, and credit card companies
to identify misuse of account information, materials, or product access. Any industry, including
finance, can better serve its customers by identifying fraud early, before something
goes wrong.
For instance, credit card companies and banks can spot fraudulent purchases or stolen
credit cards using big data analytics even before the cardholder notices that something
is wrong.
4. Increased productivity
According to a survey from Syncsort, 59.9% of survey respondents have claimed that
they were using big data analytics tools like Spark and Hadoop to increase productivity.
This increase in productivity has, in turn, helped them to improve customer retention
and boost sales.
Modern big data tools help data scientists and analysts to analyze a large amount of
data efficiently, enabling them to have a quick overview of more information. This also
increases their productivity levels.
5. Improved customer service
Since big data analytics provides businesses with more information, they can utilize that
data to create more targeted marketing campaigns and special, highly personalized
offers for each individual client.
The major sources of big data are social media, email transactions, customers’ CRM
(customer relationship management) systems, etc. So, it exposes a wealth of
information to businesses about their customers’ pain points, touchpoints, values, and
trends to serve their customers better.
Moreover, big data helps companies understand how their customers think and feel and
thereby offer them more personalized products and services. Offering a personalized
experience can improve customer satisfaction, enhance relationships, and, most of all,
build loyalty.
6. Increased agility
Having huge data sets at their disposal allows companies to improve
communications, products, and services and to reevaluate risks. Big data also helps
companies improve their business tactics and strategies, which is very helpful in
aligning their business efforts to support frequent and faster changes in the industry.
Disadvantages
1. Lack of talent
According to a survey by AtScale, the lack of big data experts and data scientists has
been the biggest challenge in this field for the past three years. Currently, many IT
professionals don’t know how to carry out big data analytics as it requires a different
skill set. Thus, finding data scientists who are also experts in big data can be
challenging.
Big data experts and data scientists are two highly paid careers in the data science
field. Therefore, hiring big data analysts can be very expensive for companies,
especially for startups. Some companies have to wait for a long time to hire the required
staff to continue their big data analytics tasks.
2. Security risks
Most of the time, companies collect sensitive information for big data analytics. That
data needs protection, and a lack of proper maintenance can turn it into a security
risk.
Besides, holding huge data sets can attract unwanted attention from hackers,
and your business may become the target of a cyber-attack. Data
breaches have become one of the biggest threats to many companies today.
Another risk with big data is that unless you take all necessary precautions, important
information can be leaked to competitors.
3. Compliance
The need to comply with government legislation is also a drawback of big
data. If big data contains personal or confidential information, the company should make
sure that it follows government requirements and industry standards to store, handle,
maintain, and process that data.
So, data governance tasks, transmission, and storage will become more difficult to
manage as the big data volumes increase.
Conclusion
ETL, which stands for “extract, transform, load,” refers to the three processes that, in
combination, move data from one database, multiple databases, or other sources to a
unified repository, typically a data warehouse. It enables data analysis to provide
actionable business information, effectively preparing data for analysis and business
intelligence processes.
As data engineers are experts at making data ready for consumption by working with
multiple systems and tools, data engineering encompasses ETL. Data engineering
involves ingesting, transforming, delivering, and sharing data for analysis. These
fundamental tasks are completed via data pipelines that automate the process in a
repeatable way. A data pipeline is a set of data-processing elements that move data
from source to destination, and often from one format (raw) to another (analytics-ready).
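A data pipeline of this kind can be sketched with the Python standard library. Everything specific here is an assumption for illustration: the raw CSV text, the sales table, and the cleaning rules (trim and title-case names, drop rows without an amount). An in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data with inconsistent formatting and a missing amount.
RAW_CSV = "name,amount\n alice ,10\nBOB,\ncarol,30\n"

def extract(text):
    """Pull raw records from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Clean names, convert amounts to numbers, and drop rows with no amount."""
    cleaned = []
    for rec in records:
        if not rec["amount"]:
            continue  # invalid row: removed during transformation
        cleaned.append((rec["name"].strip().title(), float(rec["amount"])))
    return cleaned

def load(rows, conn):
    """Write analytics-ready rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
```

Because each stage is a separate function, the same pipeline can be rerun in a repeatable way whenever new source data arrives.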
ETL PROCESS
There are three unique processes in extract, transform, load. These are:
Extraction, in which raw data is pulled from a source or multiple sources. Data could
come from transactional applications, such as customer relationship management
(CRM) data from Salesforce or enterprise resource planning (ERP) data from SAP, or
Internet of Things (IoT) sensors that gather readings from a production line or factory
floor operation, for example. To create a data warehouse, extraction typically involves
combining data from these various sources into a single data set and then validating the
data with invalid data flagged or removed. Extracted data may be in several formats, such
as relational databases, XML, JSON, and others.
ETL TOOLS
ETL tools automate the extraction, transforming, and loading processes, consolidating
data from multiple data sources or databases. These tools may have data profiling, data
cleansing, and metadata-writing capabilities. A tool should be secure, easy to use and
maintain, and compatible with all components of an organization’s existing data
solutions.
What is ETL?
ETL tools enable data integration strategies by allowing companies to gather data from multiple data
sources and consolidate it into a single, centralized location. ETL tools also make it possible for different
types of data to work together.
A typical ETL process collects and refines different types of data, then delivers the data to a data lake or
data warehouse such as Redshift, Azure, or BigQuery.
ETL tools also make it possible to migrate data between a variety of sources, destinations, and analysis
tools. As a result, the ETL process plays a critical role in producing business intelligence and executing
broader data management strategies. Reverse ETL is also becoming more
common: cleaned and transformed data is sent from the data warehouse back into the business
application.
Step 1: Extraction
Most businesses manage data from a variety of data sources and use a number of data analysis tools to
produce business intelligence. To execute such a complex data strategy, the data must be able to travel
freely between systems and apps.
Before data can be moved to a new destination, it must first be extracted from its source — such as a
data warehouse or data lake. In this first step of the ETL process, structured and unstructured data is
imported and consolidated into a single repository. Volumes of data can be extracted from a wide range
of data sources, including:
Step 2: Transformation
During this phase of the ETL process, rules and regulations can be applied that ensure data quality and
accessibility. You can also apply rules to help your company meet reporting requirements. The process
of data transformation is comprised of several sub-processes:
Transformation is generally considered to be the most important part of the ETL process. Data
transformation improves data integrity — removing duplicates and ensuring that raw data arrives at its
new destination fully compatible and ready to use.
Full loading: In an ETL full loading scenario, everything that comes from the transformation assembly
line goes into new, unique records in the data warehouse or data repository. Though there may be
times this is useful for research purposes, full loading produces datasets that grow rapidly and can
quickly become difficult to maintain.
Incremental loading: A less comprehensive but more manageable approach is incremental loading.
Incremental loading compares incoming data with what’s already on hand, and only produces additional
records if new and unique information is found. This architecture allows smaller, less expensive data
warehouses to maintain and manage business intelligence.
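The difference between the two loading strategies can be sketched with plain lists standing in for the warehouse. The function names, the id key, and the batches are hypothetical, and this incremental version only checks whether a key is new; it does not detect changes to existing records.

```python
def full_load(warehouse, batch):
    """Append every incoming record as a new row."""
    warehouse.extend(batch)
    return len(batch)

def incremental_load(warehouse, batch, key="id"):
    """Append only records whose key is not already in the warehouse."""
    seen = {row[key] for row in warehouse}
    added = 0
    for row in batch:
        if row[key] not in seen:
            warehouse.append(row)
            seen.add(row[key])
            added += 1
    return added

# Two hypothetical batches that overlap on id 2.
batch_1 = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
batch_2 = [{"id": 2, "v": "b"}, {"id": 3, "v": "c"}]

full, incremental = [], []
for batch in (batch_1, batch_2):
    full_load(full, batch)
    incremental_load(incremental, batch)
```

After both batches, the full-load target holds the overlapping record twice, while the incremental target holds each record once.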
ETL use case: business intelligence
Data strategies are more complex than they’ve ever been; SaaS gives companies access to data from
more data sources than ever before. ETL tools make it possible to transform vast quantities of data into
actionable business intelligence.
Consider the amount of raw data available to a manufacturer. In addition to the data generated by
sensors in the facility and the machines on an assembly line, the company also collects marketing, sales,
logistics, and financial data (often using a SaaS tool).
All of that data must be extracted, transformed, and loaded into a new destination for analysis. ETL
enables data management, business intelligence, data analytics, and machine learning capabilities by:
Most companies today rely on an ETL tool as part of their data integration process. ETL tools are known
for their speed, reliability, and cost-effectiveness, as well as their compatibility with broader data
management strategies. ETL tools also incorporate a broad range of data quality and data governance
features.
When choosing which ETL tool to use, you’ll want to consider the number and variety of connectors
you’ll need as well as its portability and ease of use. You’ll also need to determine if an open-source tool
is right for your business since these typically provide more flexibility and help users avoid vendor lock-
in.
ELT vs ETL
Traditional ETL software extracts and transforms data from different sources before loading it into a
data warehouse or data lake. With the introduction of the cloud data warehouse, there was no longer
the need for data cleanup on dedicated ETL hardware before loading into your data warehouse or data
lake. The cloud enables a push-down ELT architecture with two steps changed from the ETL pipeline.
EXTRACT Extract the data from multiple data sources and connectors
LOAD Load it into the cloud data warehouse
TRANSFORM Transform it using the power and scalability of the target cloud platform
If you are still on premises and your data isn't coming from several different sources, ETL tools still fit
your data analytics needs. But as more businesses move to a cloud data architecture (or hybrid), ELT
processes are more adaptable and scalable to evolving needs of cloud-based businesses.
Banking offers numerous career options for freshers and experienced professionals.
Nevertheless, along with having the academic qualifications and aptitude, you have to clear
the interview process, which consists of varied types of interview questions.
A job in banking may or may not require experience, but it does require an impressive
interview round.
The finance and banking industry offers a range of entry points for graduates from
various academic disciplines, in roles such as corporate banking, customer relationship
management, research, tax analysis, and other analyst positions.
In the article below, let's explore the top 21 banking interview questions and answers
which will help you clear the interview with flying colours.
Answer: It is the first fundamental question that every interviewer asks a candidate to
start the conversation and get to know the person. So, always be positive and introduce
yourself, starting with your name, qualifications, and any other information that
is important for an interviewer to know. Keep it within two minutes so that it
does not turn into a boring conversation.
Question 2: Why do you want to join the banking sector?
Answer: Answer this question logically by explaining, with facts and figures at the ready,
how the banking sector has influenced people and why it is the
fastest-growing sector. Do not start by saying that you want a stable career or
by giving some personal view. Make the answer well reasoned so that the interviewer can form
a good opinion of it.
Answer: Be straightforward and begin your answer with information that directly
addresses the interviewer's question. The types of accounts in banks are:
Checking Account: You can access this account like a savings account but, unlike a savings account,
you cannot earn interest on it. The benefit of opening a checking account in a bank is
that there is no limit on withdrawals.
Money Market Account: This account gives the benefits of both savings and checking
accounts. You can withdraw money and still earn higher interest on it. This type of
account can be opened with a minimum balance.
Certificate of Deposit Account (CD): When you open such an account, you have to deposit your
money for a fixed period, such as five or seven years, and you will earn interest on it. The
rate of interest is decided by the bank, and you cannot withdraw the funds until the fixed
period expires.
Savings Account: You can save your money in such an account and also earn interest on it. The
number of withdrawals is limited, and you need to maintain a minimum balance in the
account for it to remain active.
Question 4: What are the necessary documents a person requires to open an account
in a bank?
Answer: The RBI has advised banks to follow the Know Your Customer (KYC)
guidelines, under which the bank obtains some personal information about the account holder. The
primary documents needed to open an account are photographs, proof of identity
such as an Aadhaar card or PAN card, and proof of address.
Answer: APR stands for annual percentage rate. It is a charge or interest that the
bank imposes on its customers for using services such as loans and credit cards. The
interest is calculated annually.
Answer: The debt-to-income ratio is calculated by dividing a loan applicant’s total debt
payment by their gross income.
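The calculation can be sketched as a one-line function. The figures are hypothetical, and both amounts are assumed to cover the same period (for example, one month).

```python
def debt_to_income(total_debt_payment, gross_income):
    """Debt-to-income ratio: total debt payment divided by gross income."""
    if gross_income <= 0:
        raise ValueError("gross income must be positive")
    return total_debt_payment / gross_income

# Hypothetical applicant: 1,500 in debt payments against 5,000 gross income.
ratio = debt_to_income(total_debt_payment=1500.0, gross_income=5000.0)
ratio_percent = ratio * 100  # the ratio is commonly quoted as a percentage
```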
Answer: Loan grading is the classification of a loan based on various risks and
parameters, such as repayment risk and the borrower’s credit history. The system places a loan in
one of six categories, based on the stability and risk associated with the loan.
Answer: A person who signs a note to guarantee the payment of the loan on behalf of
the main loan applicant is known as a co-maker or co-signer.
Question 11: What is the line of credit?
Answer: A line of credit is an agreement between the bank and a borrower to provide a
certain amount in loans on the borrower’s demand. The borrower can withdraw the amount
at any moment and pays interest only on the amount withdrawn.
Accepting deposit
Banking Value chain
Interest spread
Providing funds to borrowers on interest
Additional charges on services like checking account maintenance, online bill payment etc.
Question 13: What is the payroll card?
Answer: Payroll cards are a type of smart card issued by banks to facilitate salary
payments between employers and employees. Through a payroll card, the employer can
load salary payments onto an employee’s smart card, and the employee can withdraw the
salary even if he or she doesn’t have an account with the bank.
Answer: A payday loan is a small, short-term loan available at a
high interest rate.
Question 17: What are the different types of loans offered by commercial banks?
These questions and answers can help you clear the interview panel
and get a position in the banking sector. You can also search online for more
practice questions.
1. Branches of Bank.
2. Mobile Banking.
3. ATM.
4. Internet Banking.
Carefully consider these frequently asked banking interview questions and use
the excellent interview answer help and guidelines to prepare your own winning
bank interview answers. Get the banking job you want!
Emphasize what qualifies you for this banking job and how you can add value to
both the position and the bank. Look at the banking job requirements such as:
accuracy
customer care
computer skills
numeracy skills
communication skills
Highlight how you have demonstrated these skills previously.
If there are areas of the job function that you do not yet have experience in, then
highlight what skills you have that will facilitate learning and performing these
tasks.
For example, your ability to remain calm under pressure and communicate
clearly will help you in dealing with customers.
Emphasize qualities like loyalty, integrity, confidentiality and commitment.
State your technical knowledge and confirm your understanding of the
basics of bank products and services.
Those banking interview questions that ask you to provide an example of how
you have previously demonstrated a skill or ability are called behavioral based
interview questions.
These types of questions are used to assess whether you have the necessary
competencies to perform in a banking job. The behavioral interview question and
answer guide will help you to prepare for these bank interview questions.
Find out more about customer service job interview questions and answers and
be well prepared for any customer service orientated interview questions.
Bank job candidates may be asked to define good customer service in order to
evaluate their customer service orientation.
3. Tell me about a time where you had to use your discretion and
tact to do your job properly.
You are often required to display diplomacy and tact with customers in a banking
environment. Provide an example of a challenging situation where you had to
handle the customer carefully and with discretion.
Discuss how you used your sensitivity and communication skills to manage the
situation.
Provide specific examples of how you check your outputs for accuracy and
completeness and what you do if you find a mistake.
Your example should clearly indicate how you changed your communication style
to meet the customer's needs.
Discuss the resources you use to meet the different work demands including
prioritizing, planning, scheduling and asking for assistance when appropriate.
Your judgment is also under scrutiny here so describe your motivation to take
action.
8. How many scheduled days have you missed during the last
four months?
Bank interview questions will explore your reliability. Be honest about this as it
can always be verified with a reference check.
Focus on your reliability, punctuality and your willingness to work extra hours if
needed.
9. What are the most important qualities for a bank teller job?
Bank interview questions like this are asked to explore your understanding of
banking job requirements. Focus on technical skills such as:
numeracy
computer literacy
product and services knowledge
Discuss key job competencies including accuracy, customer service orientation,
judgment, integrity, reliability and the ability to cope under pressure.
Point out your strengths as they relate to these qualities. Use this list of
strengths to help you.