Ultimate Guide: ETL/Data Warehouse Testing
Before we learn ETL Testing, let’s understand –
What is BI?
Business Intelligence is the process of collecting raw business data and turning it into
information that is useful and more meaningful. The raw data is the record of an organization's
daily transactions, such as interactions with customers, administration of finance, and
management of employees. This data is then used for reporting, analysis, data
mining, data quality and interpretation, and predictive analysis.
What is Data Warehouse?
A data warehouse is a database that is designed for query and analysis rather than for transaction
processing. The data warehouse is constructed by integrating the data from multiple
heterogeneous sources. It enables the company or organization to consolidate data from several
sources and separates analysis workload from transaction workload. Data is turned into high
quality information to meet all enterprise reporting requirements for all levels of users.
What is ETL?
ETL stands for Extract-Transform-Load. It is the process by which data is loaded from the source
system to the data warehouse. Data is extracted from an OLTP database, transformed to match
the data warehouse schema and loaded into the data warehouse database. Many data warehouses
also incorporate data from non-OLTP systems such as text files, legacy systems and
spreadsheets.
Let's see how it works
For example, consider a retail store with different departments such as sales, marketing and
logistics. Each of them handles customer information independently, and the way
they store that data is quite different. The sales department stores it by customer name,
while the marketing department stores it by customer ID.
Now, if they want to check a customer's history and find out which
products he or she bought as a result of different marketing campaigns, it would be very tedious.
The solution is to use a data warehouse to store information from different sources in a uniform
structure using ETL. ETL can transform dissimilar data sets into a unified structure. BI
tools can later be used to derive meaningful insights and reports from this data.
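The retail example can be sketched in a few lines of Python. This is only an illustration: the record layouts, the field names and the assumption that the marketing extract also carries the customer name (so the two sources can be matched) are all hypothetical.

```python
# Hypothetical extracts: sales keys its records by customer name,
# marketing keys its records by customer ID.
sales_rows = [
    {"customer_name": "Alice Smith", "product": "Laptop"},
    {"customer_name": "Bob Jones", "product": "Phone"},
]
marketing_rows = [
    {"customer_id": 101, "customer_name": "Alice Smith", "campaign": "Spring Sale"},
    {"customer_id": 102, "customer_name": "Bob Jones", "campaign": "Newsletter"},
]


def transform(sales, marketing):
    """Conform both extracts to one structure keyed by customer ID."""
    # Assumption: customer names are unique, so they can serve as the match key.
    id_by_name = {m["customer_name"]: m["customer_id"] for m in marketing}
    campaign_by_id = {m["customer_id"]: m["campaign"] for m in marketing}
    unified = []
    for sale in sales:
        customer_id = id_by_name[sale["customer_name"]]
        unified.append({
            "customer_id": customer_id,
            "customer_name": sale["customer_name"],
            "product": sale["product"],
            "campaign": campaign_by_id[customer_id],
        })
    return unified


# "Load": here the warehouse is just an in-memory list of unified rows.
warehouse = transform(sales_rows, marketing_rows)
```

Once both sources share one structure, a question such as "which product was bought under which campaign" becomes a simple lookup over `warehouse`.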
The following diagram gives you the ROAD MAP of the ETL process
1. Extract
Extract relevant data
2. Transform
Transform data to DW (Data Warehouse) format
Build keys - A key is one or more data attributes that uniquely identify an
entity. The various types of keys are primary key, alternate key, foreign key,
composite key and surrogate key. The data warehouse owns these keys and
never allows any other entity to assign them.
Cleansing of data - After the data is extracted, it moves into the next
phase, cleaning and conforming. Cleaning handles omissions in
the data and identifies and fixes errors. Conforming means
resolving conflicts between incompatible data so that
it can be used in an enterprise data warehouse. In addition, this
system creates metadata that is used to diagnose source system problems
and improve data quality.
3. Load
Load data into DW (Data Warehouse)
Build aggregates - Creating an aggregate means summarizing and storing data
from the fact table in order to improve the performance of
end-user queries.
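The "build keys" and "build aggregates" steps can be illustrated with a minimal sketch that uses Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions chosen for the example, not part of any particular warehouse design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: the warehouse assigns the surrogate key itself.
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,   -- surrogate key
        source_customer_id TEXT UNIQUE      -- key used by the source system
    );
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        amount REAL
    );
    -- Aggregate table: a stored summary of the fact table.
    CREATE TABLE agg_sales_by_customer (
        customer_key INTEGER PRIMARY KEY,
        total_amount REAL,
        sale_count INTEGER
    );
""")

# Hypothetical extracted rows: (source customer ID, sale amount).
source_sales = [("C-1", 10.0), ("C-2", 5.0), ("C-1", 2.5)]

for source_id, amount in source_sales:
    # Create the dimension row only the first time a source ID is seen.
    conn.execute(
        "INSERT OR IGNORE INTO dim_customer (source_customer_id) VALUES (?)",
        (source_id,),
    )
    (customer_key,) = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE source_customer_id = ?",
        (source_id,),
    ).fetchone()
    conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (customer_key, amount))

# Build the aggregate once so that reports do not have to scan the fact table.
conn.execute("""
    INSERT INTO agg_sales_by_customer
    SELECT customer_key, SUM(amount), COUNT(*)
    FROM fact_sales
    GROUP BY customer_key
""")

aggregates = conn.execute(
    "SELECT customer_key, total_amount, sale_count "
    "FROM agg_sales_by_customer ORDER BY customer_key"
).fetchall()
```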
ETL Testing Process
Like other testing processes, ETL testing goes through different phases.
ETL testing is performed in five stages:
1. Identifying data sources and requirements
2. Data acquisition
3. Implement business logic and dimensional modelling
4. Build and populate data
5. Build Reports
Types of ETL Testing
Types of Testing: Testing Process
Production Validation Testing: Also called “table balancing” or “production reconciliation”, this
type of ETL testing is done on data as it is being moved into production systems. To support your
business decisions, the data in your production systems has to be in the correct order. Informatica
Data Validation Option provides the ETL testing automation and management capabilities to ensure
that production systems are not compromised by the data.
Source to Target Testing (Validation Testing): This type of testing is carried out to validate whether
the data values transformed are the expected data values.
Application Upgrades: This type of ETL testing can be automatically generated, saving substantial
test development time. It checks whether the data extracted from an older application or repository
is exactly the same as the data in a repository or new application.
Metadata Testing: Metadata testing includes data type checks, data length checks and
index/constraint checks.
Data Completeness Testing: Data completeness testing is done to verify that all the expected data
is loaded in the target from the source. Some of the tests that can be run are to compare and
validate counts, aggregates and actual data between the source and target for columns with simple
transformation or no transformation.
Data Accuracy Testing: This testing is done to ensure that the data is accurately loaded and
transformed as expected.
Data Transformation Testing: Testing data transformation is done because in many cases it cannot be
achieved by writing one source SQL query and comparing the output with the target. Multiple SQL
queries may need to be run for each row to verify the transformation rules.
Data Quality Testing: Data quality tests include syntax and reference tests. Data quality testing
is done in order to avoid any error due to date or order number during the business process.
    Syntax Tests: These report dirty data, based on invalid characters, character pattern, incorrect
    upper or lower case order etc.
    Reference Tests: These check the data according to the data model. For example: Customer ID.
    Data quality testing includes number check, date check, precision check, data check, null check etc.
Incremental ETL Testing: This testing is done to check the data integrity of old and new data with
the addition of new data. Incremental testing verifies that the inserts and updates are getting
processed as expected during the incremental ETL process.
GUI/Navigation Testing: This testing is done to check the navigation or GUI aspects of the front
end reports.
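As an illustration of data completeness testing, the sketch below compares a row count and a column total between a source and a target table. It uses Python's sqlite3 module; the helper `completeness_report` and the `src_orders`/`tgt_orders` tables are hypothetical names made up for this example.

```python
import sqlite3


def completeness_report(conn, source_table, target_table, amount_column):
    """Compare the row count and one column total between source and target.

    Identifiers are formatted into the SQL text, so pass trusted names only.
    """
    sql = "SELECT COUNT(*), COALESCE(SUM({col}), 0) FROM {table}"
    src_count, src_total = conn.execute(
        sql.format(col=amount_column, table=source_table)
    ).fetchone()
    tgt_count, tgt_total = conn.execute(
        sql.format(col=amount_column, table=target_table)
    ).fetchone()
    return {
        "source_count": src_count,
        "target_count": tgt_count,
        "counts_match": src_count == tgt_count,
        "totals_match": src_total == tgt_total,
    }


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, amount INTEGER);
    CREATE TABLE tgt_orders (order_id INTEGER, amount INTEGER);
    INSERT INTO src_orders VALUES (1, 100), (2, 250), (3, 75);
    INSERT INTO tgt_orders VALUES (1, 100), (2, 250);  -- order 3 was not loaded
""")

report = completeness_report(conn, "src_orders", "tgt_orders", "amount")
```

Here the target is missing one order, so both `counts_match` and `totals_match` come back false. Matching counts and totals alone do not prove that every value is correct, which is why the row-level comparisons are also needed.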
How to Create an ETL Test Case
ETL testing is a concept that can be applied to different tools and databases in the information
management industry. The objective of ETL testing is to assure that the data that has been
loaded from a source to a destination after business transformation is accurate. It also
involves verifying the data at the various intermediate stages used between source and
destination.
While performing ETL testing, two documents that an ETL tester will always use are:
1. ETL mapping sheets: An ETL mapping sheet contains all the information about the
source and destination tables, including each and every column and its lookup
in reference tables. ETL testers need to be comfortable with SQL
queries, as ETL testing may involve writing big queries with multiple joins to
validate data at any stage of ETL. ETL mapping sheets provide significant
help while writing queries for data verification.
2. DB schema of source and target: It should be kept handy to verify any detail in the
mapping sheets.
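To show how a mapping sheet can drive verification queries, here is a small sketch in Python with sqlite3. The mapping sheet layout, the helper `build_mismatch_query` and the sample tables are assumptions for illustration; real mapping sheets usually carry far more detail, such as transformation rules and lookups.

```python
import sqlite3

# Hypothetical mapping sheet: source column -> target column for one table pair.
MAPPING_SHEET = [
    {"source_column": "cust_name", "target_column": "customer_name"},
    {"source_column": "cust_city", "target_column": "city"},
]


def build_mismatch_query(source_table, target_table, key_column, mapping_row):
    """Return SQL listing the keys whose mapped column differs between tables.

    IS NOT is SQLite's NULL-safe inequality; other databases spell it differently.
    """
    return (
        f"SELECT s.{key_column} FROM {source_table} AS s "
        f"JOIN {target_table} AS t ON s.{key_column} = t.{key_column} "
        f"WHERE s.{mapping_row['source_column']} IS NOT t.{mapping_row['target_column']} "
        f"ORDER BY s.{key_column}"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_customer (id INTEGER, cust_name TEXT, cust_city TEXT);
    CREATE TABLE tgt_customer (id INTEGER, customer_name TEXT, city TEXT);
    INSERT INTO src_customer VALUES (1, 'Ann', 'Oslo'), (2, 'Raj', 'Pune');
    INSERT INTO tgt_customer VALUES (1, 'Ann', 'Oslo'), (2, 'Raj', 'Delhi');
""")

# One generated query per mapping row; collect the mismatching keys per column.
mismatches = {}
for row in MAPPING_SHEET:
    query = build_mismatch_query("src_customer", "tgt_customer", "id", row)
    mismatches[row["target_column"]] = [r[0] for r in conn.execute(query)]
```

In this sample data the city of customer 2 differs between source and target, so `mismatches` flags key 2 for the `city` column only.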
ETL Test Scenarios and Test Cases
Test Scenario: Test Cases
Mapping doc validation: Verify whether the corresponding ETL information is provided in the mapping doc. A change log should be maintained in every mapping doc.
Validation:
1. Validate the source and target table structure against the corresponding mapping doc.
2. Source data type and target data type should be the same.
3. Length of data types in both source and target should be equal.
4. Verify that data field types and formats are specified.
5. Source data type length should not be less than the target data type length.
6. Validate the names of columns in the table against the mapping doc.
Constraint Validation: Ensure the constraints are defined for each specific table as expected.
Data consistency issues:
1. The data type and length for a particular attribute may vary in files or tables though the semantic definition is the same.
2. Misuse of integrity constraints.
Completeness Issues:
1. Ensure that all expected data is loaded into the target table.
2. Compare record counts between source and target.
3. Check for any rejected records.
4. Check that data is not truncated in the columns of target tables.
5. Check boundary value analysis.
6. Compare unique values of key fields between the data loaded to the warehouse and the source data.
Correctness Issues:
1. Data that is misspelled or inaccurately recorded.
2. Null, non-unique or out-of-range data.
Transformation: Transformation
Data Quality:
1. Number check: Numeric values need to be checked and validated.
2. Date check: Dates have to follow the date format, and it should be the same across all records.
3. Precision check
4. Data check
5. Null check
Null Validate: Verify the null values where “Not Null” is specified for a specific column.
Duplicate Check:
1. Validate that the unique key, the primary key and any other column that should be unique as per the business requirements do not have any duplicate rows.
2. Check whether any duplicate values exist in any column that is extracted from multiple columns in the source and combined into one column.
3. As per the client requirements, ensure that there are no duplicates in combinations of multiple columns within the target only.
Date Validation: Date values are used in many areas of ETL development:
1. To know the row creation date.
2. To identify active records from the ETL development perspective.
3. To identify active records from the business requirements perspective.
4. Sometimes the updates and inserts are generated based on the date values.
Complete Data Validation:
1. To validate the complete data set in the source and target tables, a minus query is the best solution.
2. Run both source minus target and target minus source.
3. If a minus query returns any rows, those should be considered mismatching rows.
4. Match rows between source and target using an intersect statement.
5. The count returned by intersect should match the individual counts of the source and target tables.
6. If the minus query returns no rows and the intersect count is less than the source count or the target table count, duplicate rows exist.
Data Cleanness: Unnecessary columns should be deleted before loading into the staging area.
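The minus, intersect and duplicate checks listed above can be tried out with the following sketch, written in Python with sqlite3. SQLite spells the minus operator as EXCEPT (Oracle uses MINUS). The `src` and `tgt` tables and their rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER, name TEXT);
    CREATE TABLE tgt (id INTEGER, name TEXT);
    INSERT INTO src VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- The target has one changed value (row 2) and one duplicated row (row 3).
    INSERT INTO tgt VALUES (1, 'a'), (2, 'B'), (3, 'c'), (3, 'c');
""")

# Source minus target: rows that are missing from, or different in, the target.
source_minus_target = conn.execute(
    "SELECT id, name FROM src EXCEPT SELECT id, name FROM tgt"
).fetchall()

# Target minus source: rows in the target that have no match in the source.
target_minus_source = conn.execute(
    "SELECT id, name FROM tgt EXCEPT SELECT id, name FROM src"
).fetchall()

# INTERSECT returns each matching row once, so duplicates collapse.
(intersect_count,) = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT id, name FROM src INTERSECT SELECT id, name FROM tgt)"
).fetchone()
(target_count,) = conn.execute("SELECT COUNT(*) FROM tgt").fetchone()

# Duplicate check: any (id, name) combination that occurs more than once.
duplicates = conn.execute(
    "SELECT id, name, COUNT(*) FROM tgt GROUP BY id, name HAVING COUNT(*) > 1"
).fetchall()
```

Both minus queries return a row here, which marks row 2 as a mismatch. The intersect count is also lower than the target count, which is the signal that makes the explicit duplicate check worth running.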
Types of ETL Bugs
Type of Bug: Description
User interface bugs/cosmetic bugs: Related to the GUI of the application. Font style, font size, colors, alignment, spelling mistakes, navigation and so on.
Boundary Value Analysis (BVA) related bug: Minimum and maximum values.
Equivalence Class Partitioning (ECP) related bug: Valid and invalid type.
Input/Output bugs: Valid values not accepted. Invalid values accepted.
Calculation bugs: Mathematical errors. Final output is wrong.
Load Condition bugs: Does not allow multiple users. Does not allow customer expected load.
Race Condition bugs: System crash and hang. System cannot run client platforms.
Version control bugs: No logo matching. No version information available. This occurs usually in regression testing.
H/W bugs: Device is not responding to the application.
Help Source bugs: Mistakes in help documents.
Difference between Database testing and ETL testing
ETL Testing:
1. Verifies whether data is moved as expected
2. Verifies whether counts in the source and target are matching
3. Verifies whether the data transformed is as per expectation
4. Verifies that the foreign-primary key relations are preserved during the ETL
5. Verifies for duplication in loaded data
Database Testing:
1. The primary goal is to check if the data is following the rules/standards defined in the data model
2. Verifies that there are no orphan records and foreign-primary key relations are maintained
3. Verifies that there are no redundant tables and the database is optimally normalized
4. Verifies if data is missing in columns where required
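The orphan-record check mentioned in this comparison can be expressed as a left join that keeps only child rows without a parent. The sketch below uses Python with sqlite3; the `dim_customer` and `fact_sales` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        customer_key INTEGER,
        amount INTEGER
    );
    INSERT INTO dim_customer VALUES (1, 'Ann'), (2, 'Raj');
    -- Sale 12 points at customer_key 3, which has no parent row.
    INSERT INTO fact_sales VALUES (10, 1, 100), (11, 2, 50), (12, 3, 75);
""")

# Orphans are child rows whose foreign key finds no matching parent row.
orphans = conn.execute("""
    SELECT f.sale_id, f.customer_key
    FROM fact_sales AS f
    LEFT JOIN dim_customer AS d ON d.customer_key = f.customer_key
    WHERE d.customer_key IS NULL
""").fetchall()
```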
Responsibilities of an ETL tester
Key responsibilities of an ETL tester are segregated into three categories:
Stage table/SFS or MFS
Business transformation logic applied
Target table loading from stage file or table after applying a transformation
Some of the responsibilities of an ETL tester are:
Test ETL software
Test components of the ETL data warehouse
Execute backend data-driven tests
Create, design and execute test cases, test plans and test harnesses
Identify problems and provide solutions for potential issues
Approve requirements and design specifications
Test data transfers and flat files
Write SQL queries for various scenarios, such as count tests
ETL Performance Testing and Tuning
ETL performance testing is a confirmation test to ensure that an ETL system can handle the load
of multiple users and transactions. The goal of performance tuning is to optimize session
performance by eliminating performance bottlenecks. To tune or improve the performance of the
session, you have to identify performance bottlenecks and eliminate them. Performance bottlenecks
can be found in the source and target databases, the mapping, the session and the system. One of the
best tools used for performance testing is Informatica.
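Independently of any specific tool, the idea of locating a bottleneck can be sketched by timing each stage of a pipeline separately. The Python example below is a toy illustration: `time_stages` and the three stage functions are hypothetical stand-ins for real extract, transform and load steps.

```python
import time


def time_stages(stages):
    """Run (name, function) stages in order, passing data along, and time each."""
    timings = {}
    data = None
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


# Hypothetical stand-ins for real pipeline steps.
def extract(_):
    return list(range(10_000))


def transform(rows):
    return [value * 2 for value in rows]


def load(rows):
    return len(rows)  # pretend every row was written and report the count


loaded_rows, timings = time_stages(
    [("extract", extract), ("transform", transform), ("load", load)]
)

# The slowest stage is the first candidate for tuning.
bottleneck = max(timings, key=timings.get)
```

Timings from a single run are noisy, so in practice the measurement would be repeated on realistic data volumes before any stage is tuned.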
Automation of ETL Testing
The general methodology of ETL testing is to use SQL scripting or to do “eyeballing” of data.
These approaches to ETL testing are time-consuming, error-prone and seldom provide complete
test coverage. To accelerate testing, improve coverage, reduce costs and improve the defect detection ratio of
ETL testing in production and development environments, automation is the need of the hour.
One such tool is Informatica.
Best Practices for ETL Testing
1. Make sure data is transformed correctly
2. Ensure that projected data is loaded into the data warehouse without any data loss or
truncation
3. Ensure that the ETL application appropriately rejects invalid data, replaces it with default values and
reports it
4. Ensure that the data is loaded into the data warehouse within the prescribed and expected
time frames to confirm scalability and performance
5. All methods should have appropriate unit tests regardless of visibility
6. All unit tests should use appropriate coverage techniques to measure their effectiveness
7. Strive for one assertion per test case
8. Create unit tests that target exceptions
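The unit-testing practices in this list (one assertion per test case, and tests that target exceptions) can be illustrated with Python's built-in unittest module. The transformation rule `parse_amount` is a hypothetical example, not a rule taken from any real ETL job.

```python
import unittest


def parse_amount(raw):
    """Hypothetical transformation rule: turn a source string into integer cents."""
    if raw is None or raw.strip() == "":
        raise ValueError("amount is required")
    return round(float(raw) * 100)


class ParseAmountTests(unittest.TestCase):
    # Each test makes exactly one assertion.
    def test_converts_decimal_string_to_cents(self):
        self.assertEqual(parse_amount("12.50"), 1250)

    def test_ignores_surrounding_whitespace(self):
        self.assertEqual(parse_amount(" 3 "), 300)

    # Tests that target exceptions: invalid input must be rejected.
    def test_rejects_empty_amount(self):
        with self.assertRaises(ValueError):
            parse_amount("")

    def test_rejects_non_numeric_amount(self):
        with self.assertRaises(ValueError):
            parse_amount("abc")


# Run the suite programmatically so the script does not exit afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseAmountTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping one assertion per test means that a failing test name points directly at the broken rule.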