How to test an Azure Data Pipeline?
Shivanand Veerbhadrannavar
Innovative Technical Manager | Data Quality & Analysis Expert | ML and
Strategic Data Management Leader
Published Oct 15, 2023
In today’s data-driven world, organisations often need to
extract, transform, and load (ETL) data from various sources
to support their business processes. Azure Data Factory
(ADF) is one of the powerful cloud-based services provided by
Microsoft that enables you to build and manage ETL pipelines
at scale. In this blog, we will walk you through the process of
using Azure Data Factory to validate an ETL pipeline.
Prerequisites
To follow along with this blog, you will need:
1. An Azure subscription: Sign up for a free Azure
account if you don’t have one
2. Azure Data Factory: Create an Azure Data Factory
instance in your Azure portal
3. Knowledge of Azure cloud services and their uses,
such as Storage accounts, SQL Server, SQL Database,
etc.
Steps to create data pipeline in ADF:
Step 1: Set up Azure Data Factory
1. Go to the Azure portal and create a new Azure Data
Factory instance
2. Provide the necessary details such as name,
subscription, resource group, and location
3. Once the Data Factory instance is created, navigate
to it and click on the “Author & Monitor” button to
open the Azure Data Factory user interface
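If you prefer to script this setup, the Data Factory instance can also be provisioned programmatically. Below is a minimal sketch using the azure-identity and azure-mgmt-datafactory Python packages; the subscription ID, resource group, factory name, and region are placeholders you would replace with your own values, and exact model names can vary slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholder values - replace with your own subscription and resource group
subscription_id = "<your-subscription-id>"
resource_group = "etl-test-rg"
factory_name = "etl-test-adf"

# Authenticate with whatever credential is available (CLI login, managed identity, etc.)
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Create (or update) the Data Factory instance in the chosen region
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)
print(f"Provisioned data factory: {factory.name}, state: {factory.provisioning_state}")
```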
Step 2: Create Linked Services
1. Click on the “Author” button in the Azure Data
Factory user interface to start building your ETL
pipeline
2. Begin by creating linked services, which represent
the connection information to your data sources and
destinations
3. Click on the “Manage” tab, select the desired type of
data source or destination, and provide the required
connection details. For instance, you can create
linked services for Azure SQL Database, Azure Blob
Storage, or an on-premises SQL Server
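The same linked services can also be registered from code. The sketch below assumes the adf_client, resource_group, and factory_name from the previous sketch; the connection strings are placeholders (real secrets belong in Azure Key Vault), and the model or parameter names may differ slightly between SDK versions.

```python
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    AzureBlobStorageLinkedService,
    AzureSqlDatabaseLinkedService,
    SecureString,
)

# Placeholder connection strings - store real secrets in Azure Key Vault instead
blob_conn = SecureString(value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")
sql_conn = SecureString(value="Server=tcp:<server>.database.windows.net;Database=<db>;User ID=<user>;Password=<pwd>")

# Linked service for the source Blob Storage account
blob_ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(connection_string=blob_conn)
)
adf_client.linked_services.create_or_update(resource_group, factory_name, "ls_blob_source", blob_ls)

# Linked service for the destination Azure SQL Database
sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(connection_string=sql_conn)
)
adf_client.linked_services.create_or_update(resource_group, factory_name, "ls_sql_target", sql_ls)
```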
Step 3: Create Datasets
1. After setting up linked services, proceed to create
datasets. Datasets define the structure and location
of your source and destination data
2. Click on the “Author” tab and select the “Datasets”
tab
3. Create a dataset for each data source and
destination, specifying the format, location, and
linked service information. For instance, you can
create a dataset for a CSV file in Azure Blob Storage
or a table in an Azure SQL Database
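As a rough sketch, and again assuming the client and linked services created above, a delimited-text dataset for the source CSV and a table dataset for the SQL destination could be defined as follows; the container, file, schema, and table names are illustrative only.

```python
from azure.mgmt.datafactory.models import (
    DatasetResource,
    DelimitedTextDataset,
    AzureSqlTableDataset,
    AzureBlobStorageLocation,
    LinkedServiceReference,
)

# Reference the linked services registered earlier
blob_ls_ref = LinkedServiceReference(reference_name="ls_blob_source", type="LinkedServiceReference")
sql_ls_ref = LinkedServiceReference(reference_name="ls_sql_target", type="LinkedServiceReference")

# Source dataset: a CSV file in a blob container (names are illustrative)
source_ds = DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=blob_ls_ref,
        location=AzureBlobStorageLocation(container="<source-container>", file_name="<source-file>.csv"),
        column_delimiter=",",
        first_row_as_header=True,
    )
)
adf_client.datasets.create_or_update(resource_group, factory_name, "ds_source_csv", source_ds)

# Destination dataset: a table in the Azure SQL Database
sink_ds = DatasetResource(
    properties=AzureSqlTableDataset(
        linked_service_name=sql_ls_ref,
        schema_type_properties_schema="dbo",
        table="<target-table>",
    )
)
adf_client.datasets.create_or_update(resource_group, factory_name, "ds_sink_table", sink_ds)
```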
Step 4: Build Pipelines
1. With the linked services and datasets in place, it’s
time to create pipelines that define the ETL workflow
2. Click on the “Author” tab, select the “Pipelines” tab,
and click on the “New pipeline” button
3. Drag and drop activities onto the pipeline canvas to define
the ETL steps
4. Configure each activity based on the data movement or
transformation required. For instance, you can use the “Copy
data” activity to move data from a source dataset to a
destination dataset, or the “Data Flow” activity to perform
complex transformations using Azure Data Factory Data
Flows (a sketch follows after this list)
5. For reference, predefined pipeline templates allow you to
get started quickly with Data Factory. Templates are useful
when you are new to Data Factory, and they reduce the
development time for building data integration projects,
thereby improving developer productivity
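To make item 4 concrete, here is a minimal sketch of a pipeline containing a single Copy activity between the datasets defined earlier, followed by a manual trigger of a run. It assumes the same client and resource names as the previous sketches, and the source/sink types shown are one reasonable pairing rather than the only option.

```python
from azure.mgmt.datafactory.models import (
    PipelineResource,
    CopyActivity,
    DatasetReference,
    DelimitedTextSource,
    AzureSqlSink,
)

# A Copy activity that moves data from the CSV dataset to the SQL table dataset
copy_activity = CopyActivity(
    name="CopyCsvToSql",
    inputs=[DatasetReference(reference_name="ds_source_csv", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="ds_sink_table", type="DatasetReference")],
    source=DelimitedTextSource(),
    sink=AzureSqlSink(),
)

# Wrap the activity in a pipeline and publish it
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(resource_group, factory_name, "pl_copy_csv_to_sql", pipeline)

# Trigger a run on demand; the run id is used later for monitoring
run_response = adf_client.pipelines.create_run(resource_group, factory_name, "pl_copy_csv_to_sql")
print(f"Started pipeline run: {run_response.run_id}")
```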
Step 5: Monitor and Manage ETL Pipeline
1. After building the ETL pipeline, it’s crucial to monitor
its execution and manage its performance
2. In the Azure Data Factory user interface, click on the
“Monitor” button to access the monitoring dashboard
3. Monitor the pipeline runs, track data movement, and
troubleshoot any issues that arise during execution.
4. Utilise the integration with Azure Monitor and Azure
Log Analytics for more advanced monitoring and
analytics capabilities
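Beyond the Monitor dashboard, run status can also be checked programmatically, which is handy for automated test suites. A rough sketch with the same SDK, using the run_id returned when the pipeline was triggered in the previous step:

```python
import time
from datetime import datetime, timedelta, timezone
from azure.mgmt.datafactory.models import RunFilterParameters

# Poll the pipeline run until it reaches a terminal state
run = adf_client.pipeline_runs.get(resource_group, factory_name, run_response.run_id)
while run.status in ("Queued", "InProgress"):
    time.sleep(15)
    run = adf_client.pipeline_runs.get(resource_group, factory_name, run_response.run_id)
print(f"Pipeline run finished with status: {run.status}")

# List the individual activity runs for troubleshooting
filter_params = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(hours=1),
    last_updated_before=datetime.now(timezone.utc) + timedelta(hours=1),
)
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    resource_group, factory_name, run_response.run_id, filter_params
)
for act in activity_runs.value:
    print(act.activity_name, act.status, act.error)
```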
Step 6: Test and Validate the Pipeline
1. Test the ETL pipeline by running it manually or
scheduling it to run at specific intervals
2. Monitor the execution of the pipeline and verify that
each Activity completes successfully
3. Validate the data movement, transformation, and
loading processes by checking the output in the
destination system or running queries against the
data
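One way to automate the validation in item 3 is a simple row-count comparison between the source file and the destination table. Below is a minimal sketch assuming pyodbc and local access to the source CSV; the connection details, file name, and table name are placeholders.

```python
import csv
import pyodbc

# Placeholder connection string for the destination database
conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<db>;UID=<user>;PWD=<pwd>"
)

# Count data rows in the source CSV (skipping the header)
with open("<source-file>.csv", newline="") as f:
    source_rows = sum(1 for _ in csv.reader(f)) - 1

# Count rows loaded into the destination table
with pyodbc.connect(conn_str) as conn:
    target_rows = conn.cursor().execute("SELECT COUNT(*) FROM dbo.TargetTable").fetchone()[0]

# The pipeline should have loaded every source record exactly once
assert source_rows == target_rows, f"Row count mismatch: {source_rows} vs {target_rows}"
print(f"Row counts match: {source_rows}")
```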
By following these steps, you can create an ETL pipeline in
Azure Data Factory for testing practice. Remember to iterate
and refine your pipeline based on feedback and continuously
improve the testing process.
Validation scenarios
As a tester for an Azure Data Factory pipeline, your role is to
ensure that the ETL process runs smoothly and that the data
transformation is accurate and reliable. Before starting, it is
good to have an understanding of the data model, the
source-to-target mapping, the data dictionary, metadata
information, and environment readiness. Here is a list of
validations to consider when testing an Azure Data Factory
pipeline:
Remember to document your test cases, including inputs,
expected outputs, and actual results, to facilitate tracking
and issue resolution. Regular regression testing should also
be conducted when changes or updates are made to the
pipeline to ensure ongoing reliability and accuracy of the ETL
process.
Sample Use cases
Use Case 1: Copying Data from Azure Blob Storage to
Azure SQL Database
Let’s consider an example where we want to copy data from
a CSV file stored in Azure Blob Storage to an Azure SQL
Database table.
1. Create linked services for Azure Blob Storage and
Azure SQL Database, providing the necessary
connection information
2. Create datasets for the source CSV file and the
destination SQL Database table, specifying the
respective linked services and file formats.
3. Build a pipeline
The pipeline above first truncates the table containing the
previously loaded data and then loads the newly received CSV
file as defined by the input dataset. In this run, both activities
complete successfully.
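Because this pipeline truncates the target before loading, a useful automated check is that the table holds exactly one copy of the data and that mandatory columns were populated. The sketch below reuses the pyodbc approach from Step 6; the table and column names are purely illustrative, not taken from the pipeline above.

```python
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<db>;UID=<user>;PWD=<pwd>"
)

with pyodbc.connect(conn_str) as conn:
    cur = conn.cursor()

    # No duplicate business keys should survive a truncate-and-reload
    dupes = cur.execute(
        "SELECT BusinessKey, COUNT(*) AS cnt "
        "FROM dbo.TargetTable GROUP BY BusinessKey HAVING COUNT(*) > 1"
    ).fetchall()
    assert not dupes, f"Duplicate keys found after reload: {dupes}"

    # Spot-check that a mandatory column was populated for every row
    nulls = cur.execute(
        "SELECT COUNT(*) FROM dbo.TargetTable WHERE MandatoryColumn IS NULL"
    ).fetchone()[0]
    assert nulls == 0, f"{nulls} rows loaded with a missing mandatory value"

print("Truncate-and-load validation passed")
```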
You may also consider some negative scenarios, such as
running the pipeline with a missing source file or a missing
target table. Appropriate error messages with meaningful
information should be generated for troubleshooting.
Missing target table:
Missing input file:
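For these negative scenarios, an automated test can assert that the run fails and that the error message carries useful detail. A minimal sketch, reusing the adf_client from the earlier steps and assuming failed_run_id is the ID of a run triggered against the deliberately broken input:

```python
# failed_run_id refers to a run triggered with the missing file or table
run = adf_client.pipeline_runs.get(resource_group, factory_name, failed_run_id)

# The run must fail, and the message should point at the missing object
assert run.status == "Failed", f"Expected a failed run, got {run.status}"
assert run.message, "Failed run should carry a troubleshooting message"
print(run.message)
```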
Suggestions:
All defined columns will have a fixed mapping by default. A
fixed mapping takes a defined, incoming column and maps it
to an exact name.
If necessary, use schema drift throughout your flow to protect
against schema changes from the sources.
When schema drift is enabled, all incoming fields are read
from your source during execution and passed through the
entire flow to the Sink. By default, all newly detected
columns, known as drifted columns, arrive as a string data
type. If you wish for your data flow to automatically infer data
types of drifted columns, check Infer drifted column types in
your source settings.
Use Case 2: Delta data loading from SQL DB
Delta load refers to the process of identifying and loading
only the changed or new data since the last execution of the
pipeline. It involves comparing the source data with the
previously processed data and selecting the records that
meet specific criteria, such as updated timestamps or new
identifiers.
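One common way to implement this is a high-watermark pattern: keep the last processed timestamp, select only rows modified after it, and advance the watermark once the load succeeds. The sketch below shows the core logic in Python with pyodbc; the watermark table, source table, and column names are illustrative and not taken from the pipeline shown here.

```python
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<db>;UID=<user>;PWD=<pwd>"
)

with pyodbc.connect(conn_str) as conn:
    cur = conn.cursor()

    # Read the last successfully processed timestamp (the high watermark)
    old_watermark = cur.execute(
        "SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'SourceOrders'"
    ).fetchone()[0]

    # Only rows changed since the previous run qualify as the delta
    delta_rows = cur.execute(
        "SELECT * FROM dbo.SourceOrders WHERE LastModifiedDate > ?", old_watermark
    ).fetchall()
    print(f"{len(delta_rows)} delta rows identified")

    # After the delta has been loaded downstream, advance the watermark
    cur.execute(
        "UPDATE dbo.WatermarkTable SET WatermarkValue = "
        "(SELECT MAX(LastModifiedDate) FROM dbo.SourceOrders) "
        "WHERE TableName = 'SourceOrders'"
    )
    conn.commit()
```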
The pipeline below is built to load delta records. In this run,
three rows are identified as the delta.
By implementing delta load in your ADF pipeline, you can
efficiently process incremental data updates, reduce
processing time and resources, and keep your destination
system synchronised with the changes happening in the
source system.
Use Case 3: File archive
File archiving involves moving files from the primary storage
location to a separate storage container or archive. This
process is typically performed to free up space in the primary
storage and ensure long-term retention of files while still
allowing access if needed.
Check that the DimAccount.txt file was deleted from the blob
storage container ‘etltestsourcefiles’ and moved to the archive
folder ‘archivedfiles’.
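This check can also be automated. The sketch below uses the azure-storage-blob Python package to confirm that the file is gone from the source container and present in the archive location; the connection string is a placeholder, the container and file names follow the example above, and you may need to adjust the blob path if your archive target is a folder inside another container.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string for the storage account
service = BlobServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
)

source_blob = service.get_blob_client(container="etltestsourcefiles", blob="DimAccount.txt")
archive_blob = service.get_blob_client(container="archivedfiles", blob="DimAccount.txt")

# After archiving, the file should be gone from the source and present in the archive
assert not source_blob.exists(), "DimAccount.txt is still present in the source container"
assert archive_blob.exists(), "DimAccount.txt was not found in the archive location"
print("Archive validation passed")
```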
By incorporating file archiving in Azure Data Pipeline creation,
organisations can effectively manage data retention, improve
system performance, ensure compliance, and enable efficient
data backup and recovery.
Challenges and Solutions
As the tester responsible for testing the Azure data pipeline, I
wanted to communicate the potential challenges that we may
face during the testing process and how we plan to overcome
them.
Conclusion
Azure Data Factory provides a robust platform for building
data pipelines, allowing testers to define validation
checkpoints at different stages of the pipeline. From
connectivity and configuration checks to data completeness,
integrity, and transformation accuracy, testers can design
comprehensive test cases to verify the data movement,
transformations, and loading processes.
Effective data validation in Azure Data Pipelines requires
meticulous planning, well-designed test cases, and
continuous monitoring. Testers should validate the data
against expected results, check for any discrepancies or
errors, and ensure compliance with industry regulations and
security standards.
In conclusion, Azure Data Pipeline provides a powerful
platform for testers to perform data validations, ensuring the
accuracy, integrity, and quality of the data being processed.
With the right approach, thorough testing, and adherence to
best practices, testers can contribute to the success of Azure
Data Factory projects and enable organisations to make
informed decisions based on trustworthy data.