DataOps Guide for Big Data & ETL

This document discusses implementing a DataOps approach to improve data projects. DataOps applies agile development, continuous integration and deployment, and DevOps principles to data projects. It recommends that organizations focus on culture, processes, and automation tools. Adopting DataOps can help reduce time to market, prevent failed projects, improve data quality, lower production costs, and enable better testing and monitoring for data projects.
DataOps Implementation Guide
DATAOPS FOR BIG DATA, ETL, DATA MIGRATION, BUSINESS INTELLIGENCE REPORTING

Don’t be Siloed… Adopt DataOps

Sandesh Gawande
CTO-ICEDQ | TORANA, INC. STAMFORD CT USA | 203 666 4442 |
CONTACT@ICEDQ.COM
DATAOPS IMPLEMENTATION GUIDE

Contents
Abstract
Problem Statement
Solution:
What is DataOps?
How to implement DataOps?
Why DataOps with iCEDQ results in better Data Quality?
Conclusion
Appendix A: Enable DataOps
Appendix B: Testing and Monitoring Rule Patterns

Abstract
Data projects in the form of data warehouses, data lakes, big data, cloud data migration, BI
reporting and analytics, and machine learning are appearing in every organization. While
project timelines are shrinking, the number of data projects is increasing, as is their complexity.

We have observed that data-centric applications lack the rigor and discipline required
to execute these large and complex projects. While general software projects have adopted
CICD and DevOps principles, data integration and migration projects are still living
under a rock. With the advent of big data and cloud technology, this has become a huge
problem.

Time-to-market for a data project has become critical in organizations of all sizes. This paper
discusses the adoption of DataOps methodologies for data and big data projects, to improve the
success of the project as well as speed up the time-to-market. We further analyse some of the
bottlenecks, such as organizational culture and the lack of data test automation, and how they
hinder the implementation of DataOps. Ultimately, we propose DataOps as a
solution to improve both the delivery of data projects and data quality.

Problem Statement

Data-centric projects are growing in both size and complexity, which
makes execution that much more difficult. This not only creates delays in project execution but
also results in poor data quality. More and more projects are facing:

- Longer Time to Market - The time required for projects is increasing, with many
cloud data migration projects having multi-year timelines.

- Delayed or Failed Projects - Data teams underestimate the complexity of
data projects, resulting in last-minute surprises as well as cost overruns.

- Poor Data Quality - Projects are delayed due to testing issues that are discovered too
late in the project lifecycle.

- User Dissatisfaction and Complaints - Data quality is an afterthought, resulting in high
rates of user dissatisfaction.

- High-Cost Production Fixes - Lack of test automation has resulted in a lot of
refactoring or patchwork in production.

- Testing on Big Data Volumes - The large data volumes have made it generally impossible to
test the data manually.

- Regression Testing Nearly Impossible - After the delivery of the project, revisions to code
or ETL processes require complete regression testing. However, regression testing practices are
missing on the data engineering side.

- Costly Manpower - Manual and repetitive tasks are still not automated and
require either manual work or custom coding, which often takes highly skilled talent
away from other critical work.

While there are many macro and micro issues affecting the delivery of data engineering projects,
the following are some of the underlying causes:

1. Siloed Teams: The team is usually divided into development, QA, operations and business
users. In almost all data integration projects, development teams try to build and test ETL
processes and reports as fast as possible and throw the code over the wall to the operations teams
and business users. However, when data issues start appearing in production, the business
users become unhappy. They point fingers at the operations people, who in turn point fingers at
the QA people. The QA group then puts the blame on the development teams.

During Development of ETL and Release…

…and when data defects are found in production after release!

2. Lack of Code Repository: ETL, database procedures, schemas, schedules and reports are not
treated as code. ETL and reporting tools came into existence in the early nineties, and because
they produced custom ETL objects or reports, their output was not treated as code.

3. Lack of Data Management Repository: Configuration data, reference data and test data are
not managed. A data project requires test data; however, test data is neither created in advance
nor linked to the test cases.

Reference data is required to initialize the database. For example, default values for customer
types have no data source of their own, so they must be created in advance. If the reference data
is missing, none of the ETL processes will work.

Configuration table data must also be prepopulated. Some of the configuration data is used for
incremental or delta processing. Some data values are used to populate metadata about the
processes.
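
The role of configuration data in incremental (delta) processing can be sketched as follows. This is a minimal illustration only: the `etl_config` table, the `load_orders` process name, and the sample rows are hypothetical, and SQLite stands in for whatever database the project actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_config (process_name TEXT PRIMARY KEY, last_loaded_date TEXT);
    CREATE TABLE source_orders (order_id INTEGER, order_date TEXT);
    CREATE TABLE target_orders (order_id INTEGER, order_date TEXT);
    -- Configuration data prepopulated before the first run (hypothetical values).
    INSERT INTO etl_config VALUES ('load_orders', '2024-01-01');
    INSERT INTO source_orders VALUES (1, '2023-12-31'), (2, '2024-01-02'), (3, '2024-01-05');
""")

def incremental_load(conn, process_name):
    """Copy only rows newer than the stored date, then advance the stored date."""
    row = conn.execute(
        "SELECT last_loaded_date FROM etl_config WHERE process_name = ?",
        (process_name,),
    ).fetchone()
    if row is None:
        # Without its configuration row the process cannot decide what to load.
        raise RuntimeError(f"missing configuration row for {process_name}")
    new_rows = conn.execute(
        "SELECT order_id, order_date FROM source_orders "
        "WHERE order_date > ? ORDER BY order_date",
        (row[0],),
    ).fetchall()
    if new_rows:
        conn.executemany("INSERT INTO target_orders VALUES (?, ?)", new_rows)
        conn.execute(
            "UPDATE etl_config SET last_loaded_date = ? WHERE process_name = ?",
            (new_rows[-1][1], process_name),
        )
    conn.commit()
    return len(new_rows)

loaded = incremental_load(conn, "load_orders")
print(loaded)
```

If the configuration row is absent the load fails immediately, which is why this kind of data needs to be managed and deployed alongside the code.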

4. Lack of Test Automation: The way data processes (ETL) and reports are tested is very
different from how software applications are tested. To test an ETL process, the process is
executed first and the resulting data is then compared with the original data to certify it. This
is because quality is determined by comparing expected with actual. The actual data is the data
added or updated by the ETL process, and the expected data is the input data with the data
transformation rule(s) applied.
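
The expected-versus-actual comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method of any particular tool: the `transform` rule, the column names, and the sample rows are all invented for the example.

```python
def transform(row):
    """Hypothetical transformation rule: uppercase the name, convert cents to dollars."""
    return {"id": row["id"], "name": row["name"].upper(), "amount": row["amount_cents"] / 100}

def compare(expected_rows, actual_rows, key="id"):
    """Return keys missing from the target, unexpected in the target, and mismatched."""
    expected = {r[key]: r for r in expected_rows}
    actual = {r[key]: r for r in actual_rows}
    return {
        "missing": sorted(expected.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - expected.keys()),
        "mismatched": sorted(
            k for k in expected.keys() & actual.keys() if expected[k] != actual[k]
        ),
    }

source_rows = [
    {"id": 1, "name": "ann", "amount_cents": 1250},
    {"id": 2, "name": "bob", "amount_cents": 300},
    {"id": 3, "name": "cy", "amount_cents": 999},
]
# Simulated ETL output: row 2 has a wrong amount and row 3 was never loaded.
target_rows = [
    {"id": 1, "name": "ANN", "amount": 12.5},
    {"id": 2, "name": "BOB", "amount": 30.0},
]

# Expected = input data with the transformation rule applied.
expected_rows = [transform(r) for r in source_rows]
result = compare(expected_rows, target_rows)
print(result)
```

The comparison can only run after the ETL has produced `target_rows`, which is the ordering constraint described above.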

5. Lack of Automated Build and Deployment: Since most ETL and report developers use GUI
tools to create their processes, the code is not visible. The ETL tool stores the code directly in
its repository. This creates a false narrative that since there is no code, there’s no need to
manage, version or integrate it. The majority of ETL tools now provide APIs to import and deploy
the code into different environments, but this functionality is often ignored.

6. Lack of Agile & Test-Driven Development (TDD): While data transformation rules are
provided to developers, the business doesn’t share testing and monitoring requirements during
development. Only once the developers have completed development does the focus shift to
testing. This is late in the process, and quite often this is when users start complaining. Data
monitoring issues are also considered only at this late stage.

7. Lack of Whitebox Monitoring: Data quality and governance are an afterthought. Developers
neither seek nor integrate hooks into their data processes to monitor data quality once it’s in
production. When the system goes live, there is nothing available for operations to certify the
data.
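
One way to picture such a hook is a small registry of data checks that a pipeline step calls every time it runs. The sketch below is hypothetical throughout (the `MonitoringHook` class, the check names, and the `customer_id` column are assumptions for illustration) and does not represent any specific product.

```python
from dataclasses import dataclass, field

@dataclass
class MonitoringHook:
    """Collects named checks registered by developers and evaluates them on demand."""
    checks: dict = field(default_factory=dict)

    def register(self, name, check):
        self.checks[name] = check

    def run(self, rows):
        # Each check returns a truthy value when the data passes.
        return {name: bool(check(rows)) for name, check in self.checks.items()}

hook = MonitoringHook()
hook.register("not_empty", lambda rows: len(rows) > 0)
hook.register(
    "no_null_customer",
    lambda rows: all(r.get("customer_id") is not None for r in rows),
)

def load_step(rows, hook):
    """A pipeline step that loads rows and reports data quality through the hook."""
    loaded = list(rows)  # stand-in for the real load
    return loaded, hook.run(loaded)

_, report = load_step([{"customer_id": 7}, {"customer_id": None}], hook)
print(report)
```

Because the checks are registered during development, operations can run the same checks in production without reading the pipeline code.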

8. Lack of Regression Testing: After the system goes live, if any data issues are found, the
development team must go back and fix the code. Regression testing then becomes a big challenge,
since previous test cases must be rerun to test the ETL flow. If the team has not used a test
automation tool that stores the rules in a repository, no earlier test cases will exist to rerun.

Solution:
Many of the problems described above are already solved in the software development
world by implementing concepts such as Agile Development, CICD, Test Automation, and DevOps.
It’s time the data world borrowed some of these ideas and adopted them as well.

What is DataOps?
DataOps is the application of Agile Development, Continuous Integration, Continuous Deployment,
Continuous Testing methodologies and DevOps principles, with the addition of some data-specific
considerations, to a data-centric project. It could be any data integration or data migration
project, such as data warehouses, data lakes, big data, ETL, data migration, BI reporting and
cloud migration.

How to implement DataOps?


To implement DataOps the organization must focus on three things:
- People and their culture
- Defining standard practices and processes
- Automated testing and monitoring tools

DataOps = 1. Culture + 2. Tools + 3. Practices

A. Identify the people and their culture – In a data project there are many types of resources.
However, their roles also define their boundaries. Developers and testers are on one side of the
wall, and business users, operations and data stewards are on the other side.
DataOps is about removing this wall, and the first cultural change required for DataOps is to:
- Tell the development team that they are responsible for data quality issues that
appear in production environments.
- Tell the business users it’s their responsibility to provide the data transformation
requirements as well as Audit Rules for Validation and Reconciliation of data.

This small change will ensure the developers involve the business users and data stewards
right from the beginning of the project. DataOps adoption results in a transformation of
organizational culture, automating every aspect of the SDLC from test automation to production
data monitoring. Beyond that, DataOps results in a culture shift that breaks down the
barriers between development and operations teams. There is no more
throwing code over the wall and running away from the responsibilities.

DataOps Transforms the Culture of the Organization.

With DataOps everyone is on the same side of the wall!

Now, instead of sequential steps, developers can create the design and develop the tests in
parallel with the development of the data pipeline. By using non-linear timelines, Time-to-Market
is now 33% faster.

B. Get the automation tools for DataOps – DataOps is not possible without proper automation
tools. The organization must acquire multiple software platforms to support DataOps, such as:
a. Code Repository, Ex. Git
b. QA software for Data Test Automation, Ex. iCEDQ
c. Test Data Repository, Ex. Stored in dedicated database or file server
d. CICD software, Ex. Jenkins
e. Production Data Monitoring Software, Ex. iCEDQ
f. Issue management software, Ex. Jira, ServiceNow

The idea is to continuously integrate, deploy, test and monitor the data and processes in an
automated fashion. The purpose of each tool will be clearer with the process diagrams in the
section below.

C. Define the DataOps Practice – With the people and the tools in place, define the requirements
process, development process, data testing process, test data management, production data
monitoring and defect tracking.
a. Develop and Integrate in a Code Repository

Store the code in a repository. The main requirement is to ensure the code is accessible to an
automation tool. The code in data-centric projects is a combination of ETL code, BI report code,
scheduler/orchestration code, database procedures, database schema (DDL) and some DML. Both ETL
and report code must be captured and stored in a repository.

The test automation repository should consist of test cases, regression packs, test automation
scripts and production data monitoring rules. These tests can be called on demand by a CICD
script. This ensures all testing and monitoring rules are stored and accessible in the future.
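
As a rough sketch of what "rules stored in a repository and called on demand" can look like, the example below keeps a small rule pack as JSON and runs every rule against a database. The rule format, the rule names, and the `orders` table are assumptions made for illustration; a real test automation tool will have its own rule format.

```python
import json
import sqlite3

# Hypothetical rule pack as it might be stored in a version-controlled file.
RULE_PACK = json.loads("""
[
  {"name": "orders_not_empty",
   "sql": "SELECT COUNT(*) FROM orders",
   "expect": "nonzero"},
  {"name": "no_negative_amount",
   "sql": "SELECT COUNT(*) FROM orders WHERE amount < 0",
   "expect": "zero"}
]
""")

def run_rule_pack(conn, rules):
    """Run every stored rule and return {rule name: passed}."""
    results = {}
    for rule in rules:
        value = conn.execute(rule["sql"]).fetchone()[0]
        if rule["expect"] == "zero":
            results[rule["name"]] = value == 0
        else:
            results[rule["name"]] = value != 0
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, -5.0)])
results = run_rule_pack(conn, RULE_PACK)
print(results)
```

Because the rules are data rather than ad hoc queries, the same pack can be rerun unchanged as a regression suite after every code change.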

DataOps has a very special data component required for configuration, testing and database
initialization. Eventually, when the system is fully live, the ETL process will collect data, the
database will store it and the reports will show it. Whenever new data processes are added or
updated, usually some data must be prepopulated into the database, e.g., reference data, test
data or configuration data. Configuration data could be dates required for incremental loads.
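
The sketch below shows one possible way to prepopulate reference data and then verify that the database was initialized correctly. The `customer_type` table, its columns, and its rows are hypothetical, and SQLite's `PRAGMA table_info` is used only because it is convenient for a self-contained example.

```python
import sqlite3

# Hypothetical expected structure and reference rows for one table.
EXPECTED_COLUMNS = {"customer_type": {"type_code": "TEXT", "description": "TEXT"}}
REFERENCE_ROWS = [("RET", "Retail"), ("WHL", "Wholesale")]

def initialize(conn):
    """Create the table and load the reference data that has no source system."""
    conn.execute(
        "CREATE TABLE customer_type (type_code TEXT PRIMARY KEY, description TEXT)"
    )
    conn.executemany("INSERT INTO customer_type VALUES (?, ?)", REFERENCE_ROWS)
    conn.commit()

def check_initialization(conn):
    """Return a list of problems; an empty list means the environment is ready."""
    problems = []
    for table, columns in EXPECTED_COLUMNS.items():
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column
        # and no rows at all when the table does not exist.
        found = {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}
        if not found:
            problems.append(f"{table}: table is missing")
        elif found != columns:
            problems.append(f"{table}: column mismatch")
        elif conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0] == 0:
            problems.append(f"{table}: reference data is missing")
    return problems

conn = sqlite3.connect(":memory:")
before = check_initialization(conn)
initialize(conn)
after = check_initialization(conn)
print(before, after)
```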

b. Implement CICD Pipeline

DataOps changes both the culture as well as the processes.

1. Continuous Integration - From the previous section it’s clear that all code must be stored in
some repository and be available for DevOps automation. With code in a repository it becomes
easy to manage various code branches and versions. Based on the release plan, code can be
selected and integrated with the help of CICD tools like Jenkins.
2. Continuous Deployment - The integrated code is pulled by Jenkins and deployed with the
help of APIs or command-line import and export utilities. Depending on the code type, the
code is pushed to a database, ETL or reporting platform. Further, CICD tools will also
deploy initialization data in the database. This creates the necessary QA or production
environment, which is ready for further execution.
3. Initialization Tests - Once the environment is ready with code and data, the CICD tool will
execute iCEDQ rules to validate the data structures (database objects, tables, columns,
datatypes, etc.) as well as the initial data.
4. ETL/Report Execution - The next step for the CICD tool is to execute the scheduler to
orchestrate execution of the ETL processes and reports.
5. ETL/Report Testing - Once the data is loaded by ETL and reports are executed, iCEDQ
can run the tests and verify the validity of both the ETL and the reports. (This is
unique to DataOps because without first executing the ETL or the reports, there is no way
to do the data testing.)
6. Production Monitoring - Once the system is live, the hooks left by the development and
QA teams will be used for monitoring the production systems, which is sometimes
referred to as white box monitoring. The business also benefits, as it now has hooks
(testing rules) developed by QA teams available to monitor the production data pipeline
on an ongoing basis.
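
The ordering of these six steps can be sketched as a simple stage runner that stops at the first failure. This is a toy illustration: the stage names mirror the list above, but each stage is a placeholder function rather than a real call to Jenkins, a scheduler, or a testing tool.

```python
def run_pipeline(stages):
    """Run (name, callable) stages in order; stop at the first stage that returns False."""
    completed = []
    for name, stage in stages:
        if not stage():
            return {"status": "failed", "failed_stage": name, "completed": completed}
        completed.append(name)
    return {"status": "ok", "failed_stage": None, "completed": completed}

# Placeholder stages: in practice each would invoke the relevant tool.
stages = [
    ("integrate", lambda: True),
    ("deploy", lambda: True),
    ("initialization_tests", lambda: True),
    ("etl_execution", lambda: True),
    ("etl_testing", lambda: False),  # simulate a failed data test
    ("production_monitoring", lambda: True),
]
outcome = run_pipeline(stages)
print(outcome)
```

Note that `etl_testing` comes after `etl_execution`: the data tests depend on data that only exists once the ETL has run.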

Production Data Monitoring

a. Once the system is online and running based on the schedules, the Audit Rules
in iCEDQ will also start running.
b. When iCEDQ notices any discrepancies in the data, it will identify the specific
data issues and raise alerts.
c. The issue logging system can then be used as a source of changes to the data
pipeline or of simple updates to the data.
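
A reconciliation-style audit rule of the kind described here can be sketched as follows. The example is generic and hypothetical: it does not use the iCEDQ product or its API, the table and column names are invented, and a plain Python list stands in for the issue logging system.

```python
import sqlite3

def reconcile(conn, source_table, target_table, amount_column):
    """Compare row count and summed amount between a source and a target table."""
    def summary(table):
        return conn.execute(
            f"SELECT COUNT(*), COALESCE(SUM({amount_column}), 0) FROM {table}"
        ).fetchone()
    return {"source": summary(source_table), "target": summary(target_table)}

def audit(conn, issues):
    """Run the reconciliation rule and append an issue record when it fails."""
    result = reconcile(conn, "source_payments", "target_payments", "amount")
    if result["source"] != result["target"]:
        issues.append({"rule": "payments_reconciliation", "detail": result})
        return False
    return True

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_payments (id INTEGER, amount INTEGER);
    CREATE TABLE target_payments (id INTEGER, amount INTEGER);
    INSERT INTO source_payments VALUES (1, 100), (2, 250);
    INSERT INTO target_payments VALUES (1, 100);
""")
issues = []  # stand-in for an issue logging system
passed = audit(conn, issues)
print(passed, issues)
```

The logged issue carries both summaries, so whoever picks it up can see whether the fix belongs in the pipeline code or in the data.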

If the code changes because of defects found in the data, or a new business requirement
is discovered, the DataOps cycle repeats.

Why DataOps with iCEDQ results in better Data Quality?


One of the direct impacts of DataOps is the improvement of data quality for the data pipeline.
There are three core reasons for this:
I. Cultural Changes
- The concept of us versus them is gone. The developers are responsible for the quality of
data in production.
- Business users are involved early, which forces them to provide business
requirements as well as the data testing and monitoring requirements.
- All these checks and validations are added in iCEDQ and can be further used to
test and/or monitor the data in production.
II. Automation of Testing
- DataOps results in test automation, which can improve productivity by 70% over
manual testing.
- Now that the tests are automated, the test coverage can improve by more
than 200%. Some tests are time-consuming if done manually; however, with
DataOps automation there are no such limitations. There can now be an
increase in both the number of tests and the complexity of the tests
that can be run.
- The cost of production monitoring and refactoring of code is reduced as more
defects are captured early in the life cycle of the data pipeline.
- Test and monitoring automation also enables regression testing. The testing and
monitoring rules are stored in the system and can be recalled as needed
during the regression tests.
III. Production Monitoring
- Some of the tests created during development and QA in iCEDQ are reused in
the production environment to monitor the data.
- Automation of monitoring also removes the limits on the volume of data that
can be scanned. Organizations can move from sampling data to big data
without any issues. With its Big Data edition, the iCEDQ platform can monitor the
production data without data volume constraints.
- The iCEDQ rules can be embedded in the ETL batch, or rules can run
periodically with its built-in scheduler.
- iCEDQ notifies the workflow or ticketing systems whenever there is a data
issue.

Conclusion
DataOps is all about reducing data organization silos and automating the data engineering
pipeline. CDOs, business users and data stewards are involved early in the data development life
cycle. It forces the organization to automate all its processes, including testing. The data
quality tasks are now implemented early in the project life cycle. This provides enormous
benefits to the data development team as well as to the operations and business teams who deal
with data issues occurring in production environments:
- Faster Time-to-Market
- Improved Data Quality
- Lower Cost Per Defect

Appendix A: Enable DataOps

Appendix B: Testing and Monitoring Rule Patterns
