6 best practices for
cloud data integration
This guide describes the challenges involved in
achieving cloud interoperation and provides best
practices and solutions, including log-based
change data capture.
6 best practices for cloud data integration 1
Cloud adoption is accelerating. Given its central role in the enterprise, IDC forecasts that "whole cloud" spending (which includes total worldwide spending on cloud services, the hardware and software components underpinning the cloud supply chain, and the professional/managed services opportunities around cloud services) will surpass $1.3 trillion by 2025.

One reason for the growth: Organizations often adopt cloud technologies for analytic use cases. The cloud enables access to robust analytics services at scale using a pay-for-use scheme. Importantly, companies can avoid significant upfront investments. Instead of building a configuration in a data center, they can experiment with the different options available to them and quickly scale up or down with the cloud provider’s consumption-based model. Cloud will continue to play an even greater role as on-premises systems reach end-of-life and businesses focus on delivering greater efficiency, flexibility and faster innovation.

Most large organizations adopting cloud solutions will, at least initially, run a hybrid cloud. The term hybrid cloud refers to two types of deployments. Hybrid clouds can consist of multiple cloud-based solutions from one or more cloud vendors. Alternatively, they can combine cloud solutions with on-premises systems.

Two common types of hybrid cloud environments
Multiple solutions from one or more cloud vendors
Cloud solutions combined with on-premises systems

Two factors to consider before adopting the hybrid cloud

Hybrid cloud is different from multiple geographically separated data centers for two reasons:

❶ You have less control over your deployment. You must trust that the cloud provider’s solution is at least as highly available and secure as your current choice.

❷ The cloud offers a variety of deployment models. For example, in the pure on-premises world, you manage your databases. In a hybrid cloud environment, you can manage your databases or sign up to use a DBaaS, or database-as-a-service. This follows a path similar to the one Salesforce laid out when it showed the world that companies could access their CRM system as a SaaS platform.

In the cloud, you have more options relative to the on-prem world, each with benefits and disadvantages. With cloud access, you are more likely to consider those alternatives.
Why go hybrid?

Organizations often adopt a hybrid cloud strategy when deploying their first analytical solution to the cloud. The cloud combines nearly unlimited scalability with consumption-based pricing. These are desirable attributes for analytical environments that require scalability and can be very expensive when scaled for maximum capacity over an extended period.

An organization’s primary business process is generally supported by one or more operational systems. Because these environments are crucial, businesses must have optimum access and continuous availability.

Deployments and systems can be quite complex. Additionally, because these operational systems are core to the organization’s primary business processes, their data should be included in the analytical environment.

Migrations of operational systems require considerable testing to ensure that the cloud environment has similar or better performance than the existing, often on-prem, systems. These systems must have the same or higher levels of availability and data must be secure. Environments also have to perform well under heavy user load.

The IT team must perform further testing to ensure everything works properly after the migration. Integrations must be rewritten to work with the new system, too. Depending on the criticality of the system, organizations may also need a fallback for some time after the initial migration.

Hybrid cloud data integration best practices

Hybrid cloud integration is not easy. Common challenges and considerations during data integration in hybrid environments include:

❶ Impact on the data sources
❷ Network efficiency
❸ Data security
❹ Compatibility across heterogeneous environments
❺ Fallback
❻ Latency

The following six best practices help ensure a smooth data integration that effectively delivers high performance, security and availability across all heterogeneous data sources.

1. Learn the impact on operations

Organizations need the data in the operational systems that drive the business for consolidated analytical environments.

For example, if you are a manufacturer who wants to sell your customers preventive maintenance solutions, you need access to factory data. What items are planned to be produced when and where? How do you get them to your customers? Do they need one of your experts to help with installation?
Operational systems drive the primary business process. As much as we want access to data, we don't want these systems to slow down. Therefore, we must find a solution that captures the changes going into your operational system with minimal overhead.

For many self-hosted applications, consider a database-level change data capture (CDC) solution, of which log-based CDC is widely considered the least intrusive. Because critical systems contain the most important data to help drive decisions, real-time access to this data is needed to stay competitive. Log-based CDC handles the highest volumes of change data in real time, enabling organizations to make informed, data-driven decisions more quickly.

Depending on where you are in your hybrid cloud adoption, you may consider alternatives for your current deployment model. For example, if you currently host your organization's ERP in an on-prem data center, you may consider cloud-hosted or software-as-a-service options. Will your future deployment option provide a level of flexibility similar to that of your current deployment model? As you optimize for no operational impact on your environment today, will you be able to leverage the same solution in the future? If not, do you have alternative options that address your data integration requirements?

Also, you may need the primary system to synchronize with the new or old configuration during the migration to allow for testing and provide a fallback option.

2. Increase network efficiency

With widespread access to high-speed connectivity, is network efficiency still relevant? Consider these aspects:

❶ Network bandwidth is finite. When bandwidth reaches its peak, you cannot use more of it. When you add more load on a network that has reached capacity, you may increase latency. Higher latency in turn lowers effective throughput, because network transfers must wait for acknowledgments.

❷ Cloud ingress is free for most cloud providers. However, egress typically is not. If you transfer data out of a cloud (and remember that hybrid cloud may involve multiple clouds), you pay less if you transfer less data.

To improve data transfer rates beyond maximum bandwidth, you can use compression: Transfer less data by applying an agreed compression algorithm. If you can achieve 5x compression, you have effectively multiplied your bandwidth by 5, and cloud egress costs drop by the same factor.

Likewise, using CDC instead of full extracts will limit the bandwidth you need. Even better: Filter the data before it is sent across the wire. For example, use a filter condition on data retrieval or an agent to identify the required changes.

Lastly, consider the efficiency of network communication. Sending fewer large data blocks is better than transferring many small data blocks. Across a wide area network (WAN), you want to avoid a chatty communication protocol.
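The compression arithmetic above can be illustrated with a minimal Python sketch. It uses the standard-library zlib module as a stand-in for whatever algorithm both ends agree on. The sample change records, the 100 Mbit/s link speed and the $0.09-per-GB egress price are illustrative assumptions, not figures from this guide or from any provider's price list.

```python
import json
import zlib


def compression_ratio(payload: bytes, level: int = 6) -> float:
    """Return original size divided by compressed size for one payload."""
    compressed = zlib.compress(payload, level)
    return len(payload) / len(compressed)


def effective_bandwidth_mbps(link_mbps: float, ratio: float) -> float:
    """Logical data rate the application sees when data shrinks by `ratio`."""
    return link_mbps * ratio


def egress_cost(raw_gb: float, ratio: float, price_per_gb: float) -> float:
    """Egress charge when only the compressed bytes leave the cloud."""
    return raw_gb / ratio * price_per_gb


# Hypothetical batch of change records; repetitive row data tends to compress well.
rows = [{"order_id": i, "status": "SHIPPED", "plant": "PLANT-01"} for i in range(10_000)]
payload = json.dumps(rows).encode("utf-8")

ratio = compression_ratio(payload)
print(f"compression ratio: {ratio:.1f}x")
print(f"effective rate on a 100 Mbit/s link: {effective_bandwidth_mbps(100, ratio):.0f} Mbit/s")
print(f"egress for 500 GB raw at $0.09/GB: ${egress_cost(500, ratio, 0.09):.2f}")
```

The ratio actually achieved depends heavily on the data, so it is worth measuring on a representative sample before relying on a figure such as 5x. Sending the records as one large compressed batch, as this sketch does, also follows the advice to prefer fewer large blocks over many small ones.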
3. Don’t assume all communication is behind a firewall

Traditionally, data transfers between systems have taken place within the confines of data centers. As organizations built out disaster recovery environments, multiple data centers were implemented with direct connectivity among them, still behind a corporate firewall.

As we adopt cloud technologies, we can no longer assume that all communication is behind a firewall managed by our network team. Of course, there are PrivateLink and Direct Connect options with cloud providers. However, many SaaS platforms simply fall outside of these.

Your organization may not be willing or able to invest in Direct Connect to hook up on-premises systems with the cloud environments you can access. As a result, a security risk may develop as communication is exposed.

The first consideration is to use encryption whenever possible. Given that your organization has no control over the end-to-end network connection, you want to look for application-level encryption. Ensure the technology you use encrypts data using TLS 1.2 or higher.

A second consideration is to lock down firewalls. SaaS platforms may reach out to pull data. Lock down the firewall to just the IP addresses the vendor uses. Or, even better, use a solution that reaches out from your (on-prem) environment into the cloud. From there, based on the stateful properties of a firewall, bi-directional communication can begin.

Finally, validate the vendor’s security certifications. Organizations send out long security surveys to determine vendor approval. Many questions receive default answers based on the deployment model, or through industry certifications.

Look for a cloud vendor/provider who has SOC 2 Type 2 and/or ISO 27001 certifications.

4. Remember everything changes

While you may appreciate the configuration your organization uses, it will change: systems wear down and must get replaced. Software solutions become incompatible with infrastructure. Your organization may decide to use different solutions. Or, in the context of hybrid cloud, you may decide to change the deployment configuration.

Most hybrid cloud deployments are mixed environments with a range of heterogeneous on-premises systems and diverse cloud services. Obviously, any data integration solution must work in the initial environment.

The environments are also likely to change over time. The cloud provides flexibility because of its technology or services and its “pay for use” pricing. It’s relatively easy to switch deployment platforms quickly, and a platform that works well today may not be tomorrow’s platform of choice. Organizations continually evaluate solutions, and available technology options often shift. An application running on a relational database today, deployed in the cloud or using a database as a service, may one day be replaced by a SaaS solution.

A best practice is to use a data integration technology that ensures compatibility across the broadest possible array of databases, file systems, applications, platforms and cloud services, including IaaS, PaaS and SaaS. Such a solution delivers a wide range of deployment choices, and offers the flexibility to make changes in a heterogeneous environment. Since the cloud provides many options to store data, the ability to quickly add destinations to an existing data integration flow is an added bonus.
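The encryption and firewall guidance in practice 3 can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not a description of any vendor's product: the function names are hypothetical, and the address ranges are reserved documentation networks standing in for the list a real vendor would publish. In practice the allowlist is usually enforced at the firewall rather than in application code.

```python
import ipaddress
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Client-side TLS context that refuses anything older than TLS 1.2."""
    context = ssl.create_default_context()  # verifies certificates and hostnames
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


# Hypothetical vendor address ranges (reserved documentation networks).
VENDOR_ALLOWLIST = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.16/28"),
]


def is_allowed(source_ip: str) -> bool:
    """True only if the caller's address falls inside an allowlisted range."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in VENDOR_ALLOWLIST)


context = make_tls_context()
print("minimum TLS version:", context.minimum_version.name)
print("203.0.113.7 allowed:", is_allowed("203.0.113.7"))
print("192.0.2.1 allowed:", is_allowed("192.0.2.1"))
```

A context built this way can be passed to standard-library clients that accept an SSL context, so a connection initiated from the on-prem side is both encrypted and outbound-only.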
CUSTOMER STORY

An industry leader in water processing technologies went through such a transition, partly driven by a spinout. Like many companies around the world, it ran SAP ECC as its core ERP system. It used SAP data in Oracle databases for forecasting of purchasing, materials and inventory to assist in improving business decisions.

The enterprise faced a number of challenges:

❶ Bulk extracts put a massive load on the source SAP transactional database.

❷ Data needed to be fresher; ETL latency was too long and only worked for historical reporting, not real-time analytics.

❸ Detecting deletes was labor-intensive and inconsistent, resulting in some bad data.

❹ Analytics on Oracle was slow because tables contained a very large number of rows and columns.

An on-premises data warehouse solution, limited to a single node, was replaced with a scalable cloud-based relational data warehouse, AWS Redshift. An additional cloud-based database-as-a-service destination was added for operational reporting. The on-premises SAP system running on Oracle was moved into a cloud-hosted environment on AWS. The data warehouse was augmented with a data lake hosted on cloud storage, replacing many of the functions the data warehouse previously performed.

One of the few technologies that did not change throughout the transitions was the data replication between sources and destinations. Fivetran’s high-volume data replication solution provided the flexibility to support different sources and destinations to meet the customer’s needs throughout the transitions.
5. Consider bi-directional data movement

You may have started your hybrid cloud journey with one or more analytical use cases. Over time you consider what to do with your operational environments. The question is not whether you will run your operational systems in the cloud. The question is: When will you decide to do so?

Any migration is daunting, especially one that affects your organization’s primary business process. What’s going to happen when all users switch to the new environment? And, if things don’t work out, what’s your fallback option?

Consider bi-directional data movement. You probably don’t need active/active replication because most applications are not prepared to run in active/active mode. However, running active/passive replication is a powerful way to mitigate data loss if a fallback is required. Instead of asking users to redo their work or re-run routines that were processed already, you replicate the data to the source.

If the migration is not successful, then you switch back. Data processing continues with minimal disruption. How long you keep the old system around is a risk assessment. Some organizations want to see initial successes to feel comfortable about not needing the old system. Others want to see at least a couple of months of successful processing before giving up the fallback option.

6. Low latency is a must-have during migration

Business requirements will determine the maximum allowable latency. Cloud-based environments are built to be available 24x7, and users (as well as customers) have become accustomed to instant access to information. These combined factors drive organizations to look for near real-time or continuous data integration solutions.

Consider the competitive differentiation you can achieve with consolidated data available for analytics closer to real-time. A solution you sell to your customers may become more valuable. Your team may become better equipped to identify fraudulent behavior. You may have opportunities to save costs simply by reacting more quickly.

During the data migration, low latency is a must. If a critical operational system does not meet expectations post-migration, you want to lose no time and resume processing on the old environment. However, it must be up to date with the latest changes.
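One way to make "up to date" measurable is to compare the newest change committed on the system users currently write to with the newest change already applied on the fallback system. The Python sketch below shows that check under stated assumptions: the function names, the timestamps and the 60-second latency budget are all hypothetical, and a real pipeline would read the two timestamps from its own replication metadata.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(last_source_commit: datetime, last_applied_on_fallback: datetime) -> timedelta:
    """How far the fallback system trails the system users currently write to."""
    return max(last_source_commit - last_applied_on_fallback, timedelta(0))


def fallback_is_current(lag: timedelta, max_allowed: timedelta) -> bool:
    """True when the lag is within the business-defined latency budget."""
    return lag <= max_allowed


# Hypothetical timestamps standing in for values from replication metadata.
newest_commit = datetime(2024, 1, 15, 12, 0, 30, tzinfo=timezone.utc)
last_applied = datetime(2024, 1, 15, 12, 0, 18, tzinfo=timezone.utc)

lag = replication_lag(newest_commit, last_applied)
budget = timedelta(seconds=60)  # assumed business requirement
print(f"lag: {lag.total_seconds():.0f}s")
print(f"safe to switch back: {fallback_is_current(lag, budget)}")
```

Tracking this figure continuously during the migration window shows whether a switch back would lose recent work.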
How to continually integrate data between on-prem and cloud for real-time analysis

As organizations migrate to the cloud, they’ll likely need to operate, at least for a time, in a hybrid environment.

Whether data arrives from a SaaS platform or directly from a database, change data capture methodology can enable near real-time updates to the analytical environment. Log-based CDC (reading changes from a database transaction log) is widely considered the least-intrusive method to retrieve database changes.

Fivetran offers CDC as a feature for most of our connectors to applications, and for all connectors to databases. After the initial sync of your historical data, Fivetran performs incremental updates of any new or modified data from your source system.

We use your database’s native transaction log during incremental syncs to request only the data that has changed since our last sync, including deletes. Each database uses a different change capture mechanism.

During incremental syncs, Fivetran maintains an internal set of progress cursors that allow us to track the exact point where our last successful sync left off. If there is an interruption in your service (such as your destination going down), we automatically resume syncing where we left off, even hours or days later, as long as log data is still present. You can also track deletions to view your archived records.

The hybrid cloud is a reality for many companies undergoing digital transformation.

If you want to learn more about our approach to cloud data integration, sign up for a 14-day free trial and test our system for yourself.
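The progress-cursor idea described in this section can be illustrated with a small, generic Python sketch. It is not Fivetran's implementation: the Change and Destination classes, the integer log positions and the in-memory log are all hypothetical, and the sketch assumes the log still holds every change past the cursor.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Change:
    position: int                 # increasing position in the (hypothetical) log
    operation: str                # "insert", "update" or "delete"
    key: str
    value: Optional[str] = None


class Destination:
    """In-memory stand-in for an analytical destination."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}
        self.deleted_keys: List[str] = []
        self.cursor = 0           # position of the last change applied successfully
        self.available = True

    def apply(self, change: Change) -> None:
        if not self.available:
            raise ConnectionError("destination is down")
        if change.operation == "delete":
            self.rows.pop(change.key, None)
            self.deleted_keys.append(change.key)  # keep a record of deletes
        else:
            self.rows[change.key] = change.value or ""
        self.cursor = change.position  # advance only after a successful apply


def sync(log: List[Change], destination: Destination) -> int:
    """Apply every change past the cursor and return how many were applied."""
    applied = 0
    for change in log:
        if change.position <= destination.cursor:
            continue              # already delivered in an earlier sync
        try:
            destination.apply(change)
        except ConnectionError:
            break                 # stop; the cursor still marks the last success
        applied += 1
    return applied


log = [
    Change(1, "insert", "A", "draft"),
    Change(2, "update", "A", "final"),
    Change(3, "insert", "B", "draft"),
    Change(4, "delete", "B"),
]
dest = Destination()
first = sync(log[:2], dest)       # first incremental sync
dest.available = False
interrupted = sync(log, dest)     # destination down: nothing applied
dest.available = True
resumed = sync(log, dest)         # picks up after the cursor
print(first, interrupted, resumed, dest.cursor, dest.rows, dest.deleted_keys)
```

Because the cursor only advances after a change is applied, an interrupted sync neither skips nor repeats changes when it resumes, and the recorded deletes remain available for inspection.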
Fivetran is the global leader in modern data integration. Our mission is to make access
to data as simple and reliable as electricity. Built for the cloud, Fivetran enables data
teams to effortlessly centralize and transform data from hundreds of SaaS and on-prem
data sources into high-performance cloud destinations. Fast-moving startups to the
world’s largest companies use Fivetran to accelerate modern analytics and operational
efficiency, fueling data-driven business growth. For more info, visit Fivetran.com.