Simplifying Data Engineering Databricks

Data engineering

Uploaded by

Huy Hóm Hỉnh

Simplifying Data Engineering to Accelerate Innovation

The Rise of Data Engineering
With the continued growth in data generated and captured
by companies across industries, the market for big data
analytics capabilities is becoming more mainstream.
Further amplifying this trend is the rapid ascension of
modern technologies designed to help organizations
harness, manage, and ultimately derive value from
this data.

As a result, data engineering has quickly become one
of the fastest growing functions within data-driven
organizations. As companies set their sights on making
data-driven decisions or automating business processes
with intelligent algorithms, mastering data engineering is
an essential step.

Primary Data Engineering Challenges
SILOED DATA
Data often exists in disparate silos, making it difficult to access and ETL the data into a format that the data science and analyst teams can leverage. The result is an inability to extract holistic insights, which can lead to inaccurate machine learning models and misinformed decisions.

COMPLEX INFRASTRUCTURE
A primary element of any big data project involves having to build and operate the supporting data infrastructure to operationalize your deployment. Business-critical data pipelines that go down, whether ETL or feature engineering, have the potential to cost millions of dollars in lost revenue.

PERFORMANCE AT ANY SCALE
As the volume and variety of data grow to support more sophisticated use cases, the data infrastructure must be able to adapt to workload changes and handle fast-growing volumes of both structured and unstructured data, to ensure efficient compute usage and infrastructure costs at any scale.

DEMANDS OF CONTINUOUS APPLICATIONS
As organizations collect massive amounts of data on a continual basis, the ability to extract actionable insights in a timely fashion becomes critical to success. Developers must be armed with the means to perform complex mission-critical data cleansing, transformations, and manipulations on data from various sources and complex formats, all while ensuring fault tolerance across the entire pipeline.

Keys to Better Data Engineering
There are three primary keys data engineers require to ensure they can effectively support their data science colleagues and the overall business.

First, the need for the system to be production ready is important in terms of stability and security. It’s critical to build a data pipeline that is reliable and secure. Data engineering teams must be able to not only prevent outages through troubleshooting, but also ensure the necessary data protection to meet security and compliance standards.

Second, the ability to process big data at breakneck speeds can help
a business innovate and drive favorable business outcomes faster.
Optimizing the various steps of data engineering is essential to improve
process efficiency that leads to faster delivery of impactful business
outcomes.

Last, the ability to easily integrate with existing infrastructure from data
stores like MongoDB to workflow management tools like Airflow can
reduce complexity and speed the process from ingest to production. This
greatly reduces the burden on DevOps, allowing data engineering teams to
focus on higher valued activities that support the business’ focus on driving
innovation.

The Fastest Data Processing Engine Around
Apache Spark™ is an open source data processing engine built for
speed, ease of use, and sophisticated analytics. Since its release,
Spark has seen rapid adoption by enterprises across a wide range of
industries. Internet powerhouses such as Facebook, Netflix, Yahoo,
Baidu, and eBay have eagerly deployed Spark at massive scale.

As a general purpose compute engine designed for distributed processing, Spark is used for many types of data processing. It supports ETL, interactive queries (SQL), advanced analytics (e.g. machine learning) and structured streaming over large datasets. For loading and storing data, Spark integrates with many storage systems (e.g. HDFS, Cassandra, MySQL, HBase, MongoDB, S3). Spark is also pluggable, with dozens of applications, data sources, and environments, forming an extensible open source ecosystem.

Additionally, Spark supports a variety of popular development languages including R, Java, Python and Scala.

[Figure: Spark Core API supporting ETL, SQL Analytics, Machine Learning, and Streaming workloads, with language support for R, SQL, Python, Scala, and Java]

How Enterprises Deploy Apache Spark
Facebook Chose Spark for Performance and Flexibility
Facebook recently transitioned off of Hive to Spark for large-scale language model training.

• Trained a large language model on 15x more data vs. Hive
• Spark was 2.5x faster than Hive

Read the Facebook blog to learn more.

Netflix uses Apache Spark for real-time stream processing to provide online recommendations to its customers.

Alibaba Taobao, one of the world’s largest e-commerce platforms, runs some of the largest Apache Spark jobs in the world, analyzing hundreds of petabytes of data on its e-commerce platform.

eBay uses Apache Spark to provide targeted offers, enhance customer experience, and optimize overall performance.

Databricks for Data Engineering:
The best place to run Apache Spark
Founded by the team that created Apache Spark,
Databricks’ Unified Analytics Platform accelerates
innovation across data science, data engineering, and
the business.

Databricks provides a fully managed, scalable, and secure cloud infrastructure that helps data engineers build and run faster and more reliable production-ready data pipelines, reducing operational complexity and total cost of ownership.

[Figure: Databricks’ Unified Analytics Platform]

Alleviate Infrastructure Complexity Headaches

Infrastructure teams can stop fighting complexity and start focusing on customer-facing applications by getting out of the business of maintaining big data infrastructure. Databricks’ serverless, fully-managed, and highly elastic cloud service completely abstracts the infrastructure complexity and the need for specialized expertise to set up and configure your data infrastructure.

Databricks offers ultimate flexibility by supporting all versions of Spark and the ability to run different Spark clusters to meet your workload needs, ensuring your data engineering workloads don’t get in the way of interactive queries run by your data science colleagues.

When outages and performance degradations occur, data engineers can easily monitor the health of Spark jobs and debug issues with easily accessible end-to-end logs in AWS S3 via the Spark UI. And because Databricks has the industry’s leading Spark experts, the service is fine-tuned to ensure speed and reliability at scale.

• SUPPORTS ALL SPARK VERSIONS
• MULTIPLE SPARK CLUSTERS TO SUIT WORKLOADS
• EASILY ACCESSIBLE END-TO-END LOGS
• FINELY-TUNED SPARK FOR SPEED AND RELIABILITY

Faster Performance:
Databricks Runtime Powered by Spark
For data engineers, it’s critical to process data as quickly as possible, no matter the scale. Apache Spark is the proven processing engine, faster than any other big data processing technology available.

Databricks has taken Spark performance to another level through Databricks Runtime.

Databricks Runtime is built on top of Spark and natively built for the cloud. Through various optimizations at the I/O layer and processing layer (Databricks I/O), we’ve made Spark faster and more performant. Recent benchmarks clock Databricks at 5x faster than vanilla Spark on AWS. Our Spark expertise is a huge differentiator in ensuring superior performance and very high reliability.

These value-added capabilities will increase your performance and reduce your TCO for managing Spark.

In fact, in a recent performance comparison using the TPC-DS industry standard benchmark, Databricks outperformed other leading big data SQL platforms, demonstrating superior performance across the board.

5x FASTER THAN REGULAR SPARK

FASTER DATA PROCESSING + THE CLOUD = LOWER COMPUTE AND STORAGE COSTS

Keep Data Safe and Secure
They say all press is good press, but a headline stating the company has lost valuable data is never good press. When a breach happens, the enterprise grinds to a halt, and innovation and time-to-market go out the window. Databricks takes security very seriously. By providing a common user interface and an integrated technology set, it protects data through a unified security model with fine-grained access controls across the entire stack (such as data, clusters, and jobs), and it automatically encrypts and scales local storage.

• SOC2 TYPE 2 & HIPAA COMPLIANT
• AUTOMATIC ENCRYPTION
• END-TO-END AUDITING
• SINGLE SIGN-ON
• IDENTITY ACCESS MANAGEMENT

Lower Costs

Databricks’ performance-tuned Apache Spark clusters allow you to complete jobs in a shorter time, reducing cloud compute costs.

The fully-managed Databricks Spark clusters enable you to further reduce costs by avoiding time-consuming tasks to build, configure, and maintain complex Spark infrastructure.

In addition to being able to use spot instances, Databricks clusters can also automatically scale to dynamic workloads. Moreover, Databricks bills your usage at the minute-level, ensuring you only pay for the resources you use.
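
To make the billing point concrete, here is a minimal Python sketch comparing minute-level billing with a coarser hourly granularity. The node count, hourly rate, and runtime are hypothetical values chosen for illustration, not Databricks pricing, and `billed_cost` is an illustrative helper rather than part of any Databricks API.

```python
import math


def billed_cost(runtime_minutes, nodes, rate_per_node_hour, granularity_minutes=1):
    """Return the cost of a job when usage is rounded up to the billing granularity.

    All inputs are illustrative assumptions, not real prices.
    """
    # Round the runtime up to the next whole billing increment.
    increments = math.ceil(runtime_minutes / granularity_minutes)
    billed_minutes = increments * granularity_minutes
    return nodes * rate_per_node_hour * billed_minutes / 60


# Hypothetical job: 10 nodes at $0.50 per node-hour, running for 95 minutes.
per_minute = billed_cost(95, nodes=10, rate_per_node_hour=0.50, granularity_minutes=1)
per_hour = billed_cost(95, nodes=10, rate_per_node_hour=0.50, granularity_minutes=60)

print(f"minute-level billing: ${per_minute:.2f}")
print(f"hourly billing:       ${per_hour:.2f}")
```

Under these assumed numbers the job costs about $7.92 with minute-level billing and $10.00 with hourly rounding. The gap disappears when a job's runtime is an exact number of hours, and it is largest for short jobs that barely spill into a new hour.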

5 Customer Case Studies:
Productionizing Data Pipelines Effortlessly
Many of our customers faced the aforementioned
challenges when it came to data engineering tasks
that impacted process efficiency and slowed the
ability for the business and data science teams to
glean insights from all the data.

The following case studies highlight how some of our customers, across all verticals, have leveraged Databricks to simplify data engineering and accelerate their ability to build reliable and highly performant data pipelines, allowing the business to leverage data-driven insights to fuel innovation and reduce overall costs.

Case Study: Advertising Technology
Eyeview is a video advertising technology company that provides brands with a higher return-on-investment on their video advertising spend.

USE CASE
Data is an integral part of Eyeview’s platform, enabling the planning and optimization of video advertising campaigns for Eyeview’s customers.

Eyeview extracts consumer knowledge and business intelligence data from first and third party sources to create thousands of ad permutations for different audiences, personalizing the ads based on factors such as location, shopping habits, and browsing history.

Due to the scale and complexity of the processing necessary to achieve this level of personalization, Eyeview needed to ensure that the technology foundation of its platform was capable of efficiently scaling to support massive volumes of data and incorporating predictive analytics through high-performing machine learning models.

CHALLENGE
Eyeview’s legacy data platform struggled to scale their infrastructure to meet business growth because:
• Surging data volumes caused ETL jobs and query performance to slow down beyond acceptable performance requirements.
• Cost and labor resources to operationalize the infrastructure in support of increased demand became prohibitive.
• Lack of native support for machine learning critical for competitive differentiation.

DATABRICKS SOLUTION
Eyeview selected the Databricks Unified Analytics Platform for just-in-time data warehousing and to deploy machine learning models into production.
• Simplify provisioning of Spark clusters to automatically scale based on usage.
• Further reduce infrastructure costs through the use of auto-scaling and spot instances.
• Scale its compute and storage resources independently, providing high performance at a much lower cost.
• Effortlessly perform real-time ad hoc analysis and implement machine learning models.

BENEFITS
• Reduced query times on large data sets by a factor of 10, allowing data analysts to regain 20 percent of their workday from waiting for results.
• Sped up data processing fourfold without incurring additional operational costs.
• Doubled the pace of product feature development, from prototyping to deployment, by increasing the productivity of the engineering team with faster and easier management of Apache Spark clusters.

“Databricks is our go-to-system for anything requiring deep data processing and analysis. In just a short amount of time, we have been able to increase our data processing speeds by a factor of four without any added operational costs.”
- Gal Barnea, CTO, Eyeview
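
As a rough back-of-the-envelope check on the query-time benefit, the sketch below estimates how much of the workday analysts must previously have spent waiting for the two quoted figures to be consistent. It assumes, purely for illustration, that all of the regained time comes from shorter query waits and that every wait shrinks by the same factor; the case study does not state this, and `prior_wait_share` is a hypothetical helper.

```python
from fractions import Fraction


def prior_wait_share(regained_share, speedup):
    """Share of the workday previously spent waiting on queries.

    Assumes waits shrink uniformly by `speedup`, so the time saved is
    wait * (1 - 1/speedup). Solving for wait gives the expression below.
    """
    return regained_share / (1 - Fraction(1, speedup))


# Figures quoted in the case study: 10x faster queries, 20 percent of the day regained.
wait_before = prior_wait_share(Fraction(20, 100), 10)
wait_after = wait_before / 10

print(f"waiting before: {float(wait_before):.1%}")
print(f"waiting after:  {float(wait_after):.1%}")
```

Under those assumptions, analysts would have been waiting for roughly 22 percent of the day before the change and about 2 percent after it.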

Case Study: Travel and Hospitality
HomeAway allows travelers to search for vacation rentals in desired destinations. To facilitate a match between traveler and vacation rental, HomeAway must show search results that are relevant to the traveler’s specific interests.

USE CASE
HomeAway, a subsidiary of Expedia, is one of the world’s leading online marketplaces for the vacation rental industry, with websites representing over one million paid listings of vacation rental homes in 190 countries. Travelers use their websites and mobile applications to search for vacation rentals in desired destinations.

To achieve ideal results that enhance the user experience and drive conversions, HomeAway leverages machine learning to first comb through various data to deliver accurate search results, and then uses context classification techniques to associate the right images based on the search term.

CHALLENGE
Dealing with large volumes of structured and unstructured data, HomeAway spent too much time on DevOps work building and maintaining infrastructure with open source Apache Spark and Zeppelin notebooks.

DATABRICKS SOLUTION
HomeAway replaced its homegrown environment with Databricks to simplify the management of their Spark infrastructure through its native access to S3, interactive notebooks, and cluster management capabilities.

BENEFITS
• Reduced query time of over one million documents from over one week to 24 hours.
• Reduced over-reliance on DevOps team, increasing data science productivity by 4x.
• Automated the execution of microservices via Databricks’ REST APIs.

“Databricks takes the pain of cluster management away so we can focus on the data and not DevOps.”
- Brent Schneeman, Principal Data Scientist, HomeAway

Case Study: Health and Fitness / IoT
MyFitnessPal aims to build the largest health and fitness community online by helping people achieve healthier lifestyles through better diet and increased exercise.

USE CASE
MyFitnessPal, part of Under Armour, aims to build the largest health and fitness community online by helping people achieve healthier lifestyles through better diet and increased exercise.

One of the most critical data products used by the MyFitnessPal community is the food database which helps people to quickly find and log everything they eat. To support this product, they created a new feature called “Verified Foods”.

CHALLENGE
• The development of new features within their application demanded a faster data pipeline to process streams of unstructured data and to execute a number of highly sophisticated machine learning algorithms.
• Their legacy non-distributed Java-based data pipeline was slow, did not scale, and lacked flexibility.

DATABRICKS SOLUTION
MyFitnessPal chose Databricks to harness the power of Apache Spark and to build the data pipeline for “Verified Foods” to successfully deliver the feature to their users while gaining many additional benefits.

BENEFITS
• Ten-fold speed improvement over previous data pipeline implementation.
• Four times more projects completed in the past quarter resulting from an increase in team productivity.
• Improved team efficiency achieved through accessible advanced analytics and better code re-use.

“Databricks helped us deliver a new feature to market while improving the performance of the data pipeline ten-fold. We would not have been able to fully harness the power of Apache Spark to deliver the feature to market without Databricks.”
- Chul Lee, Director of Data Engineering & Science, MyFitnessPal

Case Study: Advertising Technology
Sharethrough builds software for delivering ads into the natural flow of content sites and apps (also known as native advertising).

USE CASE
Since Sharethrough serves ads on some of the most popular digital properties such as Forbes and People, the need for a high-performance big data scale processing platform permeates every aspect of their business.

Sharethrough offers a robust advertising platform; discovering hidden patterns in data is critical to measuring the effectiveness of their products and in making improvements to the overall product suite.

CHALLENGE
• Initial attempt to establish a self-hosted Apache Hadoop cluster with Apache Hive as the ad hoc query tool required two full-time engineers to manage the infrastructure.
• Their homegrown system was also not an effective interactive query platform, creating additional demands on data engineering to build and maintain a high performing data pipeline.

DATABRICKS SOLUTION
Databricks offered significant data engineering benefits for Sharethrough, including:
• faster prototyping of new applications
• easier debugging of complex pipelines
• improved engineering productivity

BENEFITS
• Faster prototyping of new applications.
• Easier debugging of complex pipelines.
• Improved overall engineering team productivity by freeing two full time engineers from infrastructure work.

“Thanks to Databricks, our engineers have gone from being burdened with operations, to having the ability to easily dive right into analytics. As a result, our team is more productive and collaborative with big data than ever.”
- Robert Slifka, Vice President of Engineering, Sharethrough

Case Study: Enterprise Software
Yesware enables sales teams to be more effective by providing detailed analytics about their daily interactions with potential customers.

USE CASE
As sales organizations continue to rely more and more on the data-driven decision-making approach to improving their sales forecasting and ability to close deals, they are also demanding higher-accuracy data during their decision-making process.

Yesware enhances a sales team’s sales cycle by integrating with the team’s e-mail application to track key metrics. Important data such as the open rate of e-mail templates, the download rate of attached collateral, and CTA click-through rates are analyzed to generate custom reports for the entire team, allowing sales teams to connect with prospects more effectively, more easily track customer engagement, and close more deals.

CHALLENGE
• Yesware needed a high-performance production data pipeline to power its main product, which provides customized intelligence to improve the performance of sales teams.
• The data pipeline built with Apache Pig was too slow, difficult to maintain, and not scalable enough for Yesware’s needs.

DATABRICKS SOLUTION
Databricks provided an easier to deploy, faster, more reliable, and more efficient data pipeline, enabling Yesware to gain time improvements for deployment, processing speed, and infrastructure efficiency.

Yesware took advantage of features including:
• Spark cluster manager simplified the provisioning of highly optimized Spark clusters with simple clicks, without DevOps.
• Interactive workspace enabled Yesware to prototype Scala code in small quick iterations to get the logic right, and then migrate over to a production JAR.
• Job scheduler allowed Yesware to instantly deploy code and automatically monitor the execution of production data pipelines.
• Integrations with a wide variety of data stores made it possible to set up a simple but powerful data pipeline with AWS S3 and Postgres.

BENEFITS
• Reduced time to deploy production data pipeline from six months to three weeks.
• Substantially sped up compute time, processing twice the amount of data in one-sixth the amount of time.
• Improved efficiency of data processing infrastructure by reducing Amazon Web Services (AWS) costs by 90%.

“Databricks proved to be the easiest way to deploy Apache Spark for Yesware, reducing the time to deploy a production pipeline from six months to three weeks while enabling us to shorten the time to prototype new product features from days to mere hours.”
- Justin Mills, Data Team Lead, Yesware
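
The compute-time benefit for Yesware can be restated as a throughput ratio. The short sketch below does the arithmetic; `throughput_gain` is an illustrative helper, and treating the quoted figures as exact multipliers is an assumption.

```python
from fractions import Fraction


def throughput_gain(data_multiplier, time_multiplier):
    """Ratio of new to old throughput, where throughput = data / time."""
    return Fraction(data_multiplier) / Fraction(time_multiplier)


# Quoted figures: twice the data processed in one-sixth of the time.
gain = throughput_gain(2, Fraction(1, 6))
print(f"throughput gain: {gain}x")
```

Read this way, processing twice the data in one-sixth of the time corresponds to a twelvefold increase in data processed per unit of time.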

Data Engineering, Simplified.
Databricks’ Unified Analytics Platform removes the
complexity of data engineering while accelerating
performance of data engineering tasks from data access
to ETL to production, allowing engineers to build fast and
reliable data pipelines more easily to support the business.

Get started with Databricks for data engineering today.

START YOUR FREE TRIAL

© Databricks 2017. All rights reserved. Apache, Apache Spark, Spark and the Spark logo
are trademarks of the Apache Software Foundation.