Databricks, an Introduction
Chuck Connell, Insight Digital Innovation
Insight Presentation
Speaker Bio
• Senior Data Architect at Insight Digital Innovation
• Focus on Azure big data services – HDInsight/Hadoop,
Databricks, Cosmos DB
• Related work…
• NoSQL and relational data models, transitions
• Size/volume estimates
• JSON schemas for schema-less data
• Boston office in Watertown Sq, on the river walk
Creating meaningful connections
that help businesses run smarter.
• Supply Chain Optimization: We help you invest smarter so you can manage today and transform the future.
• Connected Workforce: We create a connected workplace so employees can work smarter.
• Cloud & Data Center Transformation: We help you prepare for the future and align workloads to the right platforms.
• Digital Innovation: We help you innovate smarter so you can make meaningful connections.
Talk Outline
• Brief history -- what came before, why Databricks
• Spin up a Databricks instance, verify it
• Add some data
• SQL table operations
• DataFrame operations
• DB “connections”, getting data in and out
• Other cool things you can do with Databricks
• Caveats – what is not perfect about Databricks
• Q&A
In the Dark Ages (1960-2005)
• A long, long time ago database systems ran on a single
computer with some associated storage
• The computers and storage got bigger and faster every
year, but the basic architecture remained the same
• If you wanted answers faster, you bought a better
computer. If you wanted to store more data, you bought a
more expensive storage system
• If the speed you desired or the amount of data you had
exceeded the capacity of the best available hardware, you
were out of luck; you simply could not create such a
database system.
Hadoop
• In 2006, Doug Cutting et al. at Yahoo created Hadoop
• Unlimited horizontal scaling on cheap computers!
• Key ideas…
• HDFS, all disks became one file system
• MapReduce, a way to run parallel code on all the CPUs
• Soon there were Hadoop clusters with 100s of nodes, then
1000s
• You could do database things that were simply impossible
before!
• But there is no free lunch
• What were the main drawbacks to Hadoop?
Hadoop Drawbacks
• MapReduce is hard to program
• A new way of thinking about coding
• Does not magically parallelize algorithms for you
• Requires trial/error, tuning, multiple stages of MR
• Hadoop wrote MR intermediate results to disk
• Often many times for one job
• Much slower than a memory write/read
• Hadoop was a “batch” system, not built for interactive queries
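To make the "new way of thinking about coding" point concrete, here is a minimal plain-Python sketch of the MapReduce pattern applied to a word count. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample lines are illustrative assumptions, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster the map calls run on different nodes, and Hadoop
    # writes the intermediate pairs to disk before the shuffle.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data on cheap computers", "big clusters of cheap computers"]
counts = word_count(lines)
print(counts)
```

Even for a problem this small, the programmer has to decide what the key is, what gets emitted, and how partial results combine. Anything that does not fit one map/reduce pair needs several chained stages.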
Hive
• Introduced in 2009, it solved the "hard to program"
problem
• SQL abstraction on top of HDFS and MapReduce
• Data appears as normal relational-like tables
• Database jobs can be written in SQL
• Essentially a compiler that translates SQL into Java
MapReduce code
• Generated code is usually better than a human would write
• But still… lots of disk I/O
• Simple Hive query on small table = ~15 secs
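As a rough illustration of "database jobs can be written in SQL", the sketch below expresses a count-per-key job as a single `GROUP BY` statement. Python's built-in `sqlite3` is only a local stand-in for Hive here, and the `page_views` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", 1), ("/home", 2), ("/docs", 1), ("/home", 3)],
)

# One declarative statement. A SQL-to-MapReduce compiler such as Hive would
# turn the GROUP BY into a map step (emit url), a shuffle, and a reduce
# step (count per url).
rows = conn.execute(
    "SELECT url, COUNT(*) AS views "
    "FROM page_views GROUP BY url ORDER BY views DESC"
).fetchall()
print(rows)
conn.close()
```

The query author never sees the map and reduce stages, which is the abstraction Hive provides. The disk I/O between those stages is still there.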
Spark
• The Spark project (started at UC Berkeley in 2009, open-sourced in 2010) set out to solve the Hadoop disk I/O problem
• Goal: do as many operations as possible completely within
memory
• Spark delivered 10 – 100x speedup on fewer machines
• But alas, still no free lunch
• What are the key problems with Spark?
Spark, Issues
• Complexity
• Software installs
• Hardware clusters
• File system setup
• Performance tuning
• Security
• Clusters
• Code / Jobs
• Data
• Enter Databricks (2015)
Databricks
• Databricks is a way to use Spark more conveniently
• Databricks is Spark, but with a GUI and many automated
features
• Creation and configuration of server clusters
• Auto-scaling and shutdown of clusters
• Connections to various file systems and formats
• Programming interfaces for Python, Scala, SQL, R
• Integration with other Azure services
• Available only as a cloud service, on both AWS and Azure
• Let’s dive in…
Start Databricks, Azure
Start Databricks, Community
Begin live demo….
Make a Cluster
Cluster Running
Create/Attach a Notebook
Verify Databricks Resource
• That’s it!
• You have a fully running, auto-scaling Spark cluster
• Latest (synced) software releases, well tuned
• Friendly GUI, in your favorite language!
• Has anyone done this same operation manually for Spark?
• How long did it take?
• What are the complexities?
Data Import
• Persistent tables, stored in Databricks File System (DBFS)
• Backed by Azure Blob
• Cached in Databricks memory as needed
• We use CSV crime data from data.boston.gov, but many
possible sources…
Create Tables
Verify Tables
Standard SQL
Visualize Data
Join Tables
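The create / verify / query / join sequence from the demo can be sketched locally as follows. This is an approximation under stated assumptions: `sqlite3` stands in for Databricks SQL tables, the `load_csv` helper is hypothetical, and the CSV rows and column names are made-up samples rather than the actual data.boston.gov schema.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows; the real crime files have more columns.
INCIDENTS_CSV = """incident_id,offense_code,district
I001,3115,B2
I002,619,D4
I003,3115,B2
I004,1402,C11
"""
OFFENSE_CODES_CSV = """offense_code,name
3115,Investigate Person
619,Larceny
1402,Vandalism
"""


def load_csv(conn, table, csv_text):
    """Create a table from CSV text, using the header row as column names."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f"{name} TEXT" for name in header)
    placeholders = ", ".join("?" for _ in header)
    # Table and column names are trusted constants in this sketch.
    conn.execute(f"CREATE TABLE {table} ({columns})")
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)


conn = sqlite3.connect(":memory:")
load_csv(conn, "incidents", INCIDENTS_CSV)
load_csv(conn, "offense_codes", OFFENSE_CODES_CSV)

# Verify the table: row count.
total = conn.execute("SELECT COUNT(*) FROM incidents").fetchone()[0]

# Standard SQL: incidents per district.
by_district = conn.execute(
    "SELECT district, COUNT(*) AS n FROM incidents "
    "GROUP BY district ORDER BY n DESC, district"
).fetchall()

# Join: replace each offense code with its readable name.
by_offense = conn.execute(
    "SELECT o.name, COUNT(*) AS n FROM incidents i "
    "JOIN offense_codes o ON i.offense_code = o.offense_code "
    "GROUP BY o.name ORDER BY n DESC, o.name"
).fetchall()

print(total, by_district, by_offense)
conn.close()
```

In a Databricks notebook the same SQL statements would run in a SQL cell against tables created through the table-import GUI.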
DataFrame vs Table
• DataFrame is the default data abstraction
• Stored in Spark runtime memory
• When needed, can be persisted to DBFS (Parquet serialization is
default) or SQL table
• DataFrame operations are a superset of SQL and also include
• ETL: dropna, na.fill, explode, moving malformed records elsewhere
• Convert to GraphFrame
• Submit to ML libraries
DataFrame Example
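The ETL operations named above are easier to reason about once their row-level effect is clear. The sketch below mimics what Spark's `DataFrame.dropna`, `DataFrame.na.fill`, and `pyspark.sql.functions.explode` do, using plain Python lists of dicts. It is a semantic illustration only, not the Spark API: the helper functions and the sample rows are assumptions made for this example.

```python
# Hypothetical sample rows; "tags" holds a list, like a Spark array column.
rows = [
    {"id": 1, "district": "B2", "tags": ["theft", "auto"]},
    {"id": 2, "district": None, "tags": ["vandalism"]},
    {"id": 3, "district": "D4", "tags": []},
]


def dropna(rows, subset):
    """Keep only rows where every column in `subset` is non-null."""
    return [r for r in rows if all(r[c] is not None for c in subset)]


def fill_na(rows, defaults):
    """Replace nulls with a per-column default value."""
    return [
        {c: (defaults.get(c, v) if v is None else v) for c, v in r.items()}
        for r in rows
    ]


def explode(rows, column):
    """Emit one output row per element of the list stored in `column`.

    Rows whose list is empty produce no output, matching Spark's explode.
    """
    return [{**r, column: item} for r in rows for item in r[column]]


clean = dropna(rows, ["district"])
filled = fill_na(rows, {"district": "UNKNOWN"})
exploded = explode(rows, "tags")
print([r["id"] for r in clean])
print([r["district"] for r in filled])
print([(r["id"], r["tags"]) for r in exploded])
```

In Spark the same steps are method calls on a DataFrame and are executed in parallel across the cluster.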
Import Library / GraphFrames
• Databricks Home / Import Library / Maven / Search / Spark
Packages / graph
• graphframes:graphframes:0.6.0-spark2.3-s_2.11
Databricks Connections
• Getting data in
• CSV, JSON, Parquet, LZO, Zip, Avro
• Hive tables
• Azure Blob or Data Lake as DBFS directory
• Any RDBMS with JDBC
• Azure Data Factory, which has many source connectors
• Getting data out
• Write to many file formats
• JDBC and ODBC for programmatic reads by external clients
• REST API
• Clusters, DBFS, jobs, libraries, workspaces…
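For the REST API bullet, a minimal sketch of an authenticated call to list clusters might look like this. The workspace URL and token are hypothetical placeholders, the helper names are invented for this example, and the response field names (`clusters`, `cluster_name`) are taken from the REST API documentation as best understood; verify them against the docs for your workspace. The script only builds the request, and the network call is left commented out.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute your own workspace URL and token.
WORKSPACE_URL = "https://example-workspace.azuredatabricks.net"
TOKEN = "dapi-REPLACE-ME"


def build_request(path, token=TOKEN, base_url=WORKSPACE_URL):
    """Build (but do not send) an authenticated GET request for a REST path."""
    return urllib.request.Request(
        url=f"{base_url}{path}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def list_cluster_names(request):
    """Send the request and return the cluster names from the JSON reply."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [c["cluster_name"] for c in payload.get("clusters", [])]


request = build_request("/api/2.0/clusters/list")
print(request.get_method(), request.full_url)
# names = list_cluster_names(request)  # enable once a real URL and token are set
```

The same pattern (base URL, bearer token, JSON body) applies to the other API groups listed above.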
Databricks Goodies
• Databricks Delta, ACID compliant transactions
• Security integration with Azure Active Directory
• See my article on LinkedIn for details
• GraphFrames
• A library of routines for creating and calculating node/edge data
structures
• Ex: shortest path, PageRank
• Machine Learning
• A library and workflow for many common ML techniques
• Support for many third-party ML libs – H2O, scikit-learn, DataRobot,
XGBoost
• R language
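To show what the GraphFrames bullet means by node/edge data structures, here is a small plain-Python sketch of a shortest-path calculation over a vertex list and an edge list. GraphFrames builds its graphs from a vertex DataFrame and an edge DataFrame with `src` and `dst` columns and runs such algorithms across the cluster. The toy graph and the `shortest_path_lengths` helper below are assumptions for illustration, not the GraphFrames API.

```python
from collections import deque

# Hypothetical toy graph: a vertex list plus directed (src, dst) edges.
vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")]


def shortest_path_lengths(vertices, edges, source):
    """Breadth-first search: hop count from `source` to each reachable vertex."""
    neighbors = {v: [] for v in vertices}
    for src, dst in edges:
        neighbors[src].append(dst)  # edges are treated as directed
    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    return distance


dist = shortest_path_lengths(vertices, edges, "a")
print(dist)
```

On a graph with billions of edges the adjacency map no longer fits on one machine, which is where a distributed library becomes worthwhile.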
Caveats
• No option for local install, so no “hybrid cloud” option
• Databricks not in Azure Stack (afaik)
• Spark/Databricks relatively slow for small data sets
• Key-value stores (Redis, Couchbase) have <1ms response
• RDBMS have few ms response for tuned SQL queries
• Fastest Spark query is ~400ms
• Interesting tradeoffs for specific use-cases (1M vs 1T rows)
• Overall "fit and finish" within Azure
• Control of allocation within resource groups
• Programmatic creation of base Azure Databricks resource for
DevOps CI/CD.
Next Steps
• https://databricks.com/spark/comparing-databricks-to-apache-spark
(Databricks vs Spark)
• http://community.cloud.databricks.com (Community edition)
• https://azure.microsoft.com/en-us/services/databricks (Azure
Databricks)
• https://academy.databricks.com (Databricks training)
• https://docs.databricks.com/spark/latest/mllib/index.html (Machine
learning)
• https://docs.databricks.com/spark/latest/graph-analysis/graphframes/index.html (GraphFrames)
• https://docs.databricks.com/api/latest/index.html (REST API)
Questions / Discussion…. ??
Thank You