Databricks 2

This document summarizes a presentation about Azure Databricks given by Eugene Polonichko. The presentation covers: 1. What is Azure Databricks - it is an Apache Spark-based analytics platform optimized for Microsoft Azure that provides one-click setup and an interactive workspace for collaboration. 2. The components of Azure Databricks including clusters, workspaces, notebooks, visualizations, jobs, alerts, and the Databricks file system. 3. How Azure Databricks can benefit data engineers with scenarios and pricing information.


2018 Ukraine

Azure Databricks for Data Engineering
Eugene Polonichko
Senior Software Developer at Eleks,
Data Platform MVP
https://www.linkedin.com/in/eugenepolonichko/
About me
Eugene Polonichko has over 7 years of experience
with SQL Server. He has mainly focused on BI projects
(SSAS, SSIS, Power BI, Cognos, Informatica
PowerCenter, Pentaho, Tableau). Eugene is a
passionate speaker and SQL community volunteer,
presenting regularly at PASS SQL Saturday events
and local user groups around Ukraine and Europe.
Eugene is a PASS Chapter Leader and holds the
Data Platform MVP status.
https://www.linkedin.com/in/eugenepolonichko/
https://twitter.com/EvgenPolonichko
Agenda
1. What is Azure Databricks?
• Azure Databricks
• Apache Spark
• Components of Apache Spark
• Architecture of Azure Databricks
• Azure integration
2. Azure Databricks
• Cluster
• Workspace
• Notebooks
• Visualizations
• Jobs and Alerts
• Databricks File System
• Business Intelligence Tools
3. For data engineer
• Scenario
• Prices
What is Azure Databricks?
Azure Databricks
Azure Databricks is an Apache Spark-
based analytics platform optimized for
the Microsoft Azure cloud services
platform. Designed with the founders of
Apache Spark, Databricks is integrated
with Azure to provide one-click setup,
streamlined workflows, and an interactive
workspace that enables collaboration
between data scientists, data engineers,
and business analysts.
Apache Spark-based analytics platform
Azure Databricks comprises the complete open-source Apache Spark cluster technologies and capabilities.
Spark in Azure Databricks includes the following components:
• Spark SQL and DataFrames: Spark SQL is the Spark module for working with
structured data
• Streaming: Real-time data processing and analysis for analytical and
interactive applications. Integrates with HDFS, Flume, and Kafka.
• MLlib: Machine Learning library consisting of common learning algorithms
and utilities, including classification, regression, clustering, collaborative
filtering, dimensionality reduction, as well as underlying optimization
primitives.
• GraphX: Graphs and graph computation for a broad scope of use cases
from cognitive analytics to data exploration.
• Spark Core API: Includes support for R, SQL, Python, Scala, and Java.
Architecture of Azure Databricks
Total Azure integration
• Diversity of VM types
• Security and Privacy
• Flexibility in network topology
• Azure Storage and Azure Data Lake integration
• Azure Power BI
• Azure Active Directory
• Azure SQL Data Warehouse, Azure SQL DB, and Azure Cosmos DB
Azure Databricks
Clusters
Azure Databricks clusters provide a unified platform for various use cases such as running production ETL
pipelines, streaming analytics, ad-hoc analytics, and machine learning.
Cluster types:
• Job
• Interactive
Workspace
The Workspace is the special root folder for all of
your organization’s Azure Databricks assets.
The Workspace stores:
• notebooks
• libraries
• dashboards
• folders
Notebooks
A notebook is a web-based interface to a document that
contains runnable code, visualizations, and narrative text.
• Create a notebook
• Delete a notebook
• Control access to a notebook
• Notebook external formats
• Notebooks and clusters
• Schedule a notebook
• Distributing notebooks
Visualizations
Databricks supports a number of visualizations out of the box.
All notebooks, regardless of their language, support
Databricks visualization using the display function:
display(<dataframe-name>)
Jobs and Alerts
A job is a way of running a notebook or JAR either immediately
or on a scheduled basis.
The number of jobs is limited to 1000.
Alerts
You can set up email alerts for job runs. You can send alerts
on job start, job success, and job failure (including skipped
jobs), providing multiple comma-separated email addresses for
each alert type. You can also opt out of alerts for skipped
job runs.
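A scheduled notebook job with email alerts could be described as a settings payload along the following lines. This is only a sketch: the field names (`notebook_task`, `schedule`, `email_notifications`, `no_alert_for_skipped_runs`) are assumed from the Databricks Jobs REST API and should be checked against the API reference for your workspace. The notebook path, cluster ID, and email addresses are hypothetical and do not come from the presentation.

```python
import json

# Hypothetical values for illustration only; none come from the slides.
NOTEBOOK_PATH = "/Users/someone@example.com/nightly-etl"
CLUSTER_ID = "0000-000000-example"
ALERT_RECIPIENTS = "ops@example.com, data-team@example.com"


def parse_recipients(comma_separated):
    """Split a comma-separated address string into a clean list."""
    return [addr.strip() for addr in comma_separated.split(",") if addr.strip()]


def build_job_settings(name, notebook_path, cluster_id, cron, recipients,
                       alert_on_skipped=True):
    """Assemble a job definition with a schedule and email alerts.

    Field names are an assumption based on the Databricks Jobs REST API.
    """
    return {
        "name": name,
        "existing_cluster_id": cluster_id,
        "notebook_task": {"notebook_path": notebook_path},
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": "UTC",
        },
        "email_notifications": {
            # One list of addresses per alert type.
            "on_start": recipients,
            "on_success": recipients,
            "on_failure": recipients,
            # Opting out of alerts for skipped runs.
            "no_alert_for_skipped_runs": not alert_on_skipped,
        },
    }


settings = build_job_settings(
    name="nightly-etl",
    notebook_path=NOTEBOOK_PATH,
    cluster_id=CLUSTER_ID,
    cron="0 0 2 * * ?",  # Quartz syntax: every day at 02:00
    recipients=parse_recipients(ALERT_RECIPIENTS),
    alert_on_skipped=False,
)
print(json.dumps(settings, indent=2))
```

The payload is built as a plain dictionary so it can be inspected or version-controlled before it is submitted through whichever client you use.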
Databricks File System
Databricks File System (DBFS) is a distributed file system installed on
Databricks Runtime clusters. Files in DBFS persist to Azure Blob storage.
You can access files in DBFS using the Databricks CLI, DBFS API, Databricks
Utilities, Spark APIs, and local file APIs.
Databricks CLI
    # List files in DBFS
    dbfs ls
    # Put local file ./apple.txt to dbfs:/apple.txt
    dbfs cp ./apple.txt dbfs:/apple.txt
    # Get dbfs:/apple.txt and save to local file ./apple.txt
    dbfs cp dbfs:/apple.txt ./apple.txt
    # Recursively put local dir ./banana to dbfs:/banana
    dbfs cp -r ./banana dbfs:/banana
Python
    # Write a file to DBFS using Python I/O APIs
    with open("/dbfs/tmp/test_dbfs.txt", "w") as f:
        f.write("Apache Spark is awesome!\n")
        f.write("End of example!")
    # Read the file
    with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
        for line in f_read:
            print(line)
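The examples above address the same storage in two ways: the CLI uses `dbfs:/` URIs, while the Python file APIs use paths under `/dbfs`. The sketch below converts between the two forms. It assumes that the `/dbfs` prefix seen in the Python example is where the cluster exposes DBFS to local file APIs. The helper names `dbfs_to_local` and `local_to_dbfs` are hypothetical and not part of any Databricks library.

```python
from pathlib import PurePosixPath

DBFS_SCHEME = "dbfs:"
LOCAL_MOUNT = "/dbfs"  # assumed local mount point, as in the Python example


def dbfs_to_local(uri):
    """Map a dbfs:/ URI (CLI form) to the path used by local file APIs."""
    if not uri.startswith(DBFS_SCHEME + "/"):
        raise ValueError("expected a dbfs:/ URI, got %r" % uri)
    relative = uri[len(DBFS_SCHEME):].lstrip("/")
    return str(PurePosixPath(LOCAL_MOUNT) / relative)


def local_to_dbfs(path):
    """Map a /dbfs/... local path back to its dbfs:/ URI."""
    parts = PurePosixPath(path).parts
    if parts[:2] != ("/", "dbfs"):
        raise ValueError("expected a path under /dbfs, got %r" % path)
    return DBFS_SCHEME + "/" + "/".join(parts[2:])


print(dbfs_to_local("dbfs:/apple.txt"))
print(local_to_dbfs("/dbfs/tmp/test_dbfs.txt"))
```

The helpers only manipulate strings, so they can be tried on any machine. Reading or writing the resulting `/dbfs/...` paths still requires running on a Databricks cluster.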
Business Intelligence Tools
Business Intelligence (BI) tools can
connect to Azure Databricks clusters
to query data in tables. Every Azure
Databricks cluster runs a
JDBC/ODBC server on the driver
node. This section provides general
instructions for connecting BI tools
to Azure Databricks clusters, along
with specific instructions for
popular BI tools.
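A BI tool usually needs a connection string that points at the cluster's JDBC/ODBC server. The sketch below assembles one. The URL layout (`jdbc:spark://`, `transportMode`, `httpPath`, `AuthMech=3`, `UID=token`) is an assumption based on the Simba Spark JDBC driver conventions that Databricks has documented, so verify it against the connection details shown for your cluster. The host name, HTTP path, and token are hypothetical placeholders.

```python
def build_jdbc_url(host, http_path, token, port=443, database="default"):
    """Assemble a JDBC URL for a cluster's JDBC/ODBC endpoint.

    The parameter names are an assumption, not taken from the slides.
    """
    params = [
        ("transportMode", "http"),
        ("ssl", "1"),
        ("httpPath", http_path),
        ("AuthMech", "3"),   # user name / password authentication
        ("UID", "token"),    # literal user name for token-based login
        ("PWD", token),      # personal access token
    ]
    tail = ";".join("%s=%s" % pair for pair in params)
    return "jdbc:spark://%s:%d/%s;%s" % (host, port, database, tail)


# Hypothetical placeholders; substitute the values shown for your cluster.
url = build_jdbc_url(
    host="example-region.azuredatabricks.net",
    http_path="<http-path-from-cluster-settings>",
    token="<personal-access-token>",
)
print(url)
```

In practice the token would come from a secret store rather than source code, since the resulting URL carries the credential in plain text.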
For Data Engineers
Scenario
Thank you