Big Data Analytics in the Cloud with Microsoft Azure
www.globalbigdataconference.com
Twitter : @bigdataconf
Big Data Analytics in the
Cloud
Microsoft Azure
Cortana Intelligence Suite
Mark Kromer
Microsoft Azure Cloud Data Architect
@kromerbigdata
@mssqldude
What is Big Data Analytics?
Tech Target: “… the process of examining large data sets to uncover hidden patterns, unknown correlations, market
trends, customer preferences and other useful business information.”
Techopedia: “… the strategy of analyzing large volumes of data, or big data. This big data is gathered from a wide
variety of sources, including social networks, videos, digital images, sensors, and sales transaction records. The
aim in analyzing all this data is to uncover patterns and connections that might otherwise be invisible, and that
might provide valuable insights about the users who created it. Through this insight, businesses may be able to
gain an edge over their rivals and make superior business decisions.”
 Requires lots of data wrangling and Data Engineers
 Requires Data Scientists to uncover patterns from
complex raw data
 Requires Business Analysts to provide business value
from multiple data sources
 Requires additional tools and infrastructure not
provided by traditional database and BI technologies
Why Cloud for Big Data Analytics?
• Quick and easy to stand up new, large big data architectures
• Elastic scale
• Metered pricing
• Quickly evolve architectures to rapidly changing landscapes
• Prototype, tear down
Big Data Analytics Tools & Use Cases
vs. “Traditional BI”
Traditional BI
• Sales reports
• Post-campaign marketing research & analysis
• CRM reports
• Enterprise data assets
• Can’t miss any transactions, records or rows
• DWs
• Relational Databases
• Well-defined and formatted data sources
• Direct connections to OLTP and LOB data sources
• Excel
• Well-defined business semantic models
• OLAP cubes
• MDM, Data Quality, Data Governance
Big Data Analytics
• Sentiment Analysis
• Predictive Maintenance
• Churn Analytics
• Customer Analytics
• Real-time marketing
• Avoid simply siphoning off data for BI tools
• Architect multiple paths for data pipelines: speed,
batch, analytical
• Plan for data of varying types, volumes and formats
• Data can/will land at any time, any speed, any format
• It’s OK to miss a few records and data points
• NoSQL
• MPP DWs
• Hadoop, Spark, Storm
• R & ML to find patterns in masses of data lakes
• Key Values / JSON / CSV
• Compress files
• Columnar
• Land raw data fast
• Data Wrangle/Munge/Engineer
• Find patterns
• Prepare for business models
• Present to business decision makers
A few basic fundamentals
Big Data Analytics in the Cloud
• Collect and land data in the lake
• Process data pipelines (stream, batch, analysis)
• Presentation layer: surface knowledge to business decision makers
Azure Data Platform-at-a-glance
Action — People · Automated Systems · Apps (Web, Mobile, Bots)
Intelligence — Dashboards & Visualizations · Cortana · Bot Framework · Cognitive Services · Power BI
Information Management — Event Hubs · Data Catalog · Data Factory
Machine Learning and Analytics — HDInsight (Hadoop and Spark) · Stream Analytics · Data Lake Analytics · Machine Learning
Big Data Stores — SQL Data Warehouse · Data Lake Store
Data Sources — Apps · Sensors and devices · Data
Azure Data Factory
What it is: A pipeline system to move data in, perform activities on data, move data around, and move data out
When to use it:
• Create solutions using multiple tools as a single process
• Orchestrate processes – scheduling
• Monitor and manage pipelines
• Call and re-train Azure ML models
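To make the "pipeline system" idea concrete: an ADF pipeline is declared as a JSON document that names datasets and an ordered list of activities. The sketch below mirrors that shape as a Python dict; the pipeline name, dataset names, and dates are hypothetical, and this is an illustration of the structure rather than the exact ADF schema.

```python
# Illustrative sketch of an ADF-style pipeline definition: datasets flow
# through an ordered list of activities. All names here are made up.
import json

pipeline = {
    "name": "CustomerChurnPipeline",            # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "TransformCallLogs",
                "type": "HDInsightHive",        # run a Hive script on HDInsight
                "inputs": [{"name": "RawCallLogs"}],
                "outputs": [{"name": "ChurnTable"}],
            }
        ],
        # scheduling window for the pipeline's activity slices
        "start": "2016-01-01T00:00:00Z",
        "end": "2016-02-01T00:00:00Z",
    },
}

print(json.dumps(pipeline, indent=2))
```

The key point is the separation of concerns: datasets describe where data lives, activities describe what happens to it, and the service handles scheduling and monitoring.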
ADF Components
ADF Logical Flow
Example – Customer Churn
Azure Data Factory data sources: Call Log Files, Customer Table
Ingest (Call Log Files, Customer Table) → Transform & Analyze (Customer Call Details) → Publish (Customers Likely to Churn, Customer Churn Table)
Simple ADF
• Business Goal: Transform and Analyze Web Logs each month
• Design Process: Transform Raw Weblogs, using a Hive Query,
storing the results in Blob Storage
Web logs loaded to Blob → HDInsight Hive query to transform log entries → Files ready for analysis and use in Azure ML
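The Hive step in this design is essentially "parse raw log lines into structured rows, then aggregate." A minimal Python sketch of that transformation is below; the log format and field names are assumed for illustration, not taken from a real weblog schema.

```python
# Minimal sketch of the kind of transformation the Hive query performs:
# parse raw web-log lines into structured records, then aggregate hits per
# page. The log line format here is a simplified, assumed one.
from collections import Counter

raw_logs = [
    "2016-03-01 10:00:01 GET /index.html 200",
    "2016-03-01 10:00:02 GET /products.html 200",
    "2016-03-01 10:00:05 GET /index.html 404",
]

def parse(line):
    # schema-on-read: structure is imposed when the line is read
    date, time, method, path, status = line.split()
    return {"date": date, "method": method, "path": path, "status": int(status)}

records = [parse(line) for line in raw_logs]
hits_per_page = Counter(r["path"] for r in records)
print(hits_per_page["/index.html"])  # 2
```

In the actual pipeline, the same parse-then-aggregate step runs as a Hive query over files in Blob Storage instead of an in-memory list.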
Azure SQL Data Warehouse
What it is: A scaling Data Warehouse service in the cloud
When to use it:
• When you need a large-data BI solution in the cloud
• MPP SQL Server in the cloud
• Elastic-scale data warehousing
• When you need pause-able scale-out compute
Elastic scale & performance
Real-time elasticity:
• Resize in under 1 minute
• On-demand compute: expand or reduce as needed
• Pause the Data Warehouse to save on compute costs, e.g., during non-business hours
Storage can be as big or
small as required
Users can execute niche workloads
without re-scanning data
Elastic scale & performance
Scale
Logical overview
SELECT COUNT_BIG(*)
FROM dbo.[FactInternetSales];
The control node receives the query and distributes it; each compute node runs the same statement in parallel over its own portion of the data.
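The control/compute split above can be sketched in a few lines: each compute node produces a partial count over the rows it holds, and the control node sums the partials. The row counts and partitioning are invented purely to show the pattern.

```python
# Sketch of how an MPP warehouse evaluates COUNT_BIG(*): each compute node
# counts its own distribution of rows, and the control node combines the
# partial results. Data and partitioning here are made up.
partitions = [
    [("row", i) for i in range(100)],   # rows held by compute node 1
    [("row", i) for i in range(250)],   # rows held by compute node 2
    [("row", i) for i in range(150)],   # rows held by compute node 3
]

partial_counts = [len(p) for p in partitions]   # runs in parallel, one per node
total = sum(partial_counts)                     # control node combines results
print(total)  # 500
```

This is why the same `SELECT COUNT_BIG(*)` appears on every node in the diagram: the query itself is shipped to the data, not the data to the query.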
Azure Data Lake
What it is: Data storage (WebHDFS) and distributed data processing engines (Hive, Spark, HBase, Storm, U-SQL)
When to use it:
• Low-cost, high-throughput data store
• Non-relational data
• Larger storage limits than Blobs
• Ingest all data regardless of requirements
• Store all data in native format without schema definition
• Do analysis using analytic engines like Hadoop and ADLA
Supports: interactive queries, batch queries, machine learning, data warehousing, real-time analytics, devices
Azure Data Lake (Store, HDInsight, Analytics)
• Storage: ADL Store (WebHDFS)
• Resource management: YARN
• Processing: ADL Analytics (U-SQL) and HDInsight (Hive)
No limits to SCALE
Store ANY DATA in its native format
HADOOP FILE SYSTEM (HDFS) for the cloud
Optimized for analytic workload
PERFORMANCE
ENTERPRISE GRADE authentication, access
control, audit, encryption at rest
Azure Data Lake
Store
A hyperscale repository for big data analytics workloads
Introducing ADLS
Enterprise-grade
Limitless scale
Productivity from day one
Easy and powerful data preparation
All data
Developing big data apps
Author, debug, & optimize big
data apps
in Visual Studio
Multiple Languages
U-SQL, Hive, & Pig
Seamlessly integrate .NET
Work across all cloud data: Azure Data Lake Analytics can query Azure SQL DW, Azure SQL DB, Azure Storage Blobs, Azure Data Lake Store, and SQL DB in an Azure VM.
What is
U-SQL?
A hyper-scalable, highly extensible
language for preparing, transforming and
analyzing all data
Allows users to focus on the what—not
the how—of business problems
Built on familiar languages (SQL and
C#) and supported by a fully integrated
development environment
Built for data developers & scientists
U-SQL language philosophy
Declarative query and transformation language:
• Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL
Analytics functions
• Optimizable, scalable
Operates on unstructured & structured data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up:
• Type system is based on C#
• Expression language is C#
User-defined functions (U-SQL and C#)
User-defined types (U-SQL/C#) (future)
User-defined aggregators (C#)
User-defined operators (UDO) (C#)
U-SQL provides the parallelization and scale-out framework for user code
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINERS
Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable
Federated query across distributed data sources (soon)
REFERENCE MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
, last_order DateTime, order_count int
, order_amount float );
@o = EXTRACT oid int, cid int, odate DateTime, amount float
FROM "/input/orders.txt"
USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
FROM "/input/customers.txt"
USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder
, MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
, SUM(o.amount) AS totalamount
FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
WHERE c.city.StartsWith("New")
&& MyNamespace.MyFunction(o.odate) > 10
GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
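For readers newer to U-SQL, the join-and-aggregate above can be restated in plain Python. This sketch computes the same per-customer first/last order dates, order count, and total amount for customers in cities starting with "New"; the sample rows are invented for illustration.

```python
# The join + aggregate from the U-SQL script, sketched in plain Python.
# Sample data is invented; only the shape of the computation matters.
from collections import defaultdict

orders = [  # (oid, cid, odate, amount)
    (1, 10, "2016-01-05", 20.0),
    (2, 10, "2016-02-01", 35.0),
    (3, 11, "2016-01-20", 15.0),
]
customers = [(10, "Alice", "New York"), (11, "Bob", "Seattle")]

by_cust = defaultdict(list)
for oid, cid, odate, amount in orders:
    by_cust[cid].append((odate, amount))

result = {}
for cid, name, city in customers:
    if not city.startswith("New"):      # WHERE c.city.StartsWith("New")
        continue
    rows = by_cust.get(cid, [])
    dates = [d for d, _ in rows]
    result[cid] = {
        "first_order": min(dates) if dates else None,   # MIN(o.odate)
        "last_order": max(dates) if dates else None,    # MAX(o.odate)
        "order_count": len(rows),                       # COUNT(o.oid)
        "order_amount": sum(a for _, a in rows),        # SUM(o.amount)
    }

print(result[10]["order_count"], result[10]["order_amount"])  # 2 55.0
```

The difference in practice is that U-SQL's optimizer parallelizes and scales this computation across many nodes, while the Python version runs on one machine.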
Expression-flow programming style
• Automatic "in-lining" of U-SQL expressions – the whole script leads to a single execution model
• Execution plan that is optimized out of the box and without user intervention
• Per-job and user-driven parallelization
• Detailed visibility into execution steps for debugging
• Heat-map functionality to identify performance bottlenecks
“Unstructured” Files
• Schema on Read
• Write to File
• Built-in and custom Extractors and
Outputters
• ADL Storage and Azure Blob
Storage
EXTRACT Expression
@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv();
• Built-in Extractors: Csv, Tsv, Text with lots of options
• Custom Extractors: e.g., JSON, XML, etc.
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in Outputters: Csv, Tsv, Text
• Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io)
Filepath URIs
• Relative URI to default ADL Storage account: "filepath/file.csv"
• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"
Visual Studio integration
What can you do with Visual Studio?
• Author U-SQL scripts (with C# code)
• Create metadata objects
• Browse the metadata catalog
• Submit and cancel U-SQL jobs
• Debug U-SQL and C# code
• Visualize and replay the progress of a job
• Visualize the physical plan of a U-SQL query
• Fine-tune query performance
(Delivered as a Visual Studio plug-in)
Authoring U-SQL queries
Visual Studio fully supports
authoring U-SQL scripts
While editing, it provides IntelliSense, syntax color coding, syntax checking, contextual menus, and more.
Job execution graph
After a job is submitted, its progress through the different execution stages is shown and updated continuously. Important stats about the job are also displayed and updated continuously.
Job diagnostics
Diagnostics information
is shown to help with
debugging and
performance issues
HDInsight: Cloud Managed Hadoop
What it is: Microsoft's implementation of Apache Hadoop (as a service) that uses Blobs for persistent storage
When to use it:
• When you need to process large-scale data (PB+)
• When you want to use Hadoop or Spark as a service
• When you want to compute data and retire the servers, but retain the results
• When your team is familiar with the Hadoop zoo
Hadoop and HDInsight
Using the Hadoop Ecosystem to
process and query data
HDInsight Tools for Visual Studio
Deploying HDInsight Clusters
• Cluster Type: Hadoop, Spark, HBase and Storm.
• Hadoop clusters: for query and analysis workloads
• HBase clusters: for NoSQL workloads
• Spark clusters: for in-memory processing, interactive queries, stream, and machine learning workloads
• Operating System: Windows or Linux
• Can be deployed from Azure portal, Azure Command Line
Interface (CLI), or Azure PowerShell and Visual Studio
• A UI dashboard is provided to the cluster through Ambari.
• Remote Access through SSH, REST API, ODBC, JDBC.
• Remote Desktop (RDP) access for Windows clusters
Azure Machine Learning
What it is: A multi-platform environment and engine to create and deploy Machine Learning models and APIs
When to use it:
• When you need to create predictive analytics
• When you need to share Data Science experiments across teams
• When you need to create callable APIs for ML functions
• When you also have R and Python experience on your Data Science team
Creating an Experiment
Create Workspace → Build and Model (Get/Prepare Data → Build/Edit Experiment → Create/Update Model → Evaluate Model Results) → Deploy Model → Consume Model
Basic Azure ML Elements
Import Data
Preprocess
Algorithm
Train Model
Split Data
Score Model
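The split/train/score elements above form a simple dataflow. The sketch below runs that flow with a deliberately trivial "model" (predict the mean of the training labels) just to show the shape of an experiment; it does not represent any real Azure ML module's behavior.

```python
# Sketch of the Split Data -> Train Model -> Score Model -> Evaluate flow,
# using a trivial mean-predictor as the "model". Data is synthetic.
data = [(x, 2 * x) for x in range(10)]        # (feature, label) pairs

split = int(len(data) * 0.7)                  # Split Data: 70/30
train, test = data[:split], data[split:]

mean_label = sum(y for _, y in train) / len(train)   # "Train Model"

def score(x):                                 # Score Model
    return mean_label                         # constant predictor

errors = [abs(score(x) - y) for x, y in test] # Evaluate Model
print(len(train), len(test))  # 7 3
```

In Azure ML Studio the same wiring is done visually: each module's output port feeds the next module's input port.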
Power BI
What it is: Interactive report and visualization creation for computing and mobile platforms
When to use it:
• When you need to create and view interactive reports that combine multiple datasets
• When you need to embed reporting into an application
• When you need customizable visualizations
• When you need to create shared datasets, reports, and dashboards that you publish to your team
Common architectural patterns
Big Data Analytics – Data Flow
Event Ingestion Patterns
Business apps, custom apps, and sensors and devices emit events into Azure Event Hubs or Kafka; raw events and transformed data land in Azure Data Lake Store.
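The ingestion pattern can be sketched in-process: producers push events onto a queue (standing in for Event Hubs or Kafka), and a consumer lands each raw event in one store and a transformed copy in another (standing in for Azure Data Lake Store). The sensor data and the Celsius-to-Fahrenheit transform are invented for illustration.

```python
# In-process sketch of the event ingestion pattern: queue as Event Hubs/Kafka,
# lists as the raw and transformed landing zones in the lake.
from queue import Queue

event_hub = Queue()
raw_store, transformed_store = [], []

# Producers: apps and sensors emit events onto the queue
for reading in [21.5, 22.0, 19.8]:
    event_hub.put({"sensor": "temp-01", "value": reading})

# Consumer: land the raw event, then a transformed version (C -> F)
while not event_hub.empty():
    event = event_hub.get()
    raw_store.append(event)
    transformed_store.append({**event, "value_f": event["value"] * 9 / 5 + 32})

print(len(raw_store))  # 3
```

Keeping the raw events alongside the transformed data is deliberate: the lake retains the original records so later pipelines can reprocess them with different transforms.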
Bulk Ingestion and Preparation
Business apps, custom apps, and sensors and devices are bulk-loaded into the platform via Azure Data Factory.
Big Data Lambda Architecture
• Data collection: event & data producers (applications, web and social, devices, sensors) via cloud gateways (web APIs) and field gateways
• Queuing system: Event Hubs, Kafka/RabbitMQ/ActiveMQ
• Data storage: Blob Storage, HBase, DocumentDB, MongoDB, SQL Azure, ADW
• Data transformation: Storm / Stream Analytics, Hive / U-SQL, Pig, Azure ML, Data Factory
• Presentation and action: Azure Search, data analytics tools (Excel, Power BI, Looker, Tableau), web/thick client dashboards, live dashboards, devices to take action
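The lambda pattern behind this architecture is simple to state: a batch layer periodically recomputes a complete view over all data, a speed layer maintains an incremental view of events since the last batch run, and queries merge the two. A minimal sketch, with per-key event counts as the metric and invented event data:

```python
# Sketch of the lambda pattern: batch view over all master data, speed view
# over recent events, merged at query time by the serving layer.
from collections import Counter

master_data = ["a", "b", "a", "c"]        # all events seen so far (batch input)
recent_events = ["a", "c", "c"]           # events since the last batch run

batch_view = Counter(master_data)         # batch layer: full recomputation
speed_view = Counter(recent_events)       # speed layer: incremental updates

def query(key):
    # serving layer merges the batch and real-time views
    return batch_view[key] + speed_view[key]

print(query("a"), query("c"))  # 3 3
```

In the Azure mapping above, the batch path runs through Hive/U-SQL over the lake, the speed path through Storm or Stream Analytics, and the dashboards query the merged result.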
Get started
today!
http://aka.ms/cisolutions
Cortana Intelligence Solutions
Cortana Intelligence Solutions: Discover
http://aka.ms/cisolutions
Cortana Intelligence Solutions: Try
Cortana Intelligence Solutions: Deploy
Editor's Notes

  • #9 What you can do with it: https://azure.microsoft.com/en-us/overview/what-is-azure/ Platform: http://microsoftazure.com Storage: https://azure.microsoft.com/en-us/documentation/services/storage/ Networking: https://azure.microsoft.com/en-us/documentation/services/virtual-network/ Security: https://azure.microsoft.com/en-us/documentation/services/active-directory/ Services: https://azure.microsoft.com/en-us/documentation/articles/best-practices-scalability-checklist/ Virtual Machines: https://azure.microsoft.com/en-us/documentation/services/virtual-machines/windows/ and https://azure.microsoft.com/en-us/documentation/services/virtual-machines/linux/ PaaS: https://azure.microsoft.com/en-us/documentation/services/app-service/
  • #10 Azure Data Factory: http://azure.microsoft.com/en-us/services/data-factory/
  • #11 Pricing: https://azure.microsoft.com/en-us/pricing/details/data-factory/
  • #12 Learning Path: https://azure.microsoft.com/en-us/documentation/articles/data-factory-introduction/ Quick Example: http://azure.microsoft.com/blog/2015/04/24/azure-data-factory-update-simplified-sample-deployment/
  • #13 Video of this process: https://azure.microsoft.com/en-us/documentation/videos/azure-data-factory-102-analyzing-complex-churn-models-with-azure-data-factory/
  • #14 More options: Prepare System: https://azure.microsoft.com/en-us/documentation/articles/data-factory-build-your-first-pipeline-using-editor/ - Follow steps Another Lab: https://azure.microsoft.com/en-us/documentation/articles/data-factory-samples/
  • #15 Azure SQL Data Warehouse: http://azure.microsoft.com/en-us/services/sql-data-warehouse/
  • #16 15
  • #17 16
  • #20 Azure Data Lake: http://azure.microsoft.com/en-us/campaigns/data-lake/
  • #24 All data Unstructured, Semi structured, Structured Domain-specific user defined types using C# Queries over Data Lake and Azure Blobs Federated Queries over Operational and DW SQL stores removing the complexity of ETL Productive from day one Effortless scale and performance without need to manually tune/configure Best developer experience throughout development lifecycle for both novices and experts Leverage your existing skills with SQL and .NET Easy and powerful data preparation Easy to use built-in connectors for common data formats Simple and rich extensibility model for adding customer – specific data transformation – both existing and new No limits scale Scales on demand with no change to code Automatically parallelizes SQL and custom code Designed to process petabytes of data Enterprise grade Managing, securing, sharing, and discovery of familiar data and code objects (tables, functions etc.) Role based authorization of Catalogs and storage accounts using AAD security Auditing of catalog objects (databases, tables etc.)
  • #26 ADLA allows you to compute on data anywhere and a join data from multiple cloud sources.
  • #28 Use for language experts
  • #38 Azure HDInsight: http://azure.microsoft.com/en-us/services/hdinsight/
  • #39 Primary site: https://azure.microsoft.com/en-us/services/hdinsight/ Quick overview: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/ 4-week online course through the edX platform: https://www.edx.org/course/processing-big-data-azure-hdinsight-microsoft-dat202-1x 11 minute introductory video: https://channel9.msdn.com/Series/Getting-started-with-Windows-Azure-HDInsight-Service/Introduction-To-Windows-Azure-HDInsight-Service Microsoft Virtual Academy Training (4 hours) - https://mva.microsoft.com/en-US/training-courses/big-data-analytics-with-hdinsight-hadoop-on-azure-10551?l=UJ7MAv97_5804984382 Learning path for HDInsight: https://azure.microsoft.com/en-us/documentation/learning-paths/hdinsight-self-guided-hadoop-training/ Azure Feature Pack for SQL Server 2016, i.e., SSIS (SQL Server Integration Services): https://msdn.microsoft.com/en-us/library/mt146770(v=sql.130).aspx
  • #45 Azure Portal: http://azure.portal.com Provisioning Clusters: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-provision-clusters/ Different clusters have different node types, number of nodes, and node sizes.
  • #46 Azure Machine Learning: http://azure.microsoft.com/en-us/services/machine-learning/
  • #47 Beginning Series: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-for-beginners-the-5-questions-data-science-answers/
  • #48 Designing an experiment in the Studio: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-what-is-ml-studio/
  • #51 Power BI: https://powerbi.microsoft.com/
  • #63 Customize yourself or with featured partners