Big Data Analytics in the Cloud with Microsoft Azure
www.globalbigdataconference.com
Twitter : @bigdataconf
Big Data Analytics in the
Cloud
Microsoft Azure
Cortana Intelligence Suite
Mark Kromer
Microsoft Azure Cloud Data Architect
@kromerbigdata
@mssqldude
What is Big Data Analytics?
Tech Target: “… the process of examining large data sets to uncover hidden patterns, unknown correlations, market
trends, customer preferences and other useful business information.”
Techopedia: “… the strategy of analyzing large volumes of data, or big data. This big data is gathered from a wide
variety of sources, including social networks, videos, digital images, sensors, and sales transaction records. The
aim in analyzing all this data is to uncover patterns and connections that might otherwise be invisible, and that
might provide valuable insights about the users who created it. Through this insight, businesses may be able to
gain an edge over their rivals and make superior business decisions.”
 Requires lots of data wrangling and Data Engineers
 Requires Data Scientists to uncover patterns from
complex raw data
 Requires Business Analysts to provide business value
from multiple data sources
 Requires additional tools and infrastructure not
provided by traditional database and BI technologies
Why Cloud for Big Data Analytics?
• Quick and easy to stand up new, large big data architectures
• Elastic scale
• Metered pricing
• Quickly evolve architectures to rapidly changing landscapes
• Prototype, tear down
Big Data Analytics Tools & Use Cases
vs. “Traditional BI”
Traditional BI
• Sales reports
• Post-campaign marketing research & analysis
• CRM reports
• Enterprise data assets
• Can’t miss any transactions, records or rows
• DWs
• Relational Databases
• Well-defined and formatted data sources
• Direct connections to OLTP and LOB data sources
• Excel
• Well-defined business semantic models
• OLAP cubes
• MDM, Data Quality, Data Governance
Big Data Analytics
• Sentiment Analysis
• Predictive Maintenance
• Churn Analytics
• Customer Analytics
• Real-time marketing
• Avoid simply siphoning off data for BI tools
• Architect multiple paths for data pipelines: speed,
batch, analytical
• Plan for data of varying types, volumes and formats
• Data can/will land at any time, any speed, any format
• It’s OK to miss a few records and data points
• NoSQL
• MPP DWs
• Hadoop, Spark, Storm
• R & ML to find patterns in masses of data lakes
• Key Values / JSON / CSV
• Compress files
• Columnar
• Land raw data fast
• Data Wrangle/Munge/Engineer
• Find patterns
• Prepare for business models
• Present to business decision makers
A few basic fundamentals
Big Data Analytics in the Cloud
• Collect and land data in the lake
• Process data pipelines (stream, batch, analysis)
• Presentation layer: surface knowledge to business decision makers
Azure Data Platform-at-a-glance
Action — People · Automated Systems · Apps (Web, Mobile, Bots)
Intelligence — Dashboards & Visualizations · Cortana · Bot Framework · Cognitive Services · Power BI
Information Management — Event Hubs · Data Catalog · Data Factory
Machine Learning and Analytics — HDInsight (Hadoop and Spark) · Stream Analytics · Data Lake Analytics · Machine Learning
Big Data Stores — SQL Data Warehouse · Data Lake Store
Data Sources — Apps · Sensors and devices · Data
Azure Data Factory
What it is: A pipeline system to move data in, perform activities on data, move data around, and move data out
When to use it:
• Create solutions using multiple tools as a single process
• Orchestrate processes – scheduling
• Monitor and manage pipelines
• Call and re-train Azure ML models
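To make the "pipeline system" idea concrete: an ADF pipeline is declared as a JSON document that names datasets and an ordered list of activities. The sketch below mirrors that shape as a Python dict; the pipeline name, dataset names, and dates are hypothetical, and this is an illustration of the structure rather than the exact ADF schema.

```python
# Illustrative sketch of an ADF-style pipeline definition: datasets flow
# through an ordered list of activities. All names here are made up.
import json

pipeline = {
    "name": "CustomerChurnPipeline",            # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "TransformCallLogs",
                "type": "HDInsightHive",        # run a Hive script on HDInsight
                "inputs": [{"name": "RawCallLogs"}],
                "outputs": [{"name": "ChurnTable"}],
            }
        ],
        # scheduling window for the pipeline's activity slices
        "start": "2016-01-01T00:00:00Z",
        "end": "2016-02-01T00:00:00Z",
    },
}

print(json.dumps(pipeline, indent=2))
```

The key point is the separation of concerns: datasets describe where data lives, activities describe what happens to it, and the service handles scheduling and monitoring.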
ADF Components
ADF Logical Flow
Example – Customer Churn
Azure Data Factory data sources: Call Log Files, Customer Table
Ingest (Call Log Files, Customer Table) → Transform & Analyze (Customer Call Details) → Publish (Customers Likely to Churn, Customer Churn Table)
Simple ADF
• Business Goal: Transform and Analyze Web Logs each month
• Design Process: Transform Raw Weblogs, using a Hive Query,
storing the results in Blob Storage
Web logs loaded to Blob → HDInsight Hive query to transform log entries → Files ready for analysis and use in Azure ML
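The Hive step in this design is essentially "parse raw log lines into structured rows, then aggregate." A minimal Python sketch of that transformation is below; the log format and field names are assumed for illustration, not taken from a real weblog schema.

```python
# Minimal sketch of the kind of transformation the Hive query performs:
# parse raw web-log lines into structured records, then aggregate hits per
# page. The log line format here is a simplified, assumed one.
from collections import Counter

raw_logs = [
    "2016-03-01 10:00:01 GET /index.html 200",
    "2016-03-01 10:00:02 GET /products.html 200",
    "2016-03-01 10:00:05 GET /index.html 404",
]

def parse(line):
    # schema-on-read: structure is imposed when the line is read
    date, time, method, path, status = line.split()
    return {"date": date, "method": method, "path": path, "status": int(status)}

records = [parse(line) for line in raw_logs]
hits_per_page = Counter(r["path"] for r in records)
print(hits_per_page["/index.html"])  # 2
```

In the actual pipeline, the same parse-then-aggregate step runs as a Hive query over files in Blob Storage instead of an in-memory list.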
Azure SQL Data Warehouse
What it is: A scaling Data Warehouse service in the cloud
When to use it:
• When you need a large-data BI solution in the cloud
• MPP SQL Server in the cloud
• Elastic-scale data warehousing
• When you need pause-able scale-out compute
Elastic scale & performance
Real-time elasticity:
• Resize in under 1 minute
• On-demand compute: expand or reduce as needed
• Pause the Data Warehouse to save on compute costs, e.g., during non-business hours
Storage can be as big or
small as required
Users can execute niche workloads
without re-scanning data
Elastic scale & performance
Scale
Logical overview
SELECT COUNT_BIG(*)
FROM dbo.[FactInternetSales];
The control node receives the query and distributes it; each compute node runs the same statement in parallel over its own portion of the data.
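The control/compute split above can be sketched in a few lines: each compute node produces a partial count over the rows it holds, and the control node sums the partials. The row counts and partitioning are invented purely to show the pattern.

```python
# Sketch of how an MPP warehouse evaluates COUNT_BIG(*): each compute node
# counts its own distribution of rows, and the control node combines the
# partial results. Data and partitioning here are made up.
partitions = [
    [("row", i) for i in range(100)],   # rows held by compute node 1
    [("row", i) for i in range(250)],   # rows held by compute node 2
    [("row", i) for i in range(150)],   # rows held by compute node 3
]

partial_counts = [len(p) for p in partitions]   # runs in parallel, one per node
total = sum(partial_counts)                     # control node combines results
print(total)  # 500
```

This is why the same `SELECT COUNT_BIG(*)` appears on every node in the diagram: the query itself is shipped to the data, not the data to the query.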
Azure Data Lake
What it is: Data storage (WebHDFS) and distributed data processing engines (Hive, Spark, HBase, Storm, U-SQL)
When to use it:
• Low-cost, high-throughput data store
• Non-relational data
• Larger storage limits than Blobs
• Ingest all data regardless of requirements
• Store all data in native format without schema definition
• Do analysis using analytic engines like Hadoop and ADLA
Supports: interactive queries, batch queries, machine learning, data warehousing, real-time analytics, devices
Azure Data Lake (Store, HDInsight, Analytics)
• Storage: ADL Store (WebHDFS)
• Resource management: YARN
• Processing: ADL Analytics (U-SQL) and HDInsight (Hive)
No limits to SCALE
Store ANY DATA in its native format
HADOOP FILE SYSTEM (HDFS) for the cloud
Optimized for analytic workload
PERFORMANCE
ENTERPRISE GRADE authentication, access
control, audit, encryption at rest
Azure Data Lake
Store
A hyperscale repository for big data analytics workloads
Introducing ADLS
Enterprise-grade
Limitless scale
Productivity from day one
Easy and powerful data preparation
All data
Developing big data apps
Author, debug, & optimize big
data apps
in Visual Studio
Multiple Languages
U-SQL, Hive, & Pig
Seamlessly integrate .NET
Work across all cloud data: Azure Data Lake Analytics can query Azure SQL DW, Azure SQL DB, Azure Storage Blobs, Azure Data Lake Store, and SQL DB in an Azure VM.
What is
U-SQL?
A hyper-scalable, highly extensible
language for preparing, transforming and
analyzing all data
Allows users to focus on the what—not
the how—of business problems
Built on familiar languages (SQL and
C#) and supported by a fully integrated
development environment
Built for data developers & scientists
U-SQL language philosophy
Declarative query and transformation language:
• Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL
Analytics functions
• Optimizable, scalable
Operates on unstructured & structured data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up:
• Type system is based on C#
• Expression language is C#
User-defined functions (U-SQL and C#)
User-defined types (U-SQL/C#) (future)
User-defined aggregators (C#)
User-defined operators (UDO) (C#)
U-SQL provides the parallelization and scale-out framework for user code
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINERS
Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable
Federated query across distributed data sources (soon)
REFERENCE MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
, last_order DateTime, order_count int
, order_amount float );
@o = EXTRACT oid int, cid int, odate DateTime, amount float
FROM "/input/orders.txt"
USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
FROM "/input/customers.txt"
USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder
, MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
, SUM(o.amount) AS totalamount
FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
WHERE c.city.StartsWith("New")
&& MyNamespace.MyFunction(o.odate) > 10
GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
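For readers newer to U-SQL, the join-and-aggregate above can be restated in plain Python. This sketch computes the same per-customer first/last order dates, order count, and total amount for customers in cities starting with "New"; the sample rows are invented for illustration.

```python
# The join + aggregate from the U-SQL script, sketched in plain Python.
# Sample data is invented; only the shape of the computation matters.
from collections import defaultdict

orders = [  # (oid, cid, odate, amount)
    (1, 10, "2016-01-05", 20.0),
    (2, 10, "2016-02-01", 35.0),
    (3, 11, "2016-01-20", 15.0),
]
customers = [(10, "Alice", "New York"), (11, "Bob", "Seattle")]

by_cust = defaultdict(list)
for oid, cid, odate, amount in orders:
    by_cust[cid].append((odate, amount))

result = {}
for cid, name, city in customers:
    if not city.startswith("New"):      # WHERE c.city.StartsWith("New")
        continue
    rows = by_cust.get(cid, [])
    dates = [d for d, _ in rows]
    result[cid] = {
        "first_order": min(dates) if dates else None,   # MIN(o.odate)
        "last_order": max(dates) if dates else None,    # MAX(o.odate)
        "order_count": len(rows),                       # COUNT(o.oid)
        "order_amount": sum(a for _, a in rows),        # SUM(o.amount)
    }

print(result[10]["order_count"], result[10]["order_amount"])  # 2 55.0
```

The difference in practice is that U-SQL's optimizer parallelizes and scales this computation across many nodes, while the Python version runs on one machine.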
Expression-flow programming style
• Automatic "in-lining" of U-SQL expressions – the whole script leads to a single execution model
• Execution plan that is optimized out of the box and without user intervention
• Per-job and user-driven parallelization
• Detailed visibility into execution steps for debugging
• Heat-map functionality to identify performance bottlenecks
“Unstructured” Files
• Schema on Read
• Write to File
• Built-in and custom Extractors and
Outputters
• ADL Storage and Azure Blob
Storage
EXTRACT Expression
@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv();
• Built-in Extractors: Csv, Tsv, Text with lots of options
• Custom Extractors: e.g., JSON, XML, etc.
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in Outputters: Csv, Tsv, Text
• Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io)
Filepath URIs
• Relative URI to default ADL Storage account: "filepath/file.csv"
• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"
Visual Studio integration
What can you do with Visual Studio?
• Author U-SQL scripts (with C# code)
• Create metadata objects
• Browse the metadata catalog
• Submit and cancel U-SQL jobs
• Debug U-SQL and C# code
• Visualize and replay the progress of a job
• Visualize the physical plan of a U-SQL query
• Fine-tune query performance
(Delivered as a Visual Studio plug-in)
Authoring U-SQL queries
Visual Studio fully supports
authoring U-SQL scripts
While editing, it provides IntelliSense, syntax color coding, syntax checking, contextual menus, and more.
Job execution graph
After a job is submitted, its progress through the different execution stages is shown and updated continuously. Important stats about the job are also displayed and updated continuously.
Job diagnostics
Diagnostics information
is shown to help with
debugging and
performance issues
HDInsight: Cloud Managed Hadoop
What it is: Microsoft's implementation of Apache Hadoop (as a service) that uses Blobs for persistent storage
When to use it:
• When you need to process large-scale data (PB+)
• When you want to use Hadoop or Spark as a service
• When you want to compute data and retire the servers, but retain the results
• When your team is familiar with the Hadoop zoo
Hadoop and HDInsight
Using the Hadoop Ecosystem to
process and query data
HDInsight Tools for Visual Studio
Deploying HDInsight Clusters
• Cluster Type: Hadoop, Spark, HBase and Storm.
• Hadoop clusters: for query and analysis workloads
• HBase clusters: for NoSQL workloads
• Spark clusters: for in-memory processing, interactive queries, stream, and machine learning workloads
• Operating System: Windows or Linux
• Can be deployed from Azure portal, Azure Command Line
Interface (CLI), or Azure PowerShell and Visual Studio
• A UI dashboard is provided to the cluster through Ambari.
• Remote Access through SSH, REST API, ODBC, JDBC.
• Remote Desktop (RDP) access for Windows clusters
Azure Machine Learning
What it is: A multi-platform environment and engine to create and deploy Machine Learning models and APIs
When to use it:
• When you need to create predictive analytics
• When you need to share Data Science experiments across teams
• When you need to create callable APIs for ML functions
• When you also have R and Python experience on your Data Science team
Creating an Experiment
Create Workspace → Build and Model (Get/Prepare Data → Build/Edit Experiment → Create/Update Model → Evaluate Model Results) → Deploy Model → Consume Model
Basic Azure ML Elements
Import Data
Preprocess
Algorithm
Train Model
Split Data
Score Model
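The split/train/score elements above form a simple dataflow. The sketch below runs that flow with a deliberately trivial "model" (predict the mean of the training labels) just to show the shape of an experiment; it does not represent any real Azure ML module's behavior.

```python
# Sketch of the Split Data -> Train Model -> Score Model -> Evaluate flow,
# using a trivial mean-predictor as the "model". Data is synthetic.
data = [(x, 2 * x) for x in range(10)]        # (feature, label) pairs

split = int(len(data) * 0.7)                  # Split Data: 70/30
train, test = data[:split], data[split:]

mean_label = sum(y for _, y in train) / len(train)   # "Train Model"

def score(x):                                 # Score Model
    return mean_label                         # constant predictor

errors = [abs(score(x) - y) for x, y in test] # Evaluate Model
print(len(train), len(test))  # 7 3
```

In Azure ML Studio the same wiring is done visually: each module's output port feeds the next module's input port.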
Power BI
What it is: Interactive report and visualization creation for computing and mobile platforms
When to use it:
• When you need to create and view interactive reports that combine multiple datasets
• When you need to embed reporting into an application
• When you need customizable visualizations
• When you need to create shared datasets, reports, and dashboards that you publish to your team
Common architectural patterns
Big Data Analytics – Data Flow
Event Ingestion Patterns
Business apps, custom apps, and sensors and devices emit events into Azure Event Hubs or Kafka; raw events and transformed data land in Azure Data Lake Store.
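The ingestion pattern can be sketched in-process: producers push events onto a queue (standing in for Event Hubs or Kafka), and a consumer lands each raw event in one store and a transformed copy in another (standing in for Azure Data Lake Store). The sensor data and the Celsius-to-Fahrenheit transform are invented for illustration.

```python
# In-process sketch of the event ingestion pattern: queue as Event Hubs/Kafka,
# lists as the raw and transformed landing zones in the lake.
from queue import Queue

event_hub = Queue()
raw_store, transformed_store = [], []

# Producers: apps and sensors emit events onto the queue
for reading in [21.5, 22.0, 19.8]:
    event_hub.put({"sensor": "temp-01", "value": reading})

# Consumer: land the raw event, then a transformed version (C -> F)
while not event_hub.empty():
    event = event_hub.get()
    raw_store.append(event)
    transformed_store.append({**event, "value_f": event["value"] * 9 / 5 + 32})

print(len(raw_store))  # 3
```

Keeping the raw events alongside the transformed data is deliberate: the lake retains the original records so later pipelines can reprocess them with different transforms.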
Bulk Ingestion and Preparation
Business apps, custom apps, and sensors and devices are bulk-loaded into the platform via Azure Data Factory.
Big Data Lambda Architecture
• Data collection: event & data producers (applications, web and social, devices, sensors) via cloud gateways (web APIs) and field gateways
• Queuing system: Event Hubs, Kafka/RabbitMQ/ActiveMQ
• Data storage: Blob Storage, HBase, DocumentDB, MongoDB, SQL Azure, ADW
• Data transformation: Storm / Stream Analytics, Hive / U-SQL, Pig, Azure ML, Data Factory
• Presentation and action: Azure Search, data analytics tools (Excel, Power BI, Looker, Tableau), web/thick client dashboards, live dashboards, devices to take action
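The lambda pattern behind this architecture is simple to state: a batch layer periodically recomputes a complete view over all data, a speed layer maintains an incremental view of events since the last batch run, and queries merge the two. A minimal sketch, with per-key event counts as the metric and invented event data:

```python
# Sketch of the lambda pattern: batch view over all master data, speed view
# over recent events, merged at query time by the serving layer.
from collections import Counter

master_data = ["a", "b", "a", "c"]        # all events seen so far (batch input)
recent_events = ["a", "c", "c"]           # events since the last batch run

batch_view = Counter(master_data)         # batch layer: full recomputation
speed_view = Counter(recent_events)       # speed layer: incremental updates

def query(key):
    # serving layer merges the batch and real-time views
    return batch_view[key] + speed_view[key]

print(query("a"), query("c"))  # 3 3
```

In the Azure mapping above, the batch path runs through Hive/U-SQL over the lake, the speed path through Storm or Stream Analytics, and the dashboards query the merged result.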
Get started
today!
http://aka.ms/cisolutions
Cortana Intelligence Solutions
Cortana Intelligence Solutions: Discover
http://aka.ms/cisolutions
Cortana Intelligence Solutions: Try
Cortana Intelligence Solutions: Deploy
Editor's Notes

  • #9 What you can do with it: https://azure.microsoft.com/en-us/overview/what-is-azure/ Platform: http://microsoftazure.com Storage: https://azure.microsoft.com/en-us/documentation/services/storage/ Networking: https://azure.microsoft.com/en-us/documentation/services/virtual-network/ Security: https://azure.microsoft.com/en-us/documentation/services/active-directory/ Services: https://azure.microsoft.com/en-us/documentation/articles/best-practices-scalability-checklist/ Virtual Machines: https://azure.microsoft.com/en-us/documentation/services/virtual-machines/windows/ and https://azure.microsoft.com/en-us/documentation/services/virtual-machines/linux/ PaaS: https://azure.microsoft.com/en-us/documentation/services/app-service/
  • #10 Azure Data Factory: http://azure.microsoft.com/en-us/services/data-factory/
  • #11 Pricing: https://azure.microsoft.com/en-us/pricing/details/data-factory/
  • #12 Learning Path: https://azure.microsoft.com/en-us/documentation/articles/data-factory-introduction/ Quick Example: http://azure.microsoft.com/blog/2015/04/24/azure-data-factory-update-simplified-sample-deployment/
  • #13 Video of this process: https://azure.microsoft.com/en-us/documentation/videos/azure-data-factory-102-analyzing-complex-churn-models-with-azure-data-factory/
  • #14 More options: Prepare System: https://azure.microsoft.com/en-us/documentation/articles/data-factory-build-your-first-pipeline-using-editor/ - Follow steps Another Lab: https://azure.microsoft.com/en-us/documentation/articles/data-factory-samples/
  • #15 Azure SQL Data Warehouse: http://azure.microsoft.com/en-us/services/sql-data-warehouse/
  • #16 15
  • #17 16
  • #20 Azure Data Lake: http://azure.microsoft.com/en-us/campaigns/data-lake/
  • #24 All data Unstructured, Semi structured, Structured Domain-specific user defined types using C# Queries over Data Lake and Azure Blobs Federated Queries over Operational and DW SQL stores removing the complexity of ETL Productive from day one Effortless scale and performance without need to manually tune/configure Best developer experience throughout development lifecycle for both novices and experts Leverage your existing skills with SQL and .NET Easy and powerful data preparation Easy to use built-in connectors for common data formats Simple and rich extensibility model for adding customer – specific data transformation – both existing and new No limits scale Scales on demand with no change to code Automatically parallelizes SQL and custom code Designed to process petabytes of data Enterprise grade Managing, securing, sharing, and discovery of familiar data and code objects (tables, functions etc.) Role based authorization of Catalogs and storage accounts using AAD security Auditing of catalog objects (databases, tables etc.)
  • #26 ADLA allows you to compute on data anywhere and a join data from multiple cloud sources.
  • #28 Use for language experts
  • #38 Azure HDInsight: http://azure.microsoft.com/en-us/services/hdinsight/
  • #39 Primary site: https://azure.microsoft.com/en-us/services/hdinsight/ Quick overview: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/ 4-week online course through the edX platform: https://www.edx.org/course/processing-big-data-azure-hdinsight-microsoft-dat202-1x 11 minute introductory video: https://channel9.msdn.com/Series/Getting-started-with-Windows-Azure-HDInsight-Service/Introduction-To-Windows-Azure-HDInsight-Service Microsoft Virtual Academy Training (4 hours) - https://mva.microsoft.com/en-US/training-courses/big-data-analytics-with-hdinsight-hadoop-on-azure-10551?l=UJ7MAv97_5804984382 Learning path for HDInsight: https://azure.microsoft.com/en-us/documentation/learning-paths/hdinsight-self-guided-hadoop-training/ Azure Feature Pack for SQL Server 2016, i.e., SSIS (SQL Server Integration Services): https://msdn.microsoft.com/en-us/library/mt146770(v=sql.130).aspx
  • #45 Azure Portal: http://azure.portal.com Provisioning Clusters: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-provision-clusters/ Different clusters have different node types, number of nodes, and node sizes.
  • #46 Azure Machine Learning: http://azure.microsoft.com/en-us/services/machine-learning/
  • #47 Beginning Series: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-for-beginners-the-5-questions-data-science-answers/
  • #48 Designing an experiment in the Studio: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-what-is-ml-studio/
  • #51 Power BI: https://powerbi.microsoft.com/
  • #63 Customize yourself or with featured partners