Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
Principal Program Manager, Azure Data
@MikeDoesBigData
Agenda
• What is Apache Spark?
• Why .NET for Apache Spark?
• What is .NET for Apache Spark?
• Demos
• How does it perform?
• Where does it run?
• Special Announcement & Call to Action
 Apache Spark is an OSS fast analytics engine for big data and machine learning
 Improves efficiency through:
   General computation graphs beyond map/reduce
   In-memory computing primitives
 Allows developers to scale out their user code & write in their language of choice
   Rich APIs in Java, Scala, Python, R, SparkSQL, etc.
   Batch processing, streaming, and interactive shell
 Available on Azure via Azure Synapse, Azure Databricks, Azure HDInsight, IaaS/Kubernetes
.NET Developers 💖 Apache Spark…
• A lot of business logic usable for big data (millions of lines of code) is written in .NET!
• It is expensive and difficult to translate into Python/Scala/Java!
• .NET developers are locked out of big data processing due to the lack of .NET support in OSS big data solutions.
• In a recent .NET developer survey (> 1,000 developers), more than 70% expressed interest in Apache Spark!
• They would like to tap into the OSS ecosystem for code libraries, support, and hiring.
Goal: .NET for Apache Spark aims to provide .NET developers a first-class experience when working with Apache Spark.
Non-Goal: converting existing Scala/Python/Java Spark developers.
We are developing it in the open!
Contributions to foundational OSS projects:
• Apache Spark Core: SPARK-28271, SPARK-28278, SPARK-28283, SPARK-28282, SPARK-28284,
SPARK-28319, SPARK-28238, SPARK-28856, SPARK-28970, SPARK-29279, SPARK-29373
• Apache Arrow: ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737,
ARROW-4543, ARROW-4435, ARROW-4503, ARROW-4717, ARROW-4337, ARROW-5887,
ARROW-5908, ARROW-6314, ARROW-6682
• Pyrolite (Pickling Library): Improve pickling/unpickling performance, Add a Strong Name to
Pyrolite, Improve Pickling Performance, Hash set handling, Improve unpickling performance
.NET for Apache Spark is open source
• Website: https://dot.net/spark
• GitHub: https://github.com/dotnet/spark
• Frequent releases (about every 6 weeks), current release v0.12.1
• Integrates with .NET Interactive (https://github.com/dotnet/interactive) and
nteract/Jupyter
Spark project improvement proposals:
• Interop support for Spark language extensions: SPARK-26257
• .NET bindings for Apache Spark: SPARK-27006
Journey so far
• ~2k GitHub unique visitors/wk
• ~8k GitHub page views/wk
• 260 GitHub issues closed
• 246 GitHub PRs merged
• 127k NuGet downloads
• 39 GitHub contributors
Customer Success: O365’s MSAI
Microsoft Search, Assistant & Intelligence Team: Towards Modern Workspaces in O365 (Scale: ~50 TB)
Job: Build ML/deep models on top of substrate data to infuse intelligence into Office 365 products. Our data resides in Azure Data Lake Storage. We cook/featurize data that in turn gets fed into our ML models.
Why Spark.NET? Given that our business logic (e.g., featurizers, tokenizers for normalizing text) is written in C#, Spark.NET is an ideal candidate for our workloads. We leverage Spark.NET to run those libraries at scale.
Experience: Very promising; a stable & highly vibrant community that is helping us iterate with the agility we want. Looking forward to a longer working relationship and broader adoption across Substrate Intelligence / MSAI.
.NET provides full-spectrum Spark support
• Spark DataFrames with SparkSQL: works with Spark v2.3.x/v2.4.x and includes ~300 SparkSQL functions, Grouped Map, Delta Lake, and .NET Spark UDFs
• Batch & streaming: including Spark Structured Streaming and all Spark-supported data sources (see the streaming sketch below)
• .NET Standard 2.0: works with .NET Framework v4.6.1+ and .NET Core v2.1/v3.1 and includes C#/F# support
• Data science: including access to ML.NET and interactive notebooks with a C# REPL
• Speed & productivity: performance-optimized interop, as fast or faster than PySpark; support for HW vectorization
Examples: https://github.com/dotnet/spark/examples
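As a taste of the Structured Streaming support mentioned above, a minimal C# sketch (the socket source on localhost:9999 and the word-count logic are illustrative, not from the deck):

using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Streaming;
using static Microsoft.Spark.Sql.Functions;

class StreamingWordCount
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("streaming-sketch").GetOrCreate();

        // Read a stream of text lines from a socket source (start one with: nc -lk 9999).
        DataFrame lines = spark.ReadStream()
            .Format("socket")
            .Option("host", "localhost")
            .Option("port", 9999)
            .Load();

        // Split each line into words and count occurrences per word.
        DataFrame counts = lines
            .Select(Explode(Split(lines["value"], " ")).Alias("word"))
            .GroupBy("word")
            .Count();

        // Print the running counts to the console after each micro-batch.
        StreamingQuery query = counts.WriteStream()
            .OutputMode("complete")
            .Format("console")
            .Start();

        query.AwaitTermination();
    }
}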
Introduction to Spark Programming: DataFrame

UserId   State   Salary
Terry    WA      XX
Rahul    WA      XX
Dan      WA      YY
Tyson    CA      ZZ
Ankit    WA      YY
Michael  WA      YY
.NET for Apache Spark programmability

var spark = SparkSession.Builder().GetOrCreate();
var df = spark.Read().Json("input.json");

// Define the UDF before using it in the query below.
var concat = Udf<int?, string, string>((age, name) => name + age);

df.Filter(df["age"] > 21)
  .Select(concat(df["age"], df["name"]))
  .Show();
Language comparison: TPC-H Query 2

Scala:

val europe = region.filter($"r_name" === "EUROPE")
  .join(nation, $"r_regionkey" === nation("n_regionkey"))
  .join(supplier, $"n_nationkey" === supplier("s_nationkey"))
  .join(partsupp, supplier("s_suppkey") === partsupp("ps_suppkey"))

val brass = part.filter(part("p_size") === 15
    && part("p_type").endsWith("BRASS"))
  .join(europe, europe("ps_partkey") === $"p_partkey")

val minCost = brass.groupBy(brass("ps_partkey"))
  .agg(min("ps_supplycost").as("min"))

brass.join(minCost, brass("ps_partkey") === minCost("ps_partkey"))
  .filter(brass("ps_supplycost") === minCost("min"))
  .select("s_acctbal", "s_name", "n_name",
          "p_partkey", "p_mfgr", "s_address",
          "s_phone", "s_comment")
  .sort($"s_acctbal".desc,
        $"n_name", $"s_name", $"p_partkey")
  .limit(100)
  .show()

C#:

var europe = region.Filter(Col("r_name") == "EUROPE")
  .Join(nation, Col("r_regionkey") == nation["n_regionkey"])
  .Join(supplier, Col("n_nationkey") == supplier["s_nationkey"])
  .Join(partsupp, supplier["s_suppkey"] == partsupp["ps_suppkey"]);

var brass = part.Filter(part["p_size"] == 15
    & part["p_type"].EndsWith("BRASS"))
  .Join(europe, europe["ps_partkey"] == Col("p_partkey"));

var minCost = brass.GroupBy(brass["ps_partkey"])
  .Agg(Min("ps_supplycost").As("min"));

brass.Join(minCost, brass["ps_partkey"] == minCost["ps_partkey"])
  .Filter(brass["ps_supplycost"] == minCost["min"])
  .Select("s_acctbal", "s_name", "n_name",
          "p_partkey", "p_mfgr", "s_address",
          "s_phone", "s_comment")
  .Sort(Col("s_acctbal").Desc(),
        Col("n_name"), Col("s_name"), Col("p_partkey"))
  .Limit(100)
  .Show();

Similar syntax – dangerously copy/paste friendly! Watch for:
• $"col_name" (Scala) vs. Col("col_name") (C#)
• Capitalization (filter vs. Filter)
• Equality operators (=== in Scala vs. == in C#)
Demo 1: Getting started locally
Submitting a Spark Application

spark-submit (Scala):

spark-submit `
  --class <user-app-main-class> `
  --master local `
  <path-to-user-jar> `
  <argument(s)-to-your-app>

spark-submit (.NET):

spark-submit `
  --class org.apache.spark.deploy.DotnetRunner `
  --master local `
  <path-to-microsoft-spark-jar> `
  <path-to-your-app-exe> <argument(s)-to-your-app>

The microsoft-spark JAR is provided by the .NET for Apache Spark library; the app executable is provided by the user and contains the business logic.
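For orientation, a minimal sketch of what <path-to-your-app-exe> might contain (the app name and the use of args[0] as an input path are illustrative assumptions):

using Microsoft.Spark.Sql;

namespace MySparkApp
{
    class Program
    {
        // Entry point launched by org.apache.spark.deploy.DotnetRunner;
        // args receives the trailing <argument(s)-to-your-app> from spark-submit.
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession.Builder()
                .AppName("my-spark-app")
                .GetOrCreate();

            DataFrame df = spark.Read().Json(args[0]);  // e.g., a path passed on the command line
            df.PrintSchema();
            df.Show();

            spark.Stop();
        }
    }
}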
Demo 2: Locally debugging a .NET for Spark App

spark-submit --class org.apache.spark.deploy.DotnetRunner `
  --master local <path-to-microsoft-spark-jar> `
  debug
Debugging User-defined Code
https://github.com/dotnet/spark/pull/294
Step 1: Write your app code.
Step 2: set DOTNET_WORKER_DEBUG=1, then run spark-submit with the debug argument.
Step 3: Switch to your app code, add a breakpoint at your business logic, and press F5.
Step 4: In the "Choose Just-In-Time Debugger" dialog, choose "New Instance of …" and select your app code CS file.
Step 5: That's it! Have fun 
Demo 3: Twitter analysis in the Cloud
What is happening when you write .NET Spark code?

Your .NET program uses the .NET for Apache Spark bindings to build a DataFrame/SparkSQL operation tree, which is handed to Spark. What happens next depends on one question: did you define a .NET UDF?
• No: regular execution path (no .NET runtime during execution) – same speed as with Scala Spark.
• Yes: interop between Spark and .NET – faster than with PySpark.
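To make the two paths concrete, a hedged sketch (the table and column names are illustrative): the first query contains no .NET UDF and executes entirely inside the JVM, while the second registers a C# lambda as a UDF and therefore routes rows through the .NET worker.

using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Assumes an existing SparkSession `spark` and a JSON file with age/name/state columns.
DataFrame people = spark.Read().Json("people.json");

// No .NET UDF: the whole plan runs in the JVM – same speed as Scala Spark.
people.Filter(people["age"] > 21)
      .GroupBy("state")
      .Count()
      .Show();

// With a .NET UDF: each task ships rows to Microsoft.Spark.Worker for execution.
var shout = Udf<string, string>(name => name.ToUpperInvariant());
people.Select(shout(people["name"]).Alias("name_upper")).Show();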
Performance: Worker-side Interop

On each Spark worker node, the Spark Executor (JVM) cooperates with Microsoft.Spark.Worker (CLR):
1. Run a task with a UDF (Spark Executor, JVM).
2. Launch the worker executable (Microsoft.Spark.Worker, CLR).
3. Serialize the UDFs & data and send them to the worker.
4. Execute the user-defined operations (against the .NET UDF library and the user Spark library).
5. Write the serialized result rows back to the executor.

Challenge: how to serialize data between JVM & CLR?
• Pickling: row-oriented (default)
• Apache Arrow: column-oriented
df.GroupBy("age")
.Apply(
new StructType(new[]
{
new StructField("age", new IntegerType()),
new StructField("nameCharCount", new IntegerType())
}),
batch => CountCharacters(batch, "age", "name"))
.Show();
Simplifying experience with Arrow
Previous experience (raw Apache Arrow RecordBatch):

private static RecordBatch CountCharacters(
    RecordBatch records,
    string groupColName,
    string summaryColName)
{
    int summaryColIndex = records.Schema.GetFieldIndex(summaryColName);
    StringArray stringValues = records.Column(summaryColIndex) as StringArray;
    int charCount = 0;
    for (int i = 0; i < stringValues.Length; ++i)
    {
        charCount += stringValues.GetString(i).Length;
    }
    int groupColIndex = records.Schema.GetFieldIndex(groupColName);
    Field groupCol = records.Schema.GetFieldByIndex(groupColIndex);
    return new RecordBatch(
        new Schema.Builder()
            .Field(groupCol)
            .Field(f => f.Name(summaryColName).DataType(Int32Type.Default))
            .Build(),
        new IArrowArray[]
        {
            records.Column(groupColIndex),
            new Int32Array.Builder().Append(charCount).Build()
        },
        records.Length);
}

New experience (FxDataFrame is an alias for Microsoft.Data.Analysis.DataFrame):

private static FxDataFrame CountCharacters(
    FxDataFrame df,
    string groupColName,
    string summaryColName)
{
    int charCount = 0;
    for (long i = 0; i < df.RowCount; ++i)
    {
        charCount += ((string)df[summaryColName][i]).Length;
    }
    return new FxDataFrame(new[] {
        new PrimitiveColumn<int>(groupColName,
            new[] { (int?)df[groupColName][0] }),
        new PrimitiveColumn<int>(summaryColName,
            new[] { charCount }) });
}
Performance – warm cluster runs for Pickling serialization (for Arrow improvements, see the next slide)
• Takeaway 1: where UDF performance does not matter, .NET is on par with Python.
• Takeaway 2: where UDF performance is critical, .NET is ~2x faster than Python!

Performance – warm cluster runs for C#: Pickling vs. Arrow serialization
• Takeaway: since Q1 is interop-bound, we see a 33% perf improvement with better serialization.

Performance – warm cluster runs for Arrow serialization in C# vs. Python
• Takeaway: since the serialization inefficiencies have been removed, we are left with similar perf across languages – if you like .NET, you can stick with .NET 
Works everywhere!
• Cross platform: Windows, Ubuntu, macOS
• Cross cloud: Azure & AWS Databricks, AWS EMR Spark, Azure HDInsight Spark
• Installed out of the box: Azure Synapse
• Installation docs on GitHub
Using .NET for Spark in Azure Synapse
Batch Submission
https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/spark-dotnet
• cd mySparkApp
• dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.16.04-x64
• Zip the publish folder
• Upload the ZIP file to your cloud storage
Submission fields:
• Language: selects the semantics of the submission fields
• ZIP file: contains the Spark application, including UDF DLLs, and even the Spark or .NET runtime if a different version is needed
• Main program: the Unix executable name
• Program parameters: as needed
• Additional resource and library files that are not included in the ZIP (e.g., shared DLLs, config files)
Using .NET for Spark in Azure Synapse
Notebooks with .NET Interactive
• The Language setting selects the type of notebook: Interactive C#
• The Spark context spark is built in
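A hedged sketch of a first notebook cell using the built-in spark session (the storage path is illustrative):

// No SparkSession.Builder() needed in a Synapse .NET notebook – `spark` already exists.
var df = spark.Read().Csv("abfss://data@myaccount.dfs.core.windows.net/people.csv");
df.Show(5);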
Using .NET for Spark in Azure Synapse
Notebooks with .NET Interactive – importing NuGet packages
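Packages are pulled into a notebook cell with the .NET Interactive #r nuget directive; a minimal example (the package choice is illustrative):

// Restore a NuGet package directly into the notebook session.
#r "nuget: Microsoft.Data.Analysis"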
Using .NET for Spark in Azure Databricks
• Not available out of the box but can be used in batch submission
• https://github.com/dotnet/spark/blob/master/deployment/README.md#databricks
Note: Traditional Databricks notebooks are proprietary and cannot integrate .NET.
Please contact @Databricks if you want to use it out of the box 
VSCode extension for Spark .NET
Author:
• Spark .NET project creation
• Dependency packaging
• Language service
• Sample code
Run:
• Reference management
• Spark local run
• Spark cluster run (e.g., HDInsight)
Debug

Extension to VSCode
 Tap into VSCode for C# programming
 Automate Maven and Spark dependencies for environment setup
 Facilitate first project success through project templates and sample code
 Support Spark local run and cluster run
 Integrate with Azure for HDInsight cluster navigation
 Azure Databricks integration planned
ANNOUNCING: .NET for Apache Spark v1.0 is released!
 First-class C# and F# bindings to Apache Spark, bringing the power of big data analytics to .NET developers
• Apache Spark 2.4/3.0
• DataFrames, Structured Streaming, Delta Lake
• .NET Standard 2.0, C# and F#
• ML.NET
• Performance optimized with Apache Arrow and HW vectorization
• First-class integration in Azure Synapse: batch submission and interactive .NET notebooks
Learn more at http://dot.net/Spark
What's next?
• More programming experiences in .NET (UDAF, UDT support, multi-language UDFs)
• Spark data connectors in .NET (e.g., Apache Kafka, Azure Blob Store, Azure Data Lake)
• Tooling experiences (e.g., Jupyter/nteract, VS Code, Visual Studio, others?)
• Idiomatic experiences for C# and F# (LINQ, Type Provider)
• Out-of-box experiences (Azure Synapse, Azure HDInsight, Azure Databricks, Cosmos DB Spark, SQL 2019 BDC, …)
Go to https://github.com/dotnet/spark and let us know what is important to you!
Call to action: Engage, use & guide us!
Related session:
• Big Data and Data Warehousing Together with Azure Synapse Analytics
Useful links:
• http://github.com/dotnet/spark
• https://www.nuget.org/packages/Microsoft.Spark
• https://aka.ms/GoDotNetForSpark
• https://docs.microsoft.com/dotnet/spark
Website:
• https://dot.net/spark (Request a Demo!)
Starter videos – .NET for Apache Spark 101:
• Watch on YouTube
• Watch on Channel 9
Available out of the box on Azure Synapse & Azure HDInsight Spark
Running .NET for Spark anywhere: https://aka.ms/InstallDotNetForSpark
You & @MikeDoesBigData #DotNetForSpark
© Copyright Microsoft Corporation. All rights reserved.
Editor's Notes

  • #10 “Spark.Net team helped enhance the user experience which was a major issue for adoption in Satori”
  • #11 No RDD support.