Why Scala for Data
Science?
HELLO!
I am Guglielmo
Iozzia
I am here because I love AI and the
With the Best conference series
You can follow me at
@GuglielmoIozzia
2
Something about me
✘ Big Data Delivery Lead at
(UHG)
✘ Previously at and of the UN
✘ Current fields of expertise are Big
Data, ML/DL and DevOps
✘ Author of the upcoming book “Hands-on
Deep Learning with Apache Spark”
✘ I love preparing
home-made pizza
3
What is Scala?
Let’s get everyone on the same
page
The Scala PL
Scala is a programming language
that blends object-oriented and
functional programming concepts on
the JVM.
5
Functional Programming
✘ In FP you write pure functions.
✘ Given the same input, a pure function
always returns the same output and
produces no side effects.
✘ A function is first-class: it can be used
like a value of any other type.
✘ That means it can be assigned to
a variable, passed as a parameter to
another function or returned by a
function.
6
Functional Programming
in Scala
An example of
functional
programming in
Scala.
7
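The screenshot for this slide did not survive extraction, so here is a minimal sketch of the ideas listed on the previous slide, using only the Scala standard library. The names `FunctionalSketch`, `square`, `increment`, `applyTwice` and `multiplier` are illustrative assumptions, not the code shown in the talk.

```scala
object FunctionalSketch {
  // Pure function: the same input always gives the same output, with no side effects
  def square(x: Int): Int = x * x

  // A function assigned to a variable
  val increment: Int => Int = _ + 1

  // A function passed as a parameter to another function
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // A function returned by a function
  def multiplier(factor: Int): Int => Int = x => x * factor

  def main(args: Array[String]): Unit = {
    println(applyTwice(increment, 5))   // prints 7
    println(multiplier(3)(4))           // prints 12
    println(List(1, 2, 3).map(square))  // prints List(1, 4, 9)
  }
}
```

Because `square`, `increment` and the function returned by `multiplier` are ordinary values, they can be handed to `map` or `applyTwice` just like an `Int` or a `String`.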
Why Scala for Data
Science?
Let’s move towards the main topic
of this talk
The Python Temptation
When it comes to Data Science the first programming
language people take into consideration is Python.
9
Here are three valid reasons to
consider Scala.
10
#1 Robustness
Robustness and performance when it
comes to production systems and
large datasets.
11
#2 Integration
Most of the systems/tools in the
Big Data/ML space run on the JVM.
12
Think about these systems
you most probably have in
your production tech stack.
They all run in JVMs.
13
#3 Libraries
Good availability of production-ready
Open Source ML/DL
frameworks and libraries.
14
Scala Open Source Projects for AI/ML/DL
✘ Spark MLlib: Spark’s library for ML
algorithms, feature extraction,
dimensionality reduction, linear
algebra, etc.
✘ ND4J: a linear algebra and matrix
manipulation library that supports
n-dimensional arrays and is integrated
with Apache Hadoop and Spark.
15
Scala Open Source Projects for AI/ML/DL
✘ DeepLearning4J: a distributed
deep-learning framework written for Java
and Scala. It is integrated with Hadoop
and Apache Spark, for use on
distributed GPUs and CPUs.
✘ BigDL: a distributed deep learning
framework for Apache Spark, created
at Intel.
16
Scala Open Source Projects for AI/ML/DL
✘ XGBoost: a scalable, portable and
distributed Gradient Boosting library.
✘ PredictionIO: an Apache template
system for creating machine learning
engines.
✘ Smile: a fast and comprehensive
machine learning system.
✘ Saddle: a high-performance data
manipulation library.
17
Scala Open Source Projects for AI/ML/DL
✘ Deeplearning.scala: a simple library
for creating complex neural networks.
It can be used either in standalone
JVM applications or Jupyter
Notebooks.
✘ ScalaNLP: a suite of ML and
numerical computing libraries. It
includes Breeze and Epic.
18
Code Examples
Let’s get practical!
import org.nd4j.linalg.factory.Nd4j

object Nd4JScalaSample {
  def main(args: Array[String]): Unit = {
    // Create arrays using the numpy syntax
    val arr1 = Nd4j.create(4)
    val arr2 = Nd4j.linspace(1, 10, 10)
    // Fill an array with the value 5 (equivalent to the fill method in numpy)
    println(s"${arr1.assign(5)} Assigned value of 5 to the array")
    // Basic stats methods
    println(s"${Nd4j.mean(arr1)} Calculate mean of array")
    println(s"${Nd4j.std(arr2)} Calculate standard deviation of array")
    // `var` is a Scala keyword, so the method name needs backticks
    println(s"${Nd4j.`var`(arr2)} Calculate variance")
    // ...
  }
}
ND4J Example
ND4J aims to give JVM
languages the kind of
powerful data analysis
tools that Python
programmers already
have.
20
DL4J Example (1 of 3)
Multilayer Neural
Network
configuration in
Scala with DL4J.
21
DL4J Example (2 of 3)
Network
initialization and
training in Scala
with DL4J.
22
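The screenshots for the configuration and training slides are not available, so what follows is only a rough sketch of what configuring, initializing and training a multilayer network in Scala with DL4J can look like. It assumes a DL4J 1.0.0-beta style API on the classpath. The object name `Dl4jXorSketch`, the toy XOR data, the layer sizes and the hyperparameters are all illustrative assumptions, not the network shown in the talk.

```scala
import org.deeplearning4j.nn.conf.NeuralNetConfiguration
import org.deeplearning4j.nn.conf.layers.{DenseLayer, OutputLayer}
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.deeplearning4j.nn.weights.WeightInit
import org.nd4j.linalg.activations.Activation
import org.nd4j.linalg.dataset.DataSet
import org.nd4j.linalg.factory.Nd4j
import org.nd4j.linalg.learning.config.Adam
import org.nd4j.linalg.lossfunctions.LossFunctions

object Dl4jXorSketch {
  // Configuration: 2 inputs -> 4 hidden units -> 2 output classes (assumed sizes)
  val conf = new NeuralNetConfiguration.Builder()
    .seed(42)
    .weightInit(WeightInit.XAVIER)
    .updater(new Adam(0.05))
    .list()
    .layer(0, new DenseLayer.Builder()
      .nIn(2).nOut(4)
      .activation(Activation.RELU)
      .build())
    .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
      .nIn(4).nOut(2)
      .activation(Activation.SOFTMAX)
      .build())
    .build()

  def main(args: Array[String]): Unit = {
    // Toy XOR data (illustrative only): four samples with one-hot labels
    val features = Nd4j.create(
      Array(Array(0f, 0f), Array(0f, 1f), Array(1f, 0f), Array(1f, 1f)))
    val labels = Nd4j.create(
      Array(Array(1f, 0f), Array(0f, 1f), Array(0f, 1f), Array(1f, 0f)))
    val data = new DataSet(features, labels)

    // Initialization and training
    val model = new MultiLayerNetwork(conf)
    model.init()
    for (_ <- 1 to 200) model.fit(data)

    // Class probabilities for each of the four inputs
    println(model.output(features))
  }
}
```

In a real project the in-memory `DataSet` would typically be replaced by a dataset iterator over the actual training data.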
DL4J Example (3 of 3)
The DL4J web UI
(training time).
23
Can Scala and Python
co-exist in Data Science
projects?
Is there any bridge between these
two worlds?
139,000
Results of a Google search for MNN models
implemented with TensorFlow
8,330,000
Results of a generic Google search for models
implemented with TensorFlow
120,000
Results of a Google search for MNN examples
implemented with TensorFlow
25
TensorFlow Pros and Cons
✘ Big community
✘ Lots of models, examples and use
cases available
✘ Stunning features
Mostly Python. The Java API is currently
experimental and is not covered by the
TensorFlow API stability guarantees.
26
Keras to the Rescue
✘ It is an open source neural network
library written in Python
✘ It can run on top of TensorFlow (and
other backend engines)
✘ Easy prototyping
✘ Lightweight
✘ Can be used to import Python models
to DL4J
27
TensorFlow + Keras + DL4J
28
Importing Keras Models
into DL4J: example
DL4J provides a
Java/Scala API to
import a pre-trained
TensorFlow model
through Keras.
29
Importing Keras Models
into DL4J: example
The imported model
can then be used in
a DL4J application
implemented in
Java or Scala only.
30
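The screenshots for the two import slides are missing, so here is a hedged sketch of the import path they describe. It assumes the `deeplearning4j-modelimport` module is on the classpath and that the Keras model is a Sequential model saved as a single HDF5 file. The file name `model.h5`, the input width of 784 and the names `KerasImportSketch` and `load` are hypothetical placeholders.

```scala
import org.deeplearning4j.nn.modelimport.keras.KerasModelImport
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.nd4j.linalg.factory.Nd4j

object KerasImportSketch {
  // Loads a Keras Sequential model (architecture and weights) from an HDF5 file
  def load(path: String): MultiLayerNetwork =
    KerasModelImport.importKerasSequentialModelAndWeights(path)

  def main(args: Array[String]): Unit = {
    // "model.h5" is a hypothetical path; pass the real one as the first argument
    val model = load(args.headOption.getOrElse("model.h5"))

    // Assumed input: one row of 784 features; adjust to the imported model
    val input = Nd4j.rand(1, 784)
    println(model.output(input))
  }
}
```

Once loaded, the model is a regular `MultiLayerNetwork`, so the rest of the application needs no Python runtime.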
Conclusion
Bridging the Gap between Data
Engineers and Data Scientists
The Missing Link
Data Engineers
• Scala/Java skills
and experience
• Hands-on Big Data
and Streaming tools
(Hadoop, HBase,
Spark, Kafka, Beam,
etc.)
• DevOps mindset
• Attention to testing,
performance,
scalability
• Containerization
• Often no skills in
ML/DL
Data Scientists
• Strong ML/DL skills
• Python and R users
• Good data
understanding
• Model training and
evaluation strategies
• Probably some knowledge
of Big Data and
Streaming tools
• No DevOps mindset
• Research more than
production
32
To Leverage the Specific Skills of Each Team
DL4J
Keras
TensorFlow
Data Engineers Data Scientists
33
To Leverage the Specific Skills of Each Team
Keras
Scala
(DL4J)
TensorFlow
(Python)
34
Hands-on Deep Learning
with Apache Spark
More on some topics
covered in this talk
can be found in this
book.
https://tinyurl.com/y9jkvtuy
35
THANK
YOU!
Any questions?
You can find me at
✘ @GuglielmoIozzia
✘ https://ie.linkedin.com/in/giozzia
✘ googlielmo.blogspot.com/
✘ https://dzone.com/users/2532948/virtualramblas.html
36
Credits
Special thanks to all the people who made
and released these awesome resources for
free:
✘ Presentation template by SlidesCarnival
✘ The painting in slide 9 is a detail of “Eve
Tempted” (1887) by John Roddam
Spencer Stanhope
37
