Spark 101 - First steps to distributed computing
Demi Ben-Ari
10/2015
About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile defense and Alert System - “Ofek” unit - IAF
Agenda
 What is Spark?
 Spark Infrastructure and Basics
 Spark Features and Suite
 Development with Spark
 Conclusion
What is Spark?
Fast and Expressive Cluster Computing Engine, Compatible with Apache Hadoop
 Efficient
◦ General execution graphs
◦ In-memory storage
 Usable
◦ Rich APIs in Java, Scala, Python
◦ Interactive shell
What is Spark?
 Apache Spark is a general-purpose cluster computing framework
 Spark computes both in memory and on disk
 Apache Spark offers both low-level and high-level APIs
About Spark project
 Spark was founded at UC Berkeley, and its main contributor is Databricks
 Interactive Spark shells in Scala and Python
◦ (spark-shell, pyspark)
 Currently stable at version 1.5
Spark Philosophy
 Make life easy and productive for data scientists
 Well-documented, expressive APIs
 Powerful domain-specific libraries
 Easy integration with storage systems
◦ … and caching to avoid data movement
 Predictable releases, stable APIs
◦ A stable release every 3 months
Unified Tools Platform
 Spark SQL
 GraphX
 MLlib (Machine Learning)
 Spark Streaming
All built on top of Spark Core
Spark Core Features
 Distributed in-memory computation
 Standalone and local capabilities
 History server for the Spark UI
 Resource management integration
 Unified job submission tool
Spark Contributors
 Highly active open source community
(09/2015)
◦ https://github.com/apache/spark/
 https://www.openhub.net/p/apache-spark
Spark Petabyte Sort
Basic Terms
 Cluster (Master, Slaves)
 Driver
 Executors
 Spark Context
 RDD – Resilient Distributed Dataset
Resilient Distributed Datasets
Spark execution engine
 Spark uses lazy evaluation
 The code runs only when Spark encounters an action operation
 There is no need to design and write a single complex map-reduce job
 In Spark we can write smaller, more manageable operations
◦ Spark will group operations together
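The lazy-evaluation model can be sketched in plain Python. This is an illustrative toy, not the real RDD implementation: transformations such as map and filter only record a step, and nothing executes until an action such as collect or count is called.

```python
# Toy sketch of Spark's lazy evaluation -- NOT actual Spark code.
# Transformations record work; actions run the whole recorded pipeline.

class LazyDataset:
    def __init__(self, data, steps=None):
        self._data = data
        self._steps = steps or []          # recorded transformations

    # --- transformations: lazy, just return a new dataset ---
    def map(self, fn):
        return LazyDataset(self._data, self._steps + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._data, self._steps + [("filter", pred)])

    # --- actions: actually execute the recorded pipeline ---
    def collect(self):
        out = list(self._data)
        for kind, fn in self._steps:
            if kind == "map":
                out = [fn(x) for x in out]
            else:  # filter
                out = [x for x in out if fn(x)]
        return out

    def count(self):
        return len(self.collect())


rdd = LazyDataset(range(10))
squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet -- only two steps were recorded:
print(len(squares._steps))   # 2
print(squares.collect())     # [0, 4, 16, 36, 64]
```

Grouping several small transformations into one pipeline run is exactly what makes many small, manageable operations cheap in Spark.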
Spark execution engine
 Spark serializes your code and ships it to the executors
◦ You can choose the serialization method (Java serialization, Kryo)
 In Java, functions are specified as objects that implement one of Spark’s Function interfaces
◦ The same approach works in Scala and Python as well
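Switching to Kryo is a single setting; a minimal sketch of conf/spark-defaults.conf (the serializer class name as documented for Spark 1.x):

```
spark.serializer    org.apache.spark.serializer.KryoSerializer
```

The same property can be set programmatically on a SparkConf before creating the context.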
Spark Execution - UI
Persistence layers for Spark
 Distributed system
◦ Hadoop (HDFS)
◦ Local file system
◦ Amazon S3
◦ Cassandra
◦ Hive
◦ HBase
 File formats
◦ Text file
 CSV, TSV, Plain Text
◦ Sequence File
◦ Avro
◦ Parquet
History Server
 Can run on all Spark deployments
◦ Standalone, YARN, Mesos
 Integrates with both YARN and Mesos
 On YARN / Mesos, run the history server as a daemon
Job Submission Tool
 ./bin/spark-submit \
--class my.main.Class \
--name myAppName \
--master local[4] \
<app-jar>
◦ For a cluster: --master spark://some-cluster
Multi Language API Support
 Scala
 Java
 Python
 Clojure
Spark Shell
 YouTube – Word Count Example
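The classic shell word count chains flatMap → map → reduceByKey on an RDD of lines; the same logic can be sketched locally in plain Python (no pyspark required):

```python
# Local sketch of the classic Spark word count.
# In the Spark shell the equivalent chain would look roughly like:
#   sc.textFile(path).flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
from collections import defaultdict

lines = ["to be or not to be", "to thine own self be true"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: pair each word with 1, then sum per key
counts = defaultdict(int)
for w in words:
    counts[w] += 1

print(counts["to"])   # 3
print(counts["be"])   # 3
```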
Cassandra & Spark
 Cassandra cluster
◦ Bare metal vs. On the cloud
 DSE – DataStax Enterprise
◦ Cassandra & Spark in each node
 vs.
◦ Separate Cassandra and Spark clusters
Development with Spark
Where do I start from?!
 Download Spark as a package
◦ Run it in “local” mode (no need for a real cluster)
◦ “spark-ec2” scripts to ramp up a Standalone-mode cluster
◦ Amazon Elastic MapReduce (EMR)
 YARN vs. Mesos vs. Standalone
Running Environments
 Development – Testing – Production
◦ Don’t you need more?
◦ Be as flexible as you can
 Cluster Utilization
◦ Unified Cluster for all environments
 Vs.
◦ Cluster per Environment
 (Cluster per Data Center)
 Configuration
◦ Local Files vs. Distributed
Saving and Maintaining the Data
 Local file system – not effective in a distributed environment
 HDFS
◦ Might be very expensive
◦ Locality rules – Spark and the HDFS node on the same machine
 S3
◦ High latency and pretty slow, but low cost
 Cassandra
◦ Rigid data model
◦ Very fast, but depending on the data volume it can be expensive
DevOps – Keep It Simple, Stupid
 Linux
◦ Bash scripts
◦ Crontab
 Automation via Jenkins
 Continuous Deployment – with every Git push
(Pipeline diagram: Dev → Testing → Staging → Production, with automatic, daily, and manual promotion steps)
Build Automation
 Maven
◦ Sonatype Nexus artifact management
 -
◦ Deploy and Script generation scripts
◦ Per Environment Testing
◦ Data Validation
◦ Scheduled Tasks
Workflow Management
 Oozie – Very hard to integrate with Spark
◦ XML configuration based and not that convenient
 Azkaban (Haven’t tried it)
 Chosen:
◦ Luigi
◦ Crontab + Jenkins (KISS again)
Testing
 Unit
◦ JUnit tests that run on the Spark “Functions”
 End to End
◦ Simulate the full execution of an application on a single JVM (local mode) – real input, real output
 Functional
◦ Standalone application
◦ Running on the cluster
◦ Minimal coverage – shows a working data flow
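One practical pattern behind the unit-test bullet: keep the per-record logic in plain, Spark-free functions so it can be tested without a cluster. A sketch in Python with a hypothetical parse_record helper (not from any real codebase):

```python
# Sketch: unit-test the per-record logic of a Spark job without a
# cluster by keeping the "functions" pure. parse_record / is_passing
# are hypothetical examples.

def parse_record(line):
    """Parse a 'name,score' CSV line into a (name, int score) pair."""
    name, score = line.split(",")
    return name.strip(), int(score)

def is_passing(record, threshold=60):
    """Predicate as it might be used in a filter() transformation."""
    return record[1] >= threshold

# Plain asserts stand in for JUnit/pytest test cases:
assert parse_record("alice, 72") == ("alice", 72)
assert is_passing(("alice", 72))
assert not is_passing(("bob", 40))
```

The cluster-level end-to-end and functional tests then only need to cover wiring, not record-level edge cases.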
Logging
 Uses log4j by default (via slf4j)
 How to log correctly:
◦ Separate logs for different applications
◦ Driver and executors log to different locations
◦ YARN logging also exists (you might find problems there too)
 ELK Stack (Logstash – Elasticsearch – Kibana)
◦ Ship via Logstash shippers (intrusive) or a UDP socket appender (Log4j2)
◦ DO NOT use the regular blocking TCP Log4j appender
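A minimal log4j2.xml sketch of the UDP socket appender mentioned above (host and port are placeholders; verify attribute names against the Log4j2 documentation for your version):

```xml
<Configuration>
  <Appenders>
    <Socket name="logstash" host="logstash.example.com" port="5000"
            protocol="UDP">
      <PatternLayout pattern="%d %p %c - %m%n"/>
    </Socket>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="logstash"/>
    </Root>
  </Loggers>
</Configuration>
```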
Reporting and Monitoring
 Graphite
◦ Online application metrics
 Grafana
◦ Good Graphite visualization
 Jenkins - Monitoring
◦ Scheduled tests
◦ Validate result set of the applications
◦ Hung or stuck applications
◦ Failed applications
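Getting application metrics into Graphite is simple because its plaintext protocol is just "metric.path value timestamp" lines, normally sent to port 2003. A sketch of formatting such a line (the metric name is a hypothetical example):

```python
# Sketch: format a metric line for Graphite's plaintext protocol
# ("<metric.path> <value> <unix-timestamp>\n", normally sent over a
# socket to port 2003). The metric name below is hypothetical.
import time

def graphite_line(path, value, timestamp=None):
    ts = int(timestamp if timestamp is not None else time.time())
    return "%s %s %d\n" % (path, value, ts)

# With a fixed timestamp the output is fully deterministic:
line = graphite_line("myapp.jobs.completed", 42, timestamp=1443657600)
print(line)   # myapp.jobs.completed 42 1443657600
```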
Reporting and Monitoring
 Grafana + Graphite - Example
Summary
(Architecture diagram: external data sources flow through the analytics layers to the data output; Dev, Testing, Staging, and Production environments share the cluster, with an ELK stack alongside)
Conclusion
 Spark is a popular and very powerful distributed in-memory computation framework
 Broadly used, with lots of contributors
 A leading tool for the new world of petabytes of unexplored data
Questions?
Thanks,
Resources and Contact
 Demi Ben-Ari
◦ LinkedIn
◦ Twitter: @demibenari
◦ Blog: http://progexc.blogspot.com/
◦ Email: demi.benari@gmail.com
◦ “Big Things” Community
 Meetup, YouTube, Facebook, Twitter
