Reactive Dashboards Using
Apache Spark
Rahul Kumar
Software Developer
@rahul_kumar_aws
LinuxCon, CloudOpen, ContainerCon North America 2015
Agenda
• Dashboards
• Big Data Introduction
• Apache Spark
• Introduction to Reactive Applications
• Reactive Platform
• Live Demo
Dashboards
A dashboard is a visual display of the most important
information needed to achieve one or more objectives;
consolidated and arranged on a single screen so the information
can be monitored at a glance*.
* Stephen Few’s definition of a dashboard
Key characteristics of a dashboard
• All components should fit on a single screen.
• Interactivity such as filtering and drill-down can be offered.
• The displayed data updates automatically, without any assistance from the user.
[Screenshot: Google Analytics dashboard (image source: Google Image Search)]
[Screenshot: AWS CloudWatch dashboard (image source: Google Image Search)]
[Screenshot: Google Compute Engine dashboard]
A typical database application
Such an application must handle:
• Multi-source data ingestion
• GBs to petabytes of data
• Sub-second response
• Real-time updates
• Scalability
Three V’s of Big Data
• Volume
• Velocity
• Variety
Scale vertically (scale up)
Scale horizontally (scale out)
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
• Speed
• Easy to Use
• Generality
• Runs Everywhere
File format support
• CSV
• TSV
• JSON
• ORC
…and many more.
Apache Spark Stack
Spark Log Analysis
• Apache Spark Setup
• Interaction with Spark Shell
• Setup a Spark App
• RDD Introduction
• Deploy Spark app on Cluster
Prerequisites for cluster setup
Spark runs on Java 6+, Python 2.6+, and R 3.1+.
For the Scala API, Spark 1.4.1 uses Scala 2.10.
Java 8
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
Scala 2.10.4
$ wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
$ tar -xvzf scala-2.10.4.tgz
$ vim ~/.bashrc
export SCALA_HOME=/home/ubuntu/scala-2.10.4
export PATH=$PATH:$SCALA_HOME/bin
Spark Cluster
Spark Setup
http://spark.apache.org/downloads.html
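A quick sketch of unpacking the prebuilt binary (assuming the spark-1.4.1-bin-hadoop2.6 package used in the following slides has been downloaded from the page above):
$ tar -xvzf spark-1.4.1-bin-hadoop2.6.tgz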
Running Spark Example & Shell
$ cd spark-1.4.1-bin-hadoop2.6
$ ./bin/run-example SparkPi 10
$ ./bin/spark-shell --master local[2]
The --master option specifies the master URL for a distributed cluster, local to run locally with one thread, or local[N] to run locally with N threads.
RDD Introduction
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
An RDD shards its data over the cluster, like a virtualized, distributed collection.
Users create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects such as a List or Map.
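For example, both creation styles in the Spark shell (a quick sketch; people.txt is a hypothetical input file):
scala> val fromCollection = sc.parallelize(List(1, 2, 3, 4, 5))  // distribute a local collection
scala> val fromFile = sc.textFile("people.txt")                  // load an external dataset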
RDD Operations
RDDs support two types of operations: transformations and actions.
Spark computes RDDs lazily: computation starts only when an action is called on an RDD.
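A small sketch of this laziness in the shell: map only records the transformation; nothing is computed until the count action runs.
scala> val nums = sc.parallelize(1 to 1000)
scala> val squares = nums.map(n => n * n)  // transformation: lazy, just builds lineage
scala> squares.count()                     // action: triggers the actual computation
res0: Long = 1000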
● Simple SBT project setup: https://github.com/rahulkumar-aws/HelloWorld
$ mkdir HelloWorld
$ cd HelloWorld
$ mkdir -p src/main/scala
$ mkdir -p src/main/resources
$ mkdir -p src/test/scala
$ vim build.sbt
name := "HelloWorld"
version := "1.0"
scalaVersion := "2.10.4"
$ mkdir project
$ cd project
$ vim build.properties
sbt.version=0.13.8
$ cd ..
$ vim src/main/scala/HelloWorld.scala
object HelloWorld { def main(args: Array[String]) = println("HelloWorld!") }
$ sbt run
First Spark Application
$ git clone https://github.com/rahulkumar-aws/WordCount.git

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "SparkWordCount")
    val wordsCounted = sc.textFile(args(0)).map(line => line.toLowerCase)
      .flatMap(line => line.split("""\W+"""))           // split on non-word characters
      .groupBy(word => word)
      .map { case (word, group) => (word, group.size) } // (word, count) pairs
    wordsCounted.saveAsTextFile(args(1))
    sc.stop()
  }
}

$ sbt "run-main SparkWordCount src/main/resources/sherlockholmes.txt out"
Launching Spark on a Cluster
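A minimal sketch, assuming a standalone cluster whose master runs at spark://master:7077 and an application jar produced by sbt package (the jar name below is illustrative):
$ ./sbin/start-master.sh
$ ./sbin/start-slave.sh spark://master:7077
$ ./bin/spark-submit --class SparkWordCount --master spark://master:7077 \
    target/scala-2.10/wordcount_2.10-1.0.jar input.txt out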
Spark Cache Introduction
Spark supports pulling data sets into a cluster-wide in-memory cache.
scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at textFile at <console>:21
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at filter at <console>:23
scala> linesWithSpark.cache()
res11: linesWithSpark.type = MapPartitionsRDD[13] at filter at <console>:23
scala> linesWithSpark.count()
res12: Long = 19
Spark SQL Introduction
Spark SQL is Spark's module for working with structured data.
● Mix SQL queries with Spark programs.
● Uniform data access: connect to any data source.
● DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC.
● Hive compatibility: run unmodified Hive queries on existing data.
● Connect through JDBC or ODBC.
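A short sketch against the Spark 1.4 DataFrame API (people.json is a hypothetical input file):
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val df = sqlContext.read.json("people.json")  // load JSON into a DataFrame
scala> df.registerTempTable("people")                // expose the DataFrame to SQL
scala> sqlContext.sql("SELECT name, age FROM people WHERE age > 21").show()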
Spark Streaming Introduction
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant
stream processing of live data streams.
$ git clone https://github.com/rahulkumar-aws/WordCount.git
$ nc -lk 9999
$ sbt "run-main StreamingWordCount"
Reactive Application
• Responsive
• Resilient
• Elastic
• Event Driven
http://www.reactivemanifesto.org
Typesafe Reactive Platform
Play Framework
The High Velocity Web Framework For Java and Scala
● RESTful by default
● JSON is a first class citizen
● WebSockets, Comet, EventSource
● Extensive NoSQL & Big Data Support
https://www.playframework.com/download
https://downloads.typesafe.com/typesafe-activator/1.3.5/typesafe-activator-1.3.5-minimal.zip
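As a sketch of how a dashboard endpoint can serve live data, a minimal Play 2.4-style controller returning JSON (the route name and metric values are illustrative; a real app would pull them from Spark):

package controllers

import play.api.mvc._
import play.api.libs.json.Json

class Dashboard extends Controller {
  // Illustrative numbers; in practice these come from a Spark job or a cache.
  def metrics = Action {
    Ok(Json.obj("activeUsers" -> 42, "eventsPerSec" -> 1337))
  }
}

For continuously updating dashboards, Play's WebSocket and EventSource support can push the same JSON to the browser as new batches arrive.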
Akka
Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient
message-driven applications on the JVM.
● Simple Concurrency & Distribution
● Resilient by Design
● High Performance
● Elastic & Decentralised
● Extensible
Akka uses the Actor Model, which raises the abstraction level and provides a better platform for building scalable, resilient, and responsive applications.
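A minimal actor sketch on the Akka 2.3 Scala API (all names are illustrative):

import akka.actor.{Actor, ActorSystem, Props}

class Greeter extends Actor {
  def receive = {
    case name: String => println(s"Hello, $name!")  // react to each incoming message
  }
}

object Main extends App {
  val system = ActorSystem("demo")
  val greeter = system.actorOf(Props[Greeter], "greeter")
  greeter ! "LinuxCon"  // fire-and-forget, asynchronous message send
  Thread.sleep(500)     // crude wait so the message is processed before shutdown
  system.shutdown()     // Akka 2.3-era shutdown
}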
Demo
References
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
http://spark.apache.org/docs/latest/quick-start.html
Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (O'Reilly)
https://www.playframework.com/documentation/2.4.x/Home
http://doc.akka.io/docs/akka/2.3.12/scala.html
Thank You
Rahul Kumar rahul.k@sigmoid.com @rahul_kumar_aws
