Introduction to
Apache Spark
LDC – Tech Talk
John H. T. de Godoi
jgodoi1
What is Apache Spark?
• It is a high-performance, distributed, in-memory processing engine that allows working with Big Data iteratively and interactively.
• It offers a way of creating a cluster that allows distributed code to be executed.
• It also offers an easy way to write that code through a framework with a functional API.
Components of Spark
• Spark ships with several components (e.g. Spark SQL, Spark Streaming, MLlib, GraphX)
• Spark Core is the base for all the others
• The other components offer features that help apply the Spark engine to a specific purpose
Functional API
• In the functional paradigm, the function is the basic unit of computation
• In the functional paradigm you do not change state or keep mutable data
• An operation generates a new instance containing the results
• A common characteristic is recursion as an important tool for iteration
• Pure and higher-order functions (see the sketch below)
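• A minimal illustrative sketch in Scala (not from the slides; the names applyTwice, original and doubled are made up): a pure higher-order function takes another function as an argument and never mutates its input.
// applyTwice is higher-order (takes a function) and pure (no side effects)
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

val original = List(1, 2, 3)
val doubled  = original.map(_ * 2) // returns a new list, original is untouched

println(applyTwice(_ + 1, 10)) // 12
println(original)              // List(1, 2, 3)
println(doubled)               // List(2, 4, 6)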
Map Reduce in Spark
• MapReduce, mostly known from the Hadoop MapReduce framework, is a way of programming parallel and distributed code that divides the processing into tasks we classify as transformations (map) and actions (reduce).
• In Spark those concepts also apply: transformation functions are lazy, so their calls are only stacked into an execution plan, while an action launches the execution of the whole plan, including the action itself (see the sketch below).
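• A minimal sketch (assuming an existing SparkContext sc, created as on the next slides) showing that transformations only build up the plan and the action launches it:
// map and filter are transformations: they are only recorded, not executed
val words   = sc.parallelize(Seq("spark", "hadoop", "spark"))
val lengths = words.map(w => w.length)
val longer  = lengths.filter(len => len > 4)

// reduce is an action: it triggers the whole chain above
val total = longer.reduce(_ + _)
println(total) // 16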
Resilient Distributed Datasets
• Data structure in Apache Spark that keeps immutable data distributed across the Spark cluster
• It is resilient because Apache Spark keeps the information needed to recompute any data that is lost (see the sketch below)
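• A small sketch (assuming an existing SparkContext sc): every RDD keeps its lineage, the chain of transformations needed to recompute lost partitions, which can be inspected with toDebugString:
val rdd = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 3 == 0)
// prints the chain of parent RDDs Spark would replay to rebuild lost data
println(rdd.toDebugString)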
SparkContext
and SparkConf
• To work with Apache Spark you must create a context before anything else
• It provides ways of creating RDDs and other structures
• To create a SparkContext you need a SparkConf, which contains configuration information about your application
import org.apache.spark.{SparkConf, SparkContext}

// configuration: application name and master ("local" runs Spark in-process)
val conf = new SparkConf()
  .setAppName("wordCount")
  .setMaster("local")
val sc = new SparkContext(conf)
sc.setLogLevel("OFF")

// RDD built from an in-memory collection
val nums = sc.parallelize(List(1, 2, 3, 4))
// RDD built from a text file, one element per line
var file = sc.textFile("build.sbt")
Transformations
• map
• flatMap
• filter
• union
• intersection
• distinct
• coalesce
• cache*
• persist
// transformations are lazy: nothing has been computed yet
val squared = nums.map(x => x * x)
file = file
  .filter(x => !x.isEmpty)         // drop empty lines
  .filter(x => x.contains(":="))   // keep only the sbt assignment lines
  .map(x => x.split(":=")(1).trim) // keep just the assigned value
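• A small sketch of a few other transformations from the list above, reusing nums and sc from the previous slide (the lines and tokens names are made up for illustration):
val lines  = sc.parallelize(Seq("to be", "or not", "to be"))
val tokens = lines.flatMap(line => line.split(" "))  // one element per word
val unique = tokens.distinct()                       // remove duplicate words
val more   = nums.union(sc.parallelize(List(4, 5)))  // still lazy at this point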
Actions
• foreach
• collect
• reduce
• count
• first
• take
• saveAsTextFile
// actions trigger the execution of all the pending transformations
println(squared.sum()) // 1 + 4 + 9 + 16 = 30.0
file.foreach(str => println(str.toUpperCase()))
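• A small sketch of some other actions from the list above, reusing nums and squared from the previous slides (the output path is made up):
println(nums.count())                   // 4
println(nums.reduce(_ + _))             // 10
println(squared.take(2).mkString(", ")) // 1, 4
// writes one text file per partition; fails if the path already exists
squared.saveAsTextFile("/tmp/squared")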
Ways of working with Spark
• spark-shell (EDL)
• spark-submit (EDL)
• Databricks (not authorized in JnJ)
• Amazon EMR (VPCX)
Look and feel
