Introduction to
Apache Spark
LDC – Tech Talk
John H. T. de Godoi
jgodoi1
What is Apache Spark?
• It is a high-performance, distributed, in-memory processing engine that allows working with Big Data iteratively and interactively.
• It offers a way of creating a cluster that allows distributed code to be executed.
• It also offers an easy way to write that code through a framework with a functional API.
Components of Spark
• Spark ships with several components (e.g. Spark SQL, Spark Streaming, MLlib, GraphX)
• Spark Core is the base for all the others
• The other components offer features that help apply the Spark engine to a specific purpose
Functional API
• In the functional paradigm, the function is the basic unit of computation
• In the functional paradigm you do not change state or keep mutable data
• An operation generates a new instance containing the results
• A common characteristic is recursion as an important tool for iteration
• Pure and higher-order functions (see the sketch below)
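• A minimal illustrative sketch in Scala (not from the slides; the names applyTwice, original and doubled are made up): a pure higher-order function takes another function as an argument and never mutates its input.
// applyTwice is higher-order (takes a function) and pure (no side effects)
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

val original = List(1, 2, 3)
val doubled  = original.map(_ * 2) // returns a new list, original is untouched

println(applyTwice(_ + 1, 10)) // 12
println(original)              // List(1, 2, 3)
println(doubled)               // List(2, 4, 6)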
Map Reduce in Spark
• MapReduce, mostly known from the Hadoop MapReduce framework, is a way of programming parallel and distributed code that divides the processing into tasks we classify as transformations (map) and actions (reduce).
• In Spark those concepts also apply: transformation functions are lazy, so their calls are only stacked into an execution plan, while an action launches the execution of the whole plan, including the action itself (see the sketch below).
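• A minimal sketch (assuming an existing SparkContext sc, created as on the next slides) showing that transformations only build up the plan and the action launches it:
// map and filter are transformations: they are only recorded, not executed
val words   = sc.parallelize(Seq("spark", "hadoop", "spark"))
val lengths = words.map(w => w.length)
val longer  = lengths.filter(len => len > 4)

// reduce is an action: it triggers the whole chain above
val total = longer.reduce(_ + _)
println(total) // 16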
Resilient Distributed Datasets
• Data structure in Apache Spark that keeps immutable data distributed across the Spark cluster
• It is resilient because Apache Spark keeps the information needed to recompute any data that is lost (see the sketch below)
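• A small sketch (assuming an existing SparkContext sc): every RDD keeps its lineage, the chain of transformations needed to recompute lost partitions, which can be inspected with toDebugString:
val rdd = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 3 == 0)
// prints the chain of parent RDDs Spark would replay to rebuild lost data
println(rdd.toDebugString)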
SparkContext
and SparkConf
• To work with Apache Spark you must create a context before anything else
• It provides ways of creating RDDs and other structures
• To create a SparkContext you need a SparkConf, which contains configuration information about your application
import org.apache.spark.{SparkConf, SparkContext}

// configuration: application name and master ("local" runs Spark in-process)
val conf = new SparkConf()
  .setAppName("wordCount")
  .setMaster("local")
val sc = new SparkContext(conf)
sc.setLogLevel("OFF")

// RDD built from an in-memory collection
val nums = sc.parallelize(List(1, 2, 3, 4))
// RDD built from a text file, one element per line
var file = sc.textFile("build.sbt")
Transformations
• map
• flatMap
• filter
• union
• intersection
• distinct
• coalesce
• cache*
• persist
// transformations are lazy: nothing has been computed yet
val squared = nums.map(x => x * x)
file = file
  .filter(x => !x.isEmpty)         // drop empty lines
  .filter(x => x.contains(":="))   // keep only the sbt assignment lines
  .map(x => x.split(":=")(1).trim) // keep just the assigned value
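• A small sketch of a few other transformations from the list above, reusing nums and sc from the previous slide (the lines and tokens names are made up for illustration):
val lines  = sc.parallelize(Seq("to be", "or not", "to be"))
val tokens = lines.flatMap(line => line.split(" "))  // one element per word
val unique = tokens.distinct()                       // remove duplicate words
val more   = nums.union(sc.parallelize(List(4, 5)))  // still lazy at this point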
Actions
• foreach
• collect
• reduce
• count
• first
• take
• saveAsTextFile
// actions trigger the execution of all the pending transformations
println(squared.sum()) // 1 + 4 + 9 + 16 = 30.0
file.foreach(str => println(str.toUpperCase()))
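• A small sketch of some other actions from the list above, reusing nums and squared from the previous slides (the output path is made up):
println(nums.count())                   // 4
println(nums.reduce(_ + _))             // 10
println(squared.take(2).mkString(", ")) // 1, 4
// writes one text file per partition; fails if the path already exists
squared.saveAsTextFile("/tmp/squared")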
Ways of working with Spark
• spark-shell (EDL)
• spark-submit (EDL)
• Databricks (not authorized in JnJ)
• Amazon EMR (VPCX)
Look and feel
