Apache Spark Intro
workshop
BigData Romania
Apache Spark Intro
★ Apache Spark history
★ RDD
★ Transformations
★ Actions
★ Hands-on session
Apache Spark History
https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/
Where to learn Spark?
http://spark.apache.org/
http://shop.oreilly.com/product/0636920028512.do
Spark architecture
Easy ways to run Spark?
★ your IDE (e.g. Eclipse or IntelliJ IDEA)
★ Standalone Deploy Mode: the simplest way to deploy Spark on a single machine
★ Docker & Zeppelin
★ EMR
★ Hadoop vendors (Cloudera, Hortonworks)
Supported languages
Spark basics
★ RDD
★ Operations: Transformations and Actions
RDD
An RDD is simply an immutable distributed collection of
objects!
(diagram: elements a–q spread across several partitions)
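The idea of a distributed collection can be sketched in plain Python (this is an illustration only, not Spark's implementation; the `partition` helper and its names are made up for this sketch):

```python
# Plain-Python sketch of an RDD: an immutable collection split into
# "partitions" that, in real Spark, would live on different workers.

def partition(elements, num_partitions):
    """Split a collection into roughly equal immutable chunks."""
    elements = tuple(elements)                   # tuples are immutable, like RDD contents
    size = -(-len(elements) // num_partitions)   # ceiling division
    return [elements[i:i + size] for i in range(0, len(elements), size)]

partitions = partition("abcdefghijklmnopq", 4)
# each chunk stands for the slice of data held by one worker
```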
Creating RDD (I)
Python
lines = sc.parallelize(["workshop", "spark"])
Scala
val lines = sc.parallelize(List("workshop", "spark"))
Java
JavaRDD<String> lines = sc.parallelize(Arrays.asList("workshop", "spark"));
Creating RDD (II)
Python
lines = sc.textFile("/path/to/file.txt")
Scala
val lines = sc.textFile("/path/to/file.txt")
Java
JavaRDD<String> lines = sc.textFile("/path/to/file.txt");
RDD persistence
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
MEMORY_ONLY_2
MEMORY_AND_DISK_2
OFF_HEAP
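The storage levels above trade memory for recomputation. A hedged plain-Python analogue of what `persist()` buys you (not Spark code; the counter is only there to make the recomputation visible):

```python
# Without persist(), every action re-runs the whole pipeline;
# with persist(), the first action materializes the result once.

compute_count = 0

def expensive_transform(data):
    global compute_count
    compute_count += 1            # count how often the work is redone
    return [x * 2 for x in data]

data = [1, 2, 3]

# no persist(): each "action" recomputes the pipeline
len(expensive_transform(data))
max(expensive_transform(data))
recomputations_without_cache = compute_count

# "persisted": compute once, reuse the materialized result
cached = expensive_transform(data)
len(cached)
max(cached)
recomputations_with_cache = compute_count - recomputations_without_cache
```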
Other data structures in Spark
★ Paired RDD
★ DataFrame
★ DataSet
Paired RDD
Paired RDD = an RDD of key/value pairs
(diagram: values user1 … user5 become key/value pairs id1/user1 … id5/user5)
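A paired RDD can be pictured in plain Python as a list of `(key, value)` tuples; the `reduce_by_key` helper below is an illustrative analogue only (real Spark reduces per partition and then shuffles by key):

```python
# Plain-Python sketch of a paired RDD and a reduceByKey-like merge.
pairs = [("id1", "user1"), ("id2", "user2"), ("id1", "user6")]

def reduce_by_key(pairs, fn):
    """Merge all values that share a key using fn."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return sorted(merged.items())

by_key = reduce_by_key(pairs, lambda a, b: a + "+" + b)
```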
Spark operations
(diagram: a DAG of RDDs 1–6; transformations produce new RDDs, and an action computes a result from the final RDD)
Transformations
Transformations describe how to transform an RDD into another RDD.
(diagram: RDD 1 → RDD 2)
Transformations
RDD {1,2,3,4,5,6}
map x => x + 1 → MapRDD {2,3,4,5,6,7}
filter x => x != 4 → FilterRDD {1,2,3,5,6}
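The slide's numbers can be reproduced with plain Python list comprehensions (an eager analogue only; Spark builds these lazily as new RDDs):

```python
# The transformation example from the slide, in plain Python.
rdd = [1, 2, 3, 4, 5, 6]
map_rdd = [x + 1 for x in rdd]           # map x => x + 1
filter_rdd = [x for x in rdd if x != 4]  # filter x => x != 4
```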
Popular transformations
★ map
★ filter
★ sample
★ union
★ distinct
★ groupByKey
★ reduceByKey
★ sortByKey
★ join
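A few of the transformations above, sketched with plain-Python analogues (illustrative semantics only, not Spark's distributed implementation; all names and sample values are made up):

```python
# union / distinct on plain lists
a = [1, 2, 2, 3]
b = [3, 4]
union_ab = a + b               # union keeps duplicates, like Spark's union
distinct_a = sorted(set(a))    # distinct removes them

# sortByKey / join on (key, value) pairs
pairs = [("u2", 5), ("u1", 3)]
by_key = sorted(pairs)         # sortByKey orders by the key
other = [("u1", "admin")]
joined = [(k, (v, w)) for k, v in pairs for k2, w in other if k == k2]
```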
Actions
Actions compute a result from an RDD!
RDD 1
Actions
InputRDD {1,2,3,4,5,6}
map x => x + 1 → MapRDD {2,3,4,5,6,7}
filter x => x != 4 → FilterRDD {1,2,3,5,6}
count() = 6, take(2) = {1,2}, saveAsTextFile()
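The action results shown on the slide, reproduced with a plain-Python analogue (Spark actions would run on distributed RDDs and trigger the whole lazy pipeline):

```python
# Actions from the slide, in plain Python.
input_rdd = [1, 2, 3, 4, 5, 6]
filter_rdd = [x for x in input_rdd if x != 4]

count_result = len(input_rdd)   # count() = 6
take_result = filter_rdd[:2]    # take(2) = [1, 2]
# saveAsTextFile() would write each element as one line on disk
```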
Popular actions
★ collect
★ count
★ first
★ take
★ takeSample
★ countByKey
★ saveAsTextFile
Transformations and Actions
(diagram: users → filter → administrators → take(3))
Transformations and Actions
(diagram: users → filter() → administrators → take(3), saveAsTextFile())
Transformations and Actions
(diagram: users → filter() → administrators, persist()-ed, then take(3) and saveAsTextFile())
Lazy evaluation
(diagram: users → filter → administrators → take(3); transformations run only when the action is called)
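Lazy evaluation can be sketched with Python generators (an analogue only, not Spark internals; the `log` list and sample names are made up to show which rows are actually read):

```python
# Nothing below executes until the "action" pulls results.
log = []

def users():
    for u in ["admin1", "user1", "admin2", "user2", "admin3", "admin4"]:
        log.append(u)            # record which rows were actually read
        yield u

admins = (u for u in users() if u.startswith("admin"))  # "filter": lazy
assert log == []                 # still nothing computed

first_three = []                 # "take(3)": pulls only what it needs
for u in admins:
    first_three.append(u)
    if len(first_three) == 3:
        break
# admin4 was never read: the pipeline stopped as soon as take(3) was satisfied
```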
How Spark Executes Your Program
Hands-on session
MovieLens
MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)
Download link: http://grouplens.org/datasets/movielens/
MovieLens dataset
user
user_id
age
gender
occupation
zipcode
user_rating
user_id
movie_id
rating
timestamp
movie
movie_id
title
release_date
video_release
imdb_url
genres...
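Parsing one record of each file can be sketched as below. The field order follows the schema above; the separators (`|` for users, tab for ratings) are an assumption based on the classic ml-100k layout, and the sample lines are made-up records in that shape:

```python
# Hedged sketch: turn one text line into a typed tuple, as you would
# inside rdd.map(parse_user) after sc.textFile(...).
def parse_user(line):
    user_id, age, gender, occupation, zipcode = line.split("|")
    return (int(user_id), int(age), gender, occupation, zipcode)

def parse_rating(line):
    user_id, movie_id, rating, timestamp = line.split("\t")
    return (int(user_id), int(movie_id), int(rating), int(timestamp))

user = parse_user("7|35|F|engineer|12345")          # hypothetical record
rating = parse_rating("7\t101\t4\t123456789")       # hypothetical record
```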
Exercises already solved!
★ Return only the users with occupation ‘administrator’
★ Increase the age of each user by one
★ Join user and rating datasets by user id
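The three solved exercises, sketched as plain-Python analogues (in Spark these would be `filter`, `map`, and a pair-RDD `join`; the sample tuples and field order are made up to match the schema slide):

```python
# (user_id, age, gender, occupation) and (user_id, movie_id, rating)
users = [(1, 24, "M", "administrator"), (2, 53, "F", "writer")]
ratings = [(1, 242, 3), (2, 302, 4)]

# 1. only users with occupation 'administrator'   (rdd.filter in Spark)
admins = [u for u in users if u[3] == "administrator"]

# 2. increase each user's age by one              (rdd.map in Spark)
older = [(uid, age + 1, g, occ) for uid, age, g, occ in users]

# 3. join users and ratings on user_id            (pairRDD.join in Spark)
joined = [(u, r) for u in users for r in ratings if u[0] == r[0]]
```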
Exercises to solve
★ How many men/women registered to MovieLens?
★ Distribution of age for males/females registered to MovieLens
★ Which movie titles have rating x?
★ Average rating by movie
★ Sort users by their occupation
Congrats if you reached this slide


Editor's Notes

  • #2 As a BigData/DataScience community, we ran a series of workshops (e.g. How to Think in MapReduce, Hive, Machine Learning) and meetups, with the goal of meeting each other and growing our knowledge in these fields. More and more companies in Cluj are starting to play with these technologies, so if you like this, I believe you will soon be working on some very cool and challenging projects. The first goal is to see how easily you can start working with Spark; the second is to see and try Spark's main functionality. In other words: in the next two hours we will show how easy it is to start working with Spark, first how you can use just your IDE (Eclipse or IDEA) without any cluster deployment, and second we will describe Spark's main functionality and try some practical examples.
  • #3 how Spark really works: https://www.quora.com/What-exactly-is-Apache-Spark-and-how-does-it-work https://spark.apache.org/research.html
  • #7 replace these with EMR, Hadoop vendors (Cloudera, Hortonworks), standalone mode, or just your IDE
  • #10 The RDD is the core concept in Spark: a Spark program is based on creating an RDD, transforming an RDD, or performing an action on an RDD to get results. Another RDD definition: "An RDD is a fault-tolerant collection of elements distributed across many servers on which we can perform parallel operations." TODO: why is an RDD immutable? (concurrency?)
  • #11 Spark provides two ways to create RDDs: loading an external dataset and parallelizing a collection
  • #12 Spark provides two ways to create RDDs: loading an external dataset and parallelizing a collection
  • #13 NOTES: the user can specify an RDD priority, i.e. which RDD should be spilled to disk first if memory starts to fill.
  • #14 paired RDD - an RDD made of tuple objects. DataFrame - an RDD made of Rows with an associated schema (similar to tables in SQL). DataSet - combines the best of RDDs and DataFrames.
  • #16 - explain this DAG using an example (filtering log files) - explain lazy evaluation - explain fault tolerance. Notes: an RDD does not need to be materialized at all times; it has enough information (its lineage) to be recomputed from data in stable storage. Coarse-grained transformations restrict RDDs to applications that do bulk write operations, but they give an easy fault-tolerance strategy.
  • #29 For this workshop we chose the MovieLens dataset to play with. Using this dataset we will learn how to read files with Spark, create RDDs, and apply common transformations and actions to them. The dataset contains three files: user, rating, and movie.