Apache Spark Intro
workshop
BigData Romania
Apache Spark Intro
★ Apache Spark history
★ RDD
★ Transformations
★ Actions
★ Hands-on session
Apache Spark History
https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/
Where to learn Spark?
http://spark.apache.org/
http://shop.oreilly.com/product/0636920028512.do
Spark architecture
Easy ways to run Spark?
★ your IDE (e.g. Eclipse or IntelliJ IDEA)
★ Standalone Deploy Mode: the simplest way to deploy Spark on a single machine
★ Docker & Zeppelin
★ EMR
★ Hadoop vendors (Cloudera, Hortonworks)
Supported languages
Spark basics
★ RDD
★ Operations: Transformations and Actions
RDD
An RDD is simply an immutable distributed collection of
objects!
(diagram: elements a–q spread across several partitions)
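The idea of a distributed collection can be sketched in plain Python (this is an illustration only, not Spark's implementation; the `partition` helper and its names are made up for this sketch):

```python
# Plain-Python sketch of an RDD: an immutable collection split into
# "partitions" that, in real Spark, would live on different workers.

def partition(elements, num_partitions):
    """Split a collection into roughly equal immutable chunks."""
    elements = tuple(elements)                   # tuples are immutable, like RDD contents
    size = -(-len(elements) // num_partitions)   # ceiling division
    return [elements[i:i + size] for i in range(0, len(elements), size)]

partitions = partition("abcdefghijklmnopq", 4)
# each chunk stands for the slice of data held by one worker
```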
Creating RDD (I)
Python
lines = sc.parallelize(["workshop", "spark"])
Scala
val lines = sc.parallelize(List("workshop", "spark"))
Java
JavaRDD<String> lines = sc.parallelize(Arrays.asList("workshop", "spark"));
Creating RDD (II)
Python
lines = sc.textFile("/path/to/file.txt")
Scala
val lines = sc.textFile("/path/to/file.txt")
Java
JavaRDD<String> lines = sc.textFile("/path/to/file.txt");
RDD persistence
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
MEMORY_ONLY_2
MEMORY_AND_DISK_2
OFF_HEAP
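The storage levels above trade memory for recomputation. A hedged plain-Python analogue of what `persist()` buys you (not Spark code; the counter is only there to make the recomputation visible):

```python
# Without persist(), every action re-runs the whole pipeline;
# with persist(), the first action materializes the result once.

compute_count = 0

def expensive_transform(data):
    global compute_count
    compute_count += 1            # count how often the work is redone
    return [x * 2 for x in data]

data = [1, 2, 3]

# no persist(): each "action" recomputes the pipeline
len(expensive_transform(data))
max(expensive_transform(data))
recomputations_without_cache = compute_count

# "persisted": compute once, reuse the materialized result
cached = expensive_transform(data)
len(cached)
max(cached)
recomputations_with_cache = compute_count - recomputations_without_cache
```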
Other data structures in Spark
★ Paired RDD
★ DataFrame
★ DataSet
Paired RDD
Paired RDD = an RDD of key/value pairs
(diagram: values user1 … user5 become key/value pairs id1/user1 … id5/user5)
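A paired RDD can be pictured in plain Python as a list of `(key, value)` tuples; the `reduce_by_key` helper below is an illustrative analogue only (real Spark reduces per partition and then shuffles by key):

```python
# Plain-Python sketch of a paired RDD and a reduceByKey-like merge.
pairs = [("id1", "user1"), ("id2", "user2"), ("id1", "user6")]

def reduce_by_key(pairs, fn):
    """Merge all values that share a key using fn."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return sorted(merged.items())

by_key = reduce_by_key(pairs, lambda a, b: a + "+" + b)
```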
Spark operations
(diagram: a DAG of RDDs 1–6; transformations produce new RDDs, and an action computes a result from the final RDD)
Transformations
Transformations describe how to transform an RDD into another RDD.
(diagram: RDD 1 → RDD 2)
Transformations
RDD {1,2,3,4,5,6}
map x => x + 1 → MapRDD {2,3,4,5,6,7}
filter x => x != 4 → FilterRDD {1,2,3,5,6}
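The slide's numbers can be reproduced with plain Python list comprehensions (an eager analogue only; Spark builds these lazily as new RDDs):

```python
# The transformation example from the slide, in plain Python.
rdd = [1, 2, 3, 4, 5, 6]
map_rdd = [x + 1 for x in rdd]           # map x => x + 1
filter_rdd = [x for x in rdd if x != 4]  # filter x => x != 4
```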
Popular transformations
★ map
★ filter
★ sample
★ union
★ distinct
★ groupByKey
★ reduceByKey
★ sortByKey
★ join
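A few of the transformations above, sketched with plain-Python analogues (illustrative semantics only, not Spark's distributed implementation; all names and sample values are made up):

```python
# union / distinct on plain lists
a = [1, 2, 2, 3]
b = [3, 4]
union_ab = a + b               # union keeps duplicates, like Spark's union
distinct_a = sorted(set(a))    # distinct removes them

# sortByKey / join on (key, value) pairs
pairs = [("u2", 5), ("u1", 3)]
by_key = sorted(pairs)         # sortByKey orders by the key
other = [("u1", "admin")]
joined = [(k, (v, w)) for k, v in pairs for k2, w in other if k == k2]
```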
Actions
Actions compute a result from an RDD!
RDD 1
Actions
InputRDD {1,2,3,4,5,6}
map x => x + 1 → MapRDD {2,3,4,5,6,7}
filter x => x != 4 → FilterRDD {1,2,3,5,6}
count() = 6, take(2) = {1,2}, saveAsTextFile()
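The action results shown on the slide, reproduced with a plain-Python analogue (Spark actions would run on distributed RDDs and trigger the whole lazy pipeline):

```python
# Actions from the slide, in plain Python.
input_rdd = [1, 2, 3, 4, 5, 6]
filter_rdd = [x for x in input_rdd if x != 4]

count_result = len(input_rdd)   # count() = 6
take_result = filter_rdd[:2]    # take(2) = [1, 2]
# saveAsTextFile() would write each element as one line on disk
```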
Popular actions
★ collect
★ count
★ first
★ take
★ takeSample
★ countByKey
★ saveAsTextFile
Transformations and Actions
(diagram: users → filter → administrators → take(3))
Transformations and Actions
(diagram: users → filter() → administrators → take(3), saveAsTextFile())
Transformations and Actions
(diagram: users → filter() → administrators, persist()-ed, then take(3) and saveAsTextFile())
Lazy evaluation
(diagram: users → filter → administrators → take(3); transformations run only when the action is called)
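Lazy evaluation can be sketched with Python generators (an analogue only, not Spark internals; the `log` list and sample names are made up to show which rows are actually read):

```python
# Nothing below executes until the "action" pulls results.
log = []

def users():
    for u in ["admin1", "user1", "admin2", "user2", "admin3", "admin4"]:
        log.append(u)            # record which rows were actually read
        yield u

admins = (u for u in users() if u.startswith("admin"))  # "filter": lazy
assert log == []                 # still nothing computed

first_three = []                 # "take(3)": pulls only what it needs
for u in admins:
    first_three.append(u)
    if len(first_three) == 3:
        break
# admin4 was never read: the pipeline stopped as soon as take(3) was satisfied
```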
How Spark Executes Your Program
Hands-on session
MovieLens
MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)
Download link: http://grouplens.org/datasets/movielens/
MovieLens dataset
user
user_id
age
gender
occupation
zipcode
user_rating
user_id
movie_id
rating
timestamp
movie
movie_id
title
release_date
video_release
imdb_url
genres...
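Parsing one record of each file can be sketched as below. The field order follows the schema above; the separators (`|` for users, tab for ratings) are an assumption based on the classic ml-100k layout, and the sample lines are made-up records in that shape:

```python
# Hedged sketch: turn one text line into a typed tuple, as you would
# inside rdd.map(parse_user) after sc.textFile(...).
def parse_user(line):
    user_id, age, gender, occupation, zipcode = line.split("|")
    return (int(user_id), int(age), gender, occupation, zipcode)

def parse_rating(line):
    user_id, movie_id, rating, timestamp = line.split("\t")
    return (int(user_id), int(movie_id), int(rating), int(timestamp))

user = parse_user("7|35|F|engineer|12345")          # hypothetical record
rating = parse_rating("7\t101\t4\t123456789")       # hypothetical record
```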
Exercises already solved!
★ Return only the users with occupation ‘administrator’
★ Increase the age of each user by one
★ Join user and rating datasets by user id
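The three solved exercises, sketched as plain-Python analogues (in Spark these would be `filter`, `map`, and a pair-RDD `join`; the sample tuples and field order are made up to match the schema slide):

```python
# (user_id, age, gender, occupation) and (user_id, movie_id, rating)
users = [(1, 24, "M", "administrator"), (2, 53, "F", "writer")]
ratings = [(1, 242, 3), (2, 302, 4)]

# 1. only users with occupation 'administrator'   (rdd.filter in Spark)
admins = [u for u in users if u[3] == "administrator"]

# 2. increase each user's age by one              (rdd.map in Spark)
older = [(uid, age + 1, g, occ) for uid, age, g, occ in users]

# 3. join users and ratings on user_id            (pairRDD.join in Spark)
joined = [(u, r) for u in users for r in ratings if u[0] == r[0]]
```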
Exercises to solve
★ How many men/women registered to MovieLens?
★ Distribution of age for males/females registered to MovieLens
★ Which movie titles have rating x?
★ Average rating by movie
★ Sort users by their occupation
Congrats if you reached this slide


Editor's Notes

  • #2 As a BigData/DataScience community, we ran a series of workshops (e.g. How to Think in MapReduce, Hive, Machine Learning) and meetups, with the goal of meeting each other and growing our knowledge in these fields. More and more companies in Cluj are starting to play with these technologies, so if you like this, I believe you will soon be working on some very cool and challenging projects. The first goal is to see how easily you can start working with Spark; the second is to see and try Spark's main functionality. In other words: in the next two hours we will show how easy it is to start working with Spark, first how you can use just your IDE (Eclipse or IDEA) without any cluster deployment, and second we will describe Spark's main functionality and try some practical examples.
  • #3 how Spark really works: https://www.quora.com/What-exactly-is-Apache-Spark-and-how-does-it-work https://spark.apache.org/research.html
  • #7 replace these with EMR, Hadoop vendors (Cloudera, Hortonworks), standalone mode, or just your IDE
  • #10 The RDD is the core concept in Spark: a Spark program is based on creating an RDD, transforming an RDD, or performing an action on an RDD to get results. Another RDD definition: "An RDD is a fault-tolerant collection of elements distributed across many servers on which we can perform parallel operations." TODO: why is an RDD immutable? (concurrency?)
  • #11 Spark provides two ways to create RDDs: loading an external dataset and parallelizing a collection
  • #12 Spark provides two ways to create RDDs: loading an external dataset and parallelizing a collection
  • #13 NOTES: the user can specify an RDD priority, i.e. which RDD should be spilled to disk first if memory starts to fill.
  • #14 paired RDD - an RDD made of tuple objects. DataFrame - an RDD made of Rows with an associated schema (similar to tables in SQL). DataSet - combines the best of RDDs and DataFrames.
  • #16 - explain this DAG using an example (filtering log files) - explain lazy evaluation - explain fault tolerance. Notes: an RDD does not need to be materialized at all times; it has enough information (its lineage) to be recomputed from data in stable storage. Coarse-grained transformations restrict RDDs to applications that do bulk write operations, but they give an easy fault-tolerance strategy.
  • #29 For this workshop we chose the MovieLens dataset to play with. Using this dataset we will learn how to read files with Spark, create RDDs, and apply common transformations and actions to them. The dataset contains three files: user, rating, and movie.