MongoDB & Spark
Level Setting
Trough of Disillusionment
[Diagram builds: the Hadoop stack, one layer per slide]
HDFS: Distributed Data
HDFS + YARN: Distributed Resources
HDFS + YARN + MapReduce: Distributed Processing
HDFS + YARN + MapReduce + Hive & Pig: Domain Specific Languages
Interactive Shell
Easy (-er)
Caching
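"Caching" here refers to Spark's ability to pin an RDD in memory so repeated actions reuse it instead of recomputing. A minimal sketch, assuming a JavaSparkContext named sc and a hypothetical input path:

// Keep an RDD in memory so later actions reuse it instead of recomputing
JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log"); // hypothetical path
lines.cache();                 // shorthand for persist(StorageLevel.MEMORY_ONLY())
long total = lines.count();    // first action computes and caches the RDD
long again = lines.count();    // second action is served from memory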
[Diagram builds: the same stack, rebuilt around Spark]
HDFS: Distributed Data
Stand Alone / YARN / Mesos: Distributed Resources
Spark alongside Hadoop MapReduce: Distributed Processing
Hive, Pig, Spark SQL, Spark Shell, Spark Streaming: on top
[Final build: Spark does not require HDFS — it runs Stand Alone or on YARN or Mesos, with Spark SQL, the Spark Shell, and Spark Streaming above it]
[Diagram: Spark runtime — a Driver coordinating executors on Worker Nodes]
Resilient Distributed Datasets
Parallelization
parallelize = x → t(x) = x′ → t(x′) = x′′ → f(x′′) = x′′′
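The parallelize step above is where a local collection becomes an RDD. A minimal sketch in the Java API (class name and sample data are illustrative):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelizeExample {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
            .setAppName("parallelize-example")
            .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // parallelize = x: distribute a local collection as an RDD
        JavaRDD<Integer> x = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        System.out.println(x.count()); // 5
        sc.close();
    }
}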
Transformations
filter( func )
union( otherRDD )
intersection( otherRDD )
distinct( [numPartitions] )
map( func )
parallelize = x → t(x) = x′ → t(x′) = x′′ → f(x′′) = x′′′
Actions
collect()
count()
first()
take( n )
reduce( function )
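A hedged sketch of the transformation/action split, continuing from the parallelize example above (Java 8 lambdas stand in for the anonymous Function classes used later in this deck):

import java.util.Arrays;
import java.util.List;

JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

// Transformations are lazy: these lines only build a plan, nothing runs yet
JavaRDD<Integer> doubled = numbers.map(n -> n * 2);
JavaRDD<Integer> large = doubled.filter(n -> n > 6);

// Actions force execution of the whole chain
long howMany = large.count();           // 3
List<Integer> values = large.collect(); // [8, 10, 12]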
parallelize = x → f(x) = x′ → f(x′) = x′′ → t(x′′) = x′′′
Lineage
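Lineage is what makes RDDs resilient: each RDD remembers the chain of transformations that produced it, so a lost partition can be recomputed rather than replicated. Spark exposes that chain directly (continuing the sketch above):

// Prints the chain of transformations Spark would replay after a failure
System.out.println(large.toDebugString());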
https://github.com/mongodb/mongo-hadoop
{
    "_id" : ObjectId("4f16fc97d1e2d32371003e27"),
    "body" : "the scrimmage is still up in the air. ...",
    "subFolder" : "notes_inbox",
    "mailbox" : "bass-e",
    "filename" : "450.",
    "headers" : {
        "X-cc" : "",
        "From" : "michael.simmons@enron.com",
        "Subject" : "Re: Plays and other information",
        "X-Folder" : "\\Eric_Bass_Dec2000\\Notes Folders\\Notes inbox",
        "Content-Transfer-Encoding" : "7bit",
        "X-bcc" : "",
        "To" : "eric.bass@enron.com",
        "X-Origin" : "Bass-E",
        "X-FileName" : "ebass.nsf",
        "X-From" : "Michael Simmons",
        "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)",
        "X-To" : "Eric Bass",
        "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>",
        "Content-Type" : "text/plain; charset=us-ascii",
        "Mime-Version" : "1.0"
    }
}
{
"_id" : "gretchen.hardeway@enron.com|shirley.crenshaw@enron.com",
"value" : 2
}
{
"_id" : "kmccomb@austin-mccomb.com|brian@enron.com",
"value" : 2
}
{
"_id" : "sally.beck@enron.com|sandy.stone@enron.com",
"value" : 2
}
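These documents read like per-pair message counts keyed on "sender|recipient". A sketch of how they could be computed with Spark, assuming the documents RDD constructed in the Spark Context section below, and ignoring multi-recipient To headers for brevity:

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.bson.BSONObject;

// Key each message on "from|to", then sum the 1s per key
JavaPairRDD<String, Integer> pairCounts = documents
    .mapToPair(tuple -> {
        BSONObject headers = (BSONObject) tuple._2.get("headers");
        String key = headers.get("From") + "|" + headers.get("To");
        return new Tuple2<>(key, 1);
    })
    .reduceByKey((a, b) -> a + b);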
Eratosthenes
Democritus
Hypatia
Shemp
Euripides
Spark Configuration
Configuration conf = new Configuration();
conf.set(
    "mongo.job.input.format",
    "com.mongodb.hadoop.MongoInputFormat"
);
conf.set(
    "mongo.input.uri",
    "mongodb://localhost:27017/db.collection"
);
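The speaker notes mention that BSONFileInputFormat can read static BSON snapshots instead of a live mongod; the swap is a one-line change. The input-path key and path below are assumptions and may vary by Hadoop version:

// Read from a mongodump .bson snapshot rather than a live MongoDB instance
conf.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
conf.set("mapred.input.dir", "file:///dumps/db/collection.bson"); // hypothetical path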
Spark Context
JavaPairRDD<Object, BSONObject> documents =
    context.newAPIHadoopRDD(
        conf,
        MongoInputFormat.class,
        Object.class,
        BSONObject.class
    );
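The context variable above is a JavaSparkContext; the deck doesn't show its construction, but a minimal version looks like this:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("mongo-spark-example");
JavaSparkContext context = new JavaSparkContext(sparkConf);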
[Diagram: Data Services — a sharded MongoDB cluster fronted by mongos routers]
Deployment Artifacts
Hadoop Connector Jar
Fat Jar
Java Driver Jar
Spark Submit
/usr/local/spark-1.5.1/bin/spark-submit \
  --class com.mongodb.spark.examples.DataframeExample \
  --master local Examples-1.0-SNAPSHOT.jar
JavaRDD<Message> messages = documents.map(
    new Function<Tuple2<Object, BSONObject>, Message>() {
        public Message call(Tuple2<Object, BSONObject> tuple) {
            // tuple._2 is the full document; headers is an embedded subdocument
            BSONObject header = (BSONObject) tuple._2.get("headers");
            Message m = new Message();
            m.setTo((String) header.get("To"));
            m.setX_From((String) header.get("From"));
            m.setMessage_ID((String) header.get("Message-ID"));
            m.setBody((String) tuple._2.get("body"));
            return m;
        }
    }
);
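The connector can also write RDDs back to MongoDB via MongoOutputFormat. A sketch, under the assumption that a hypothetical results RDD has been converted back to (key, BSONObject) pairs; the output URI and the unused path argument are illustrative:

import com.mongodb.hadoop.MongoOutputFormat;

Configuration outputConfig = new Configuration();
outputConfig.set(
    "mongo.output.uri",
    "mongodb://localhost:27017/db.output"
);

// MongoOutputFormat ignores the file path, but the API requires one
results.saveAsNewAPIHadoopFile(
    "file:///unused",
    Object.class,
    BSONObject.class,
    MongoOutputFormat.class,
    outputConfig
);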
& Spark
DEMO
THANKS!
{ Name: 'Bryan Reinero',
  Title: 'Developer Advocate',
  Twitter: '@blimpyacht',
  Email: 'bryan@mongodb.com' }


Editor's Notes

  • #29 A fault-tolerant collection of elements operated on in parallel; best suited for batch applications
  • #45 MongoInputFormat allows us to read from a live MongoDB instance. We could also use BSONFileInputFormat to read BSON snapshots.