MongoDB & Spark
Level Setting
Trough of Disillusionment
[Diagram builds: the Hadoop stack, one layer per slide]
HDFS: Distributed Data
HDFS + YARN: Distributed Resources
HDFS + YARN + MapReduce: Distributed Processing
HDFS + YARN + MapReduce + Hive & Pig: Domain Specific Languages
Interactive Shell
Easy (-er)
Caching
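"Caching" here refers to Spark's ability to pin an RDD in memory so repeated actions reuse it instead of recomputing. A minimal sketch, assuming a JavaSparkContext named sc and a hypothetical input path:

// Keep an RDD in memory so later actions reuse it instead of recomputing
JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log"); // hypothetical path
lines.cache();                 // shorthand for persist(StorageLevel.MEMORY_ONLY())
long total = lines.count();    // first action computes and caches the RDD
long again = lines.count();    // second action is served from memory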
[Diagram builds: the same stack, rebuilt around Spark]
HDFS: Distributed Data
Stand Alone / YARN / Mesos: Distributed Resources
Spark alongside Hadoop MapReduce: Distributed Processing
Hive, Pig, Spark SQL, Spark Shell, Spark Streaming: on top
[Final build: Spark does not require HDFS — it runs Stand Alone or on YARN or Mesos, with Spark SQL, the Spark Shell, and Spark Streaming above it]
[Diagram: Spark runtime — a Driver coordinating executors on Worker Nodes]
Resilient Distributed Datasets
Parallelization
parallelize = x → t(x) = x′ → t(x′) = x′′ → f(x′′) = x′′′
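The parallelize step above is where a local collection becomes an RDD. A minimal sketch in the Java API (class name and sample data are illustrative):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelizeExample {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
            .setAppName("parallelize-example")
            .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // parallelize = x: distribute a local collection as an RDD
        JavaRDD<Integer> x = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        System.out.println(x.count()); // 5
        sc.close();
    }
}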
Transformations
filter( func )
union( otherRDD )
intersection( otherRDD )
distinct( [numPartitions] )
map( func )
parallelize = x → t(x) = x′ → t(x′) = x′′ → f(x′′) = x′′′
Actions
collect()
count()
first()
take( n )
reduce( function )
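A hedged sketch of the transformation/action split, continuing from the parallelize example above (Java 8 lambdas stand in for the anonymous Function classes used later in this deck):

import java.util.Arrays;
import java.util.List;

JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

// Transformations are lazy: these lines only build a plan, nothing runs yet
JavaRDD<Integer> doubled = numbers.map(n -> n * 2);
JavaRDD<Integer> large = doubled.filter(n -> n > 6);

// Actions force execution of the whole chain
long howMany = large.count();           // 3
List<Integer> values = large.collect(); // [8, 10, 12]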
parallelize = x → f(x) = x′ → f(x′) = x′′ → t(x′′) = x′′′
Lineage
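Lineage is what makes RDDs resilient: each RDD remembers the chain of transformations that produced it, so a lost partition can be recomputed rather than replicated. Spark exposes that chain directly (continuing the sketch above):

// Prints the chain of transformations Spark would replay after a failure
System.out.println(large.toDebugString());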
https://github.com/mongodb/mongo-hadoop
{
    "_id" : ObjectId("4f16fc97d1e2d32371003e27"),
    "body" : "the scrimmage is still up in the air. ...",
    "subFolder" : "notes_inbox",
    "mailbox" : "bass-e",
    "filename" : "450.",
    "headers" : {
        "X-cc" : "",
        "From" : "michael.simmons@enron.com",
        "Subject" : "Re: Plays and other information",
        "X-Folder" : "\\Eric_Bass_Dec2000\\Notes Folders\\Notes inbox",
        "Content-Transfer-Encoding" : "7bit",
        "X-bcc" : "",
        "To" : "eric.bass@enron.com",
        "X-Origin" : "Bass-E",
        "X-FileName" : "ebass.nsf",
        "X-From" : "Michael Simmons",
        "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)",
        "X-To" : "Eric Bass",
        "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>",
        "Content-Type" : "text/plain; charset=us-ascii",
        "Mime-Version" : "1.0"
    }
}
{
"_id" : "gretchen.hardeway@enron.com|shirley.crenshaw@enron.com",
"value" : 2
}
{
"_id" : "kmccomb@austin-mccomb.com|brian@enron.com",
"value" : 2
}
{
"_id" : "sally.beck@enron.com|sandy.stone@enron.com",
"value" : 2
}
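These documents read like per-pair message counts keyed on "sender|recipient". A sketch of how they could be computed with Spark, assuming the documents RDD constructed in the Spark Context section below, and ignoring multi-recipient To headers for brevity:

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.bson.BSONObject;

// Key each message on "from|to", then sum the 1s per key
JavaPairRDD<String, Integer> pairCounts = documents
    .mapToPair(tuple -> {
        BSONObject headers = (BSONObject) tuple._2.get("headers");
        String key = headers.get("From") + "|" + headers.get("To");
        return new Tuple2<>(key, 1);
    })
    .reduceByKey((a, b) -> a + b);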
Eratosthenes
Democritus
Hypatia
Shemp
Euripides
Spark Configuration
Configuration conf = new Configuration();
conf.set(
    "mongo.job.input.format",
    "com.mongodb.hadoop.MongoInputFormat"
);
conf.set(
    "mongo.input.uri",
    "mongodb://localhost:27017/db.collection"
);
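The speaker notes mention that BSONFileInputFormat can read static BSON snapshots instead of a live mongod; the swap is a one-line change. The input-path key and path below are assumptions and may vary by Hadoop version:

// Read from a mongodump .bson snapshot rather than a live MongoDB instance
conf.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
conf.set("mapred.input.dir", "file:///dumps/db/collection.bson"); // hypothetical path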
Spark Context
JavaPairRDD<Object, BSONObject> documents =
    context.newAPIHadoopRDD(
        conf,
        MongoInputFormat.class,
        Object.class,
        BSONObject.class
    );
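The context variable above is a JavaSparkContext; the deck doesn't show its construction, but a minimal version looks like this:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("mongo-spark-example");
JavaSparkContext context = new JavaSparkContext(sparkConf);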
[Diagram: Data Services — a sharded MongoDB cluster fronted by mongos routers]
Deployment Artifacts
Hadoop Connector Jar
Fat Jar
Java Driver Jar
Spark Submit
/usr/local/spark-1.5.1/bin/spark-submit \
  --class com.mongodb.spark.examples.DataframeExample \
  --master local Examples-1.0-SNAPSHOT.jar
JavaRDD<Message> messages = documents.map(
    new Function<Tuple2<Object, BSONObject>, Message>() {
        public Message call(Tuple2<Object, BSONObject> tuple) {
            // tuple._2 is the full document; headers is an embedded subdocument
            BSONObject header = (BSONObject) tuple._2.get("headers");
            Message m = new Message();
            m.setTo((String) header.get("To"));
            m.setX_From((String) header.get("From"));
            m.setMessage_ID((String) header.get("Message-ID"));
            m.setBody((String) tuple._2.get("body"));
            return m;
        }
    }
);
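The connector can also write RDDs back to MongoDB via MongoOutputFormat. A sketch, under the assumption that a hypothetical results RDD has been converted back to (key, BSONObject) pairs; the output URI and the unused path argument are illustrative:

import com.mongodb.hadoop.MongoOutputFormat;

Configuration outputConfig = new Configuration();
outputConfig.set(
    "mongo.output.uri",
    "mongodb://localhost:27017/db.output"
);

// MongoOutputFormat ignores the file path, but the API requires one
results.saveAsNewAPIHadoopFile(
    "file:///unused",
    Object.class,
    BSONObject.class,
    MongoOutputFormat.class,
    outputConfig
);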
& Spark
DEMO
THANKS!
{ Name: 'Bryan Reinero',
  Title: 'Developer Advocate',
  Twitter: '@blimpyacht',
  Email: 'bryan@mongodb.com' }


Editor's Notes

  • #29 A fault-tolerant collection of elements operated on in parallel; best suited for batch applications
  • #45 MongoInputFormat allows us to read from a live MongoDB instance. We could also use BSONFileInputFormat to read BSON snapshots.