The document provides an extensive overview of MongoDB, covering its functionality, advantages, and core concepts such as data manipulation, indexing, and scalability. It discusses the architecture of MongoDB, the differences between ACID and BASE consistency models, and various operations like creating, reading, updating, and deleting documents across different scenarios. Additionally, it touches on performance optimization through indexing and highlights the importance of managing indices for querying efficiency.
Introduction to the training session on MongoDB, covering agenda items including NoSQL, CRUD operations, indexing, and data aggregation.
Explaining NoSQL classifications (Key-Value, Document, Column, Graph) and Big Data’s 3 V’s - Volume, Velocity, Variety.
Discussion on scaling strategies: vertical scaling (RAM, CPU, storage) and horizontal scaling, emphasizing how they differ and their operational implications.
Introduces CAP Theorem components—Consistency, Availability, Partition Tolerance—highlighting trade-offs in distributed systems and ACID vs BASE models.
Overview of MongoDB: document database, performance, flexible schema, scalability, high availability, with examples of data structure.
Overview of MongoDB architecture details and query capabilities, including rich query language and aggregation framework.
Describes how to manipulate data in MongoDB: terminology comparisons, basic commands to create collections and databases.
Details on updating and deleting documents in MongoDB, including modifications to documents and document arrays.
Introduction to indexing concepts, creation of indices, types of indices (unique, sparse, geospatial), and how they improve query performance.
Explains Map/Reduce as a programming model for processing big data, highlighting its operation through the word count example.
Introduction to the Aggregation Framework for efficient data processing, highlighting features and the pipeline concept.
Detailed analysis of pipeline operators such as $match, $sort, $project, and $group for effective data aggregation.
Discusses the importance of data replication in MongoDB for high availability and outlines the processes of setting up replica sets.
Focuses on data consistency levels in replication, different write concerns, and mechanisms for ensuring data reliability in replicas.
Explains sharding as a method for processing large datasets by partitioning data across multiple servers to enhance performance.
Details on shard key selection, query routing in sharding, and considerations for managing data distribution across shards.
Describes handling both exact and distributed queries in a sharded environment and the mechanisms to manage them effectively.
Encouragement to further engage with MongoDB education resources for deeper knowledge.
About me
Big Data Nerd
Hadoop Trainer, MongoDB Author
Photography Enthusiast
Travelpirate
About us
codecentric is a bunch of…
Big Data Nerds
Agile Ninjas
Continuous Delivery Gurus
Enterprise Java Specialists
Performance Geeks
Join us!
Agenda I
1. Introduction to NoSQL & MongoDB
2. Data manipulation: Learn how to CRUD with MongoDB
3. Indexing: Speed up your queries with MongoDB
4. MapReduce: Data aggregation with MongoDB
Agenda II
5. Aggregation Framework: Data aggregation done the MongoDB way
6. Replication: High availability with MongoDB
7. Sharding: Scaling with MongoDB
The CAP Theorem
Consistency: all nodes see the same data at the same time
Availability: a guarantee that every request receives a response
Partition Tolerance: failure of single nodes doesn't affect the overall system
Overview of NoSQL systems
(Diagram: NoSQL systems positioned on the CAP triangle according to which two of Consistency, Availability, and Partition Tolerance they favor.)
ACID vs. BASE
ACID:
- Strong consistency
- Isolation & Transactions
- Two-Phase Commit
- Complex development
- More reliable
BASE:
- Eventual consistency
- Highly available
- "Fire-and-forget"
- Eases development
- Faster
Open Source Database
• MongoDB is an open source project
• Available on GitHub
  – https://github.com/mongodb/mongo
• Uses the AGPL license
• Started and sponsored by MongoDB Inc. (prior: 10gen)
• Commercial version and support available
• Join the crowd!
  – https://jira.mongodb.org
Scalability
Auto Sharding
• Increase capacity as you go
• Commodity and cloud architectures
• Improved operational simplicity and cost visibility
High Availability
• Automated replication and failover
• Multi-data center support
• Improved operational simplicity (e.g., HW swaps)
• Data durability and consistency
Driver & Shell
Drivers are available for almost all popular programming languages and frameworks: Java, JavaScript, Python, Ruby, Perl, Haskell, …
A shell to interact with the database is included.
> db.collection.insert({ product: "MongoDB",
                         type: "Document Database" })
>
> db.collection.findOne()
{
  "_id" : ObjectId("5106c1c2fc629bfe52792e86"),
  "product" : "MongoDB",
  "type" : "Document Database"
}
NoSQL Trends
(Charts: MongoDB leads its competitors in Google Search volume, LinkedIn job skills, the Jaspersoft Big Data Index of direct real-time downloads, and Indeed.com job trends.)
Top Job Trends: 1. HTML 5, 2. MongoDB, 3. iOS, 4. Android, 5. Mobile Apps, 6. Puppet, 7. Hadoop, 8. jQuery, 9. PaaS, 10. Social Media
Create a database
// Show all databases
> show dbs
digg   0.078125GB
enron  1.49951171875GB
// Switch to a database
> use blog
// Show all databases again - the new database is not listed yet
> show dbs
digg   0.078125GB
enron  1.49951171875GB
Create a collection I
// Show all collections
> show collections
// Insert a user
> db.user.insert(
    { name : "Sheldon",
      mail : "sheldon@bigbang.com" }
  )
No feedback about the result of the insert; use:
db.runCommand( { getLastError: 1 } )
Create a collection II
// Show all collections
> show collections
system.indexes
user
// Show all databases
> show dbs
blog   0.0625GB
digg   0.078125GB
enron  1.49951171875GB
Databases and collections are automatically created during the first insert operation!
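Collections can also be created explicitly, e.g. to pass options such as a size limit for a capped collection. A minimal sketch (collection names and size are illustrative):
// Create a collection explicitly
> db.createCollection("log")
// Create a capped collection of at most 100000 bytes
> db.createCollection("recent_events", { capped : true, size : 100000 })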
Read from a collection
// Show the first document
> db.user.findOne()
{
  "_id" : ObjectId("516684a32f391f3c2fcb80ed"),
  "name" : "Sheldon",
  "mail" : "sheldon@bigbang.com"
}
// Show all documents of a collection
> db.user.find()
{
  "_id" : ObjectId("516684a32f391f3c2fcb80ed"),
  "name" : "Sheldon",
  "mail" : "sheldon@bigbang.com"
}
Find documents
// Find a specific document
> db.user.find( { name : "Penny" } )
{
  "_id" : ObjectId("5166a9dc2f391f3c2fcb80f1"),
  "name" : "Penny",
  "mail" : "penny@bigbang.com"
}
// Show only certain fields of the document
> db.user.find( { name : "Penny" },
                { _id : 0, mail : 1 } )
{ "mail" : "penny@bigbang.com" }
_id
• _id is the primary key in MongoDB
• _id is created automatically
• If not specified differently, its type is ObjectId
• _id can be specified by the user during the insert of documents, but needs to be unique (and can not be edited afterwards)
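For example, a user-defined _id must be unique; a second insert with the same value is rejected (a sketch, error output abbreviated):
// Insert a document with a custom _id
> db.user.insert( { _id : "sheldon", name : "Sheldon" } )
// A second insert with the same _id fails
> db.user.insert( { _id : "sheldon", name : "Impostor" } )
E11000 duplicate key error ...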
ObjectId
• An ObjectId is a special 12-byte value
• Its uniqueness in the whole cluster is guaranteed as follows:
ObjectId("50804d0bd94ccab2da652599")
          |--ts--||-mac-||pid||-inc-|
(timestamp, machine id, process id, counter)
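The embedded timestamp can be read back in the shell (output shown for this example ObjectId):
// Extract the creation time from an ObjectId
> ObjectId("50804d0bd94ccab2da652599").getTimestamp()
ISODate("2012-10-18T18:40:11Z")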
Cursor
// Use a cursor with find()
> var myCursor = db.user.find()
// Get the next document
> var myDocument = myCursor.hasNext() ? myCursor.next() : null;
> if (myDocument) { printjson(myDocument.mail); }
// Show all remaining documents
> myCursor.forEach(printjson);
By default the shell displays 20 documents.
Logical operators
// Find documents using OR
> db.user.find(
    { $or : [ { name : "Sheldon" },
              { mail : "amy@bigbang.com" }
            ]
    })
// Find documents using AND
> db.user.find(
    { $and : [ { name : "Sheldon" },
               { mail : "amy@bigbang.com" }
             ]
    })
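Comparison operators follow the same pattern. A sketch (the age field is illustrative and not part of the example data set):
// Match one of several values
> db.user.find( { name : { $in : [ "Sheldon", "Penny" ] } } )
// Range query
> db.user.find( { age : { $gte : 21, $lt : 65 } } )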
Manipulating results
// Sort documents
> db.user.find().sort( { name : 1 } )  // Ascending
> db.user.find().sort( { name : -1 } ) // Descending
// Limit the number of documents
> db.user.find().limit(3)
// Skip documents
> db.user.find().skip(2)
// Combination of both methods
> db.user.find().skip(2).limit(3)
Updating documents I
// Updating only the mail address (How not to do it…)
> db.user.update( { name : "Sheldon" },
    { mail : "sheldon@howimetyourmother.com" }
  )
// Result of the update operation: the whole document has been replaced!
> db.user.findOne()
{
  "_id" : ObjectId("516684a32f391f3c2fcb80ed"),
  "mail" : "sheldon@howimetyourmother.com"
}
Be careful when updating documents!
Deleting documents
// Deleting a document
> db.user.remove(
    { mail : "sheldon@howimetyourmother.com" }
  )
// Deleting all documents in a collection
> db.user.remove()
// Use a condition to delete documents
> db.user.remove( { mail : /.*mother.com$/ } )
// Delete only the first document matching a condition
> db.user.remove( { mail : /.*.com$/ }, true )
Updating documents II
// Updating only the mail address (This time for real)
> db.user.update( { name : "Sheldon" },
    { $set : {
        mail : "sheldon@howimetyourmother.com"
    }})
// Show the result of the update operation
> db.user.find( { name : "Sheldon" } )
{
  "_id" : ObjectId("5166ba122f391f3c2fcb80f5"),
  "mail" : "sheldon@howimetyourmother.com",
  "name" : "Sheldon"
}
Adding to arrays
// Adding an array
> db.user.update( { name : "Sheldon" },
    { $set : { enemies :
        [ { name : "Wil Wheaton" },
          { name : "Barry Kripke" }
        ]
    }})
// Adding a value to the array
> db.user.update( { name : "Sheldon" },
    { $push : { enemies :
        { name : "Leslie Winkle" }
    }})
Deleting from arrays
// Deleting a value from an array
> db.user.update( { name : "Sheldon" },
    { $pull : { enemies :
        { name : "Barry Kripke" }
    }})
// Deleting a complete array
> db.user.update( { name : "Sheldon" },
    { $unset : { enemies : 1 } }
  )
Querying subdocuments
// Finding out the name of the mother
> db.user.find( { name : "Sheldon" },
                { "mother.name" : 1 } )
{
  "_id" : ObjectId("5166cf162f391f3c2fcb80f7"),
  "mother" : {
    "name" : "Mary Cooper"
  }
}
Dotted field names need to be enclosed in "…"!
Overview of all update operators
For fields: $inc, $rename, $set, $unset
Bitwise operations: $bit
Isolation: $isolated
For arrays: $addToSet, $pop, $pullAll, $pull, $pushAll, $push, $each (modifier), $slice (modifier), $sort (modifier)
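The array modifiers are combined with $push. A sketch (the added values are illustrative):
// Add several enemies at once, keep the array sorted by name,
// and retain only the last 5 entries
> db.user.update( { name : "Sheldon" },
    { $push : { enemies : {
        $each  : [ { name : "Leonard Hofstadter" }, { name : "Howard Wolowitz" } ],
        $sort  : { name : 1 },
        $slice : -5
    }}})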
How do I create an index?
// Create a non-existing index for a field
> db.recipes.createIndex({ main_ingredient: 1 })
// Make sure there is an index on the field
> db.recipes.ensureIndex({ main_ingredient: 1 })
* 1 for ascending, -1 for descending
What can be indexed?
// Subdocuments
{
  name : 'Apple Pie',
  contributor : {
    name : 'Joe American',
    id : 'joea123'
  }
}
> db.recipes.ensureIndex({ 'contributor.id': 1 })
> db.recipes.ensureIndex({ 'contributor': 1 })
How to maintain indices?
// List all indices of a collection
> db.recipes.getIndexes()
> db.recipes.getIndexKeys()
// Drop an index
> db.recipes.dropIndex({ ingredients: 1 })
// Drop and recreate all indices of a collection
> db.recipes.reIndex()
More options
• Unique Index
  – Allows only unique values in the indexed field(s)
• Sparse Index
  – For fields that are not available in all documents
• Geospatial Index
  – For modelling 2D and 3D geospatial indices
• TTL Collections
  – Documents are automatically deleted after x seconds
Unique Index
// Make sure the name of a recipe is unique
> db.recipes.ensureIndex( { name: 1 }, { unique: true } )
// Force an index on a collection with non-unique values
// Duplicates will be deleted more or less randomly!
> db.recipes.ensureIndex(
    { name: 1 },
    { unique: true, dropDups: true }
  )
* dropDups should be used only with caution!
Sparse Index
// Only documents with the field calories will be indexed
> db.recipes.ensureIndex(
    { calories: -1 },
    { sparse: true }
  )
// Combination with a unique index is possible
> db.recipes.ensureIndex(
    { name: 1, calories: -1 },
    { unique: true, sparse: true }
  )
* Missing fields will be saved as null in the index!
Geospatial Index
// Add longitude and latitude
{
  name : 'codecentric Frankfurt',
  loc : [ 50.11678, 8.67206 ]
}
// Index the 2D coordinates
> db.locations.ensureIndex( { loc : '2d' } )
// Find locations near codecentric Frankfurt
> db.locations.find({
    loc : { $near : [ 50.1, 8.7 ] }
  })
TTL Collections
// Documents need a field of type BSON UTC datetime
{ 'submitted_date' : ISODate('2012-10-12T05:24:07.211Z'), … }
// Documents will be deleted automatically by a daemon process
// after 'expireAfterSeconds'
> db.recipes.ensureIndex(
    { submitted_date: 1 },
    { expireAfterSeconds: 3600 }
  )
Limitations of indices
• Collections can't have more than 64 indices
• Index keys are not allowed to be larger than 1024 bytes
• The name of an index (including namespace) must be less than 128 characters
• Queries can only make use of one index
  – Exception: Queries using $or
• MongoDB tries to keep indices in memory
• Indices slow down the writing of data
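To illustrate the $or exception, a sketch: each $or clause can be resolved with its own index.
> db.things.ensureIndex({ a: 1 })
> db.things.ensureIndex({ b: 1 })
// Each clause of the $or may use a different index
> db.things.find({ $or : [ { a: 1 }, { b: 2 } ] })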
Best practice
1. Identify slow queries
2. Find out more about the slow queries using explain()
3. Create appropriate indices on the fields being queried
4. Optimize the query taking the available indices into account
1. Identify slow queries
// Set the profiling level (slowms defaults to 100 ms)
> db.setProfilingLevel( n, slowms )
n=0: Profiler off
n=1: Log all operations slower than slowms
n=2: Log all operations
> db.system.profile.find()
* system.profile is a capped collection with a limited number of entries
2. Metrics of the execution plan I
• cursor
  – The type of the cursor: BasicCursor means no index has been used
• n
  – The number of matched documents
• nscannedObjects
  – The number of scanned documents
• nscanned
  – The number of scanned entries (index entries or documents)
2. Metrics of the execution plan II
• millis
  – Execution time of the query
• Complete reference can be found here:
  – http://docs.mongodb.org/manual/reference/explain
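A sketch of what these metrics look like in practice (illustrative output, abbreviated):
> db.user.find({ name: "Sheldon" }).explain()
{
  "cursor" : "BasicCursor",   // no index has been used
  "n" : 1,                    // 1 document matched...
  "nscannedObjects" : 4,      // ...but 4 documents were scanned
  "nscanned" : 4,
  "millis" : 0
}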
Optimize for: nscanned / n = 1
(ideally, every scanned entry is a matched document)
4. Optimize queries taking the available indices into account
// Using the following index…
> db.collection.ensureIndex({ a:1, b:1, c:1, d:1 })
// … these queries and sorts can make use of the index
> db.collection.find().sort({ a:1 })
> db.collection.find().sort({ a:1, b:1 })
> db.collection.find({ a:4 }).sort({ a:1, b:1 })
> db.collection.find({ b:5 }).sort({ a:1, b:1 })
4. Optimize queries taking the available indices into account
// Using the following index…
> db.collection.ensureIndex({ a:1, b:1, c:1, d:1 })
// … these queries can not make use of it
> db.collection.find().sort({ b: 1 })
> db.collection.find({ b: 5 }).sort({ b: 1 })
4. Optimize queries taking the available indices into account
// Using the following index…
> db.recipes.ensureIndex({ main_ingredient: 1, name: 1 })
// … this query can be completely satisfied using the index!
> db.recipes.find(
    { main_ingredient: 'chicken' },
    { _id: 0, name: 1 }
  )
// The metric indexOnly using explain() verifies this:
> db.recipes.find(
    { main_ingredient: 'chicken' },
    { _id: 0, name: 1 }
  ).explain()
{
  "indexOnly": true,
}
Use specific indices
// Tell MongoDB explicitly which index to use
> db.recipes.find(
    { calories: { $lt: 1000 } }
  ).hint({ _id: 1 })
// Switch off the usage of indices completely (e.g. for performance
// measurements)
> db.recipes.find(
    { calories: { $lt: 1000 } }
  ).hint({ $natural: 1 })
Using multiple indices
// MongoDB can only use one index per query!
> db.collection.ensureIndex({ a: 1 })
> db.collection.ensureIndex({ b: 1 })
// For this query only one of those two indices can be used
> db.collection.find({ a: 3, b: 4 })
Compound indices
// Compound indices are often very efficient!
> db.collection.ensureIndex({ a: 1, b: 1, c: 1 })
// But only if the query is a prefix of the index...
// This query can not make use of the index
> db.collection.find({ c: 2 })
// …but this query can
> db.collection.find({ a: 3, b: 5 })
Indices with low selectivity
// The following field has only few distinct values
> db.collection.distinct('status')
[ 'new', 'processed' ]
// An index on this field is not the best idea…
> db.collection.ensureIndex({ status: 1 })
> db.collection.find({ status: 'new' })
// Better: use an adequate compound index with other fields
> db.collection.ensureIndex({ status: 1, created_at: -1 })
> db.collection.find(
    { status: 'new' }
  ).sort({ created_at: -1 })
Regular expressions & Indices
> db.users.ensureIndex({ username: 1 })
// Left-anchored regular expressions can make use of this index
> db.users.find({ username: /^joe smith/ })
// But not queries with regular expressions in general…
> db.users.find({ username: /smith/ })
// Also not case-insensitive queries…
> db.users.find({ username: /^Joe/i })
Negations & Indices
// Negations can not make use of indices
> db.things.ensureIndex({ x: 1 })
// e.g. queries using not equal
> db.things.find({ x: { $ne: 3 } })
// …or queries with not in
> db.things.find({ x: { $nin: [ 2, 3, 4 ] } })
// …or queries with the $not operator
> db.people.find({ name: { $not: /^John Doe$/ } })
What is Map/Reduce?
• Programming model coming from functional languages
• Framework for
  – parallel processing
  – of big volume data
  – using distributed systems
• Made popular by Google
  – Invented to calculate the inverted search index mapping keywords to web sites (and to compute PageRank)
  – http://research.google.com/archive/mapreduce.html
Basics
• Not something special about MongoDB; other implementations include:
  – Hadoop
  – Disco
  – Amazon Elastic MapReduce
  – …
• Based on key-value pairs
• Prior to version 2.4 and the introduction of the V8 JavaScript engine, only one thread per shard
Word Count: Problem
Problem: How often does each word appear across all documents?
(Diagram: the input documents {MongoDB uses MapReduce}, {There is a map phase}, and {There is a reduce phase} flow through mapper → group/sort → reducer, producing the output: a: 2, is: 2, map: 1, mapreduce: 1, mongodb: 1, phase: 2, reduce: 1, there: 2, uses: 1.)
Word Count: Tweets
// Example: Twitter database with tweets
> db.tweets.findOne()
{
  "_id" : ObjectId("4fb9fb91d066d657de8d6f38"),
  "text" : "RT @RevRunWisdom: The bravest thing that men do is love women #love",
  "created_at" : "Thu Sep 02 18:11:24 +0000 2010",
  …
  "user" : {
    "friends_count" : 0,
    "profile_sidebar_fill_color" : "252429",
    "screen_name" : "RevRunWisdom",
    "name" : "Rev Run",
  },
  …
Word Count: map()
// Map function with simple data cleansing
map = function() {
  this.text.split(' ').forEach(function(word) {
    // Remove whitespace
    word = word.replace(/\s/g, "");
    // Remove all non-word-characters
    word = word.replace(/\W/gm, "");
    // Finally emit the cleaned up word
    if (word != "") {
      emit(word, 1);
    }
  });
};
Word Count: reduce()
// Reduce function; values may contain partial counts from
// previous reduce steps, so sum them up instead of counting them
reduce = function(key, values) {
  return Array.sum(values);
};
Word Count: Call
// Show the results using the console
> db.tweets.mapReduce(map, reduce, { out : { inline : 1 } });
// Save the results to a collection
> db.tweets.mapReduce(map, reduce, { out : "tweets_word_count" });
{
  "result" : "tweets_word_count",
  "timeMillis" : 19026,
  "counts" : {
    "input" : 53641,
    "emit" : 559217,
    "reduce" : 102057,
    "output" : 131003
  },
  "ok" : 1,
}
Typical use cases
• Counting, aggregating & summing up
  – Analyzing log entries & generating log reports
  – Generating an inverted index
  – Substituting existing ETL processes
• Counting unique values
  – Counting the number of unique visitors of a website
• Filtering, parsing & validation
  – Filtering of user data
  – Consolidation of user-generated data
• Sorting
  – Data analysis using complex sorting
Summary
• The Map/Reduce framework is very versatile & powerful
• Is implemented in JavaScript
  – Necessity to write your own map() and reduce() functions in JavaScript
  – Difficult to debug
  – Performance is highly influenced by the JavaScript engine
• Can be used for complex data analytics
• Lots of overhead for simple aggregation tasks
  – Summing up of data
  – Averaging of data
  – Grouping of data
That's why!
SELECT customer_id, SUM(price)   -- calculation of fields
FROM orders
WHERE active = true
GROUP BY customer_id             -- grouping of data
The Aggregation Framework
Has been introduced to allow 90% of real-world aggregation use cases without using the "big hammer" Map/Reduce
• Framework of methods & operators
  – Declarative
  – No own JavaScript code needed
  – Fixed set of methods and operators (but constantly under development by MongoDB Inc.)
• Implemented in C++
  – Limitations of the JavaScript engine are avoided
  – Better performance
The Aggregation Pipeline
• Processes a stream of documents
  – Input is a complete collection
  – Output is a document containing the results
• Succession of pipeline operators
  – Each tier filters or transforms the documents
  – Input documents of a tier are the output documents of the previous tier
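A sketch of a complete pipeline on the tweets collection from above (stage order and field names follow the later examples):
// Tweets per language, most active language first
> db.tweets.aggregate(
    { $match : { "user.lang" : { $ne : null } } },
    { $group : { _id : "$user.lang", count : { $sum : 1 } } },
    { $sort : { count : -1 } }
  );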
$skip
// Get the No. 4 Twitterer according to number of friends
> db.tweets.aggregate(
    { $sort : { "user.friends_count" : -1 } },
    { $skip : 3 },
    { $limit : 1 }
  );
> Skips documents
> Equivalent to .skip()
$project I
// Limit the result documents to only one field
> db.tweets.aggregate(
    { $project : { text : 1 } }
  );
// Remove _id
> db.tweets.aggregate(
    { $project : { _id : 0, text : 1 } }
  );
> Limits the fields in resulting documents
$project II
// Rename a field
> db.tweets.aggregate(
    { $project : { _id : 0, content_of_tweet : "$text" } }
  );
// Add a calculated field
> db.tweets.aggregate(
    { $project : { _id : 0, content_of_tweet : "$text",
                   number_of_friends : { $add : [ "$user.friends_count", 10 ] } } }
  );
$project III
// Add a subdocument
> db.tweets.aggregate(
    { $project : { _id : 0,
        content_of_tweet : "$text",
        user : {
          name : "$user.name",
          number_of_friends : { $add : [ "$user.friends_count", 10 ] }
        }
    }});
$group I
// Grouping by a single field
> db.tweets.aggregate(
    { $group : {
        _id : "$user.lang",
        anzahl_tweets : { $sum : 1 } }
    }
  );
> Groups documents
> Equivalent to GROUP BY in SQL
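$group supports further accumulators besides $sum. A sketch (result field names are illustrative):
// Count tweets and average the friends count per language
> db.tweets.aggregate(
    { $group : {
        _id : "$user.lang",
        count : { $sum : 1 },
        avg_friends : { $avg : "$user.friends_count" } }
    }
  );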
$unwind I
// Unwind an array
> db.tweets.aggregate(
    { $project : { _id : 0, content_of_tweet : "$text",
                   mentioned_users : "$entities.user_mentions.name" } },
    { $skip : 18 },
    { $limit : 1 },
    { $unwind : "$mentioned_users" }
  );
> Unwinds arrays and creates one document per value in the array
$unwind II
// Resulting document without $unwind
{
  "content_of_tweet" : "RT @Philanthropy: How should nonprofit groups measure their social-media efforts? A new podcast from @afine http://ht.ly/2yFlS",
  "mentioned_users" : [
    "Philanthropy",
    "Allison Fine"
  ]
}
$unwind III
// Resulting documents with $unwind
{
  "content_of_tweet" : "RT @Philanthropy: How should nonprofit groups measure their social-media efforts? A new podcast from @afine http://ht.ly/2yFlS",
  "mentioned_users" : "Philanthropy"
},
{
  "content_of_tweet" : "RT @Philanthropy: How should nonprofit groups measure their social-media efforts? A new podcast from @afine http://ht.ly/2yFlS",
  "mentioned_users" : "Allison Fine"
}
Best Practice #1
Place $match at the beginning of the pipeline to reduce the number of documents as soon as possible!
Best Practice #2
Use $project to remove fields that are not needed in the documents as soon as possible!
Best Practice #3
When placed at the beginning of the pipeline, these operators can make use of indices:
$match, $sort, $limit, $skip
The above operators can equally use indices when placed before these operators:
$project, $unwind, $group
Count all orders
SQL:
SELECT COUNT(*) AS count FROM orders
MongoDB Aggregation:
db.orders.aggregate( [
  { $group: { _id: null,
              count: { $sum: 1 } } }
] )
Total order price per customer
SQL:
SELECT cust_id, SUM(price) AS total
FROM orders
GROUP BY cust_id
ORDER BY total
MongoDB Aggregation:
db.orders.aggregate( [
  { $group: { _id: "$cust_id",
              total: { $sum: "$price" } } },
  { $sort: { total: 1 } }
] )
Sum up all orders over 250$
SQL:
SELECT cust_id, SUM(price) AS total
FROM orders
WHERE status = 'purchased'
GROUP BY cust_id
HAVING total > 250
MongoDB Aggregation:
db.orders.aggregate( [
  { $match: { status: 'purchased' } },
  { $group: { _id: "$cust_id",
              total: { $sum: "$price" } } },
  { $match: { total: { $gt: 250 } } }
] )
Why do we need replication?
• Hardware is unreliable and is doomed to fail!
• Do you want to be the person being called at night to do a manual failover?
• How about network latency?
• Different use cases for your data
  – "Regular" processing
  – Data for analysis
  – Data for backup
Tagging while writing data
• Available since 2.0
• Allows for fine-granular control
• Each node can have multiple tags
  – tags: { dc: "ny" }
  – tags: { dc: "ny", subnet: "192.168", rack: "row3rk7" }
• Allows for creating write concern rules (per replica set)
• Tags can be adapted without code changes and restarts
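A sketch of such a write concern rule, defined via getLastErrorModes in the replica set configuration (the rule name "multiDC" is illustrative; in practice, merge into the existing settings instead of replacing them):
// Define a rule: a write must be acknowledged in 2 different data centers
> var conf = rs.conf()
> conf.settings = { getLastErrorModes : { multiDC : { dc : 2 } } }
> rs.reconfig(conf)
// Use the rule as a write concern
> db.runCommand({ getLastError : 1, w : "multiDC" })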
Read Preferences
• Only primary (primary)
• Primary preferred (primaryPreferred)
• Only secondaries (secondary)
• Secondaries preferred (secondaryPreferred)
• Nearest node (nearest)
General: If more than one node is available, the nearest node will be chosen (all modes except primary)
Tagging while reading data
• Allows for more fine-granular control over where data will be read from
  – e.g. { "disk": "ssd", "use": "reporting" }
• Can be combined with other read modes
  – Except for mode "Only primary"
Configure the Read Preference
// Only primary
> cursor.setReadPref( "primary" )
// Primary preferred
> cursor.setReadPref( "primaryPreferred" )
…
// Only secondaries with tagging
> cursor.setReadPref( "secondary", [ { rack : 2 } ] )
The read preference must be configured before using the cursor to read data!
Maintenance & Upgrades
• Zero downtime
• Rolling upgrades and maintenance
  – Start with all secondaries
  – Step down the current primary
  – Upgrade the primary as the last one
  – Restore the previous primary (if needed)
• Commands:
  – rs.stepDown(<secs>)
  – db.version()
  – db.serverBuildInfo()
Replica set – 1 data center
• One
  – Data center
  – Switch
  – Power supply
• Possible errors:
  – Failure of 2 nodes
  – Power supply
  – Network
  – Data center
• Automatic recovery
Replica set – 2 data centers
• Additional node for data recovery
• No writing to both data centers since there is only one node in data center no. 2
Replica set – 3 data centers
• Can recover from a complete data center failure
• Allows for usage of w = { dc : 2 } to guarantee writing to 2 data centers (via tagging)
Commands
• Administration of the nodes
  – rs.conf()
  – rs.initiate(<conf>) & rs.reconfig(<conf>)
  – rs.add(host:<port>) & rs.addArb(host:<port>)
  – rs.status()
  – rs.stepDown(<secs>)
• Reconfiguration if a minority of the nodes is not available
  – rs.reconfig( cfg, { force: true } )
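A minimal sketch of initiating a 3-node replica set (host names are illustrative):
// Run on one of the nodes
> rs.initiate({
    _id : "rs0",
    members : [
      { _id : 0, host : "node1:27017" },
      { _id : 1, host : "node2:27017" },
      { _id : 2, host : "node3:27017" }
    ]
  })
// Check the state of the replica set
> rs.status()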
Best Practices
• Use an odd number of nodes
• Adapt the write concern to your use case
• Read from the primary except for
  – Geographical distribution
  – Data analytics
• Use logical names and not IP addresses for configuration
• Monitor the lag of the secondaries (e.g. with MMS)
Partitioning of data
• The user needs to define a shard key
• The shard key defines the distribution of data across the shards
Partitioning of data into chunks
• Initially all data is in one chunk
• Maximum chunk size: 64 MB
• MongoDB divides and distributes chunks automatically once the maximum size is exceeded
Chunks & Shards
• A shard is one node in the cluster
• A shard can be a single mongod or a replica set
Metadata Management
• Config server
  – Stores the value ranges of the chunks and their location
  – Number of config servers is 1 or 3 (production: 3)
  – Two-phase commit
Balancing & Routing Service
• mongos balances the data in the cluster
• mongos distributes data to new nodes
• mongos routes queries to the correct shard or collects results if data is spread over multiple shards
• No local data
Splitting of a chunk
• Once a chunk hits the maximum size, it will be split
• Splitting is only a logical operation; no data needs to be moved
• If the splitting of a chunk results in an imbalance of data, automatic rebalancing will be started
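The balancer that performs this rebalancing can be inspected and, e.g. for maintenance windows, paused from the mongos shell:
// Check whether the balancer is enabled
> sh.getBalancerState()
true
// Disable and re-enable the balancer
> sh.setBalancerState(false)
> sh.setBalancerState(true)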
MongoDB Auto Sharding
• Minimal effort
  – Usage of the same interfaces for mongod and mongos
• Easy configuration
  – Enable sharding for a database
    • sh.enableSharding("<database>")
  – Shard a collection in a database
    • sh.shardCollection("<database>.<collection>", shard-key-pattern)
Example of a very simple cluster
• Never use this in production!
  – Only one config server (no fault tolerance)
  – Shard is no replica set (no high availability)
  – Only one mongos and one shard (no performance improvement)
Start the config server
// Start the config server (default port 27019)
> mongod --configsvr
Start the mongos routing service
// Start the mongos router (default port 27017)
> mongos --configdb <hostname>:27019
// When using 3 config servers
> mongos --configdb <host1>:<port1>,<host2>:<port2>,<host3>:<port3>
Start the shard
// Start a shard with one mongod (default port 27018)
> mongod --shardsvr
// The shard is not yet added to the cluster!
Add the shard
// Connect to mongos and add the shard
> mongo
> sh.addShard('<host>:27018')
// When adding a replica set, you only need to add one of its nodes!
Check the configuration
// Check if the shard has been added
> db.runCommand({ listShards: 1 })
{ "shards" :
    [ { "_id" : "shard0000", "host" : "<hostname>:27018" } ],
  "ok" : 1
}
Configure sharding
// Enable sharding for a database
> sh.enableSharding("<dbname>")
// Shard a collection using a shard key
> sh.shardCollection("<dbname>.user", { "name" : 1 })
// Use a compound shard key
> sh.shardCollection("<dbname>.cars", { "year" : 1, "uniqueid" : 1 })
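The resulting distribution of databases, shards, and chunks can be reviewed at any time:
// Print a summary of the sharding configuration
> sh.status()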
Shard Key
• The shard key can not be changed
• The values of a shard key can not be changed
• The shard key needs to be indexed
• The uniqueness of the field _id is only guaranteed within a shard
• The size of a shard key is limited to 512 bytes
Considerations for the shard key
• Cardinality of data
  – The value range needs to be rather large. For example, sharding on the field loglevel with the 3 values error, warning, and info doesn't make sense.
• Distribution of data
  – Always strive for an equal distribution of data throughout all shards!
• Patterns during reading and writing
  – For example, for log data using the timestamp as a shard key can be useful if chronologically close data needs to be read or written together.
Choices for the shard key
• Single field
  – If the value range is big enough and data is distributed almost equally
• Compound fields
  – Use this if a single field is not enough in respect to value range and equal distribution
• Hash based
  – In general a random shard key is a good choice for equal distribution of data
  – For performance the shard key should be part of the queries
  – Only available since 2.4
    • sh.shardCollection("user.name", { a: "hashed" })
Example: User
{
  _id: 346,
  username: "sheldinator",
  password: "238b8be8bd133b86d1e2ba191a94f549",
  first_name: "Sheldon",
  last_name: "Cooper",
  created_on: "Mon Apr 15 15:30:32 +0000 2013",
  modified_on: "Thu Apr 18 08:11:23 +0000 2013"
}
Which shard key would you choose and why?
Example: Log data
{
  log_type: "error",  // Possible values: "error", "warn", "info"
  application: "JBoss v. 4.2.3",
  message: "Fatal error. Application will quit.",
  created_on: "Mon Apr 15 15:38:05 +0000 2013"
}
Which shard key would you choose and why?
Possible types of queries
• Exact queries
  – Data is exactly on one shard
• Distributed queries
  – Data is distributed on different shards
• Distributed queries with sorting
  – Data is distributed on different shards and needs to be sorted