The document provides an extensive overview of MongoDB, covering its functionality, advantages, and core concepts such as data manipulation, indexing, and scalability. It discusses the architecture of MongoDB, the differences between ACID and BASE consistency models, and various operations like creating, reading, updating, and deleting documents across different scenarios. Additionally, it touches on performance optimization through indexing and highlights the importance of managing indices for querying efficiency.
Introduction to the training session on MongoDB, covering agenda items including NoSQL, CRUD operations, indexing, and data aggregation.
Explaining NoSQL classifications (Key-Value, Document, Column, Graph) and Big Data’s 3 V’s - Volume, Velocity, Variety.
Discussion on scaling strategies: vertical scaling (RAM, CPU, storage) and horizontal scaling, emphasizing how they differ and their operational implications.
Introduces CAP Theorem components—Consistency, Availability, Partition Tolerance—highlighting trade-offs in distributed systems and ACID vs BASE models.
Overview of MongoDB: document database, performance, flexible schema, scalability, high availability, with examples of data structure.
Overview of MongoDB architecture details and query capabilities, including rich query language and aggregation framework.
Describes how to manipulate data in MongoDB: terminology comparisons, basic commands to create collections and databases.
Details on updating and deleting documents in MongoDB, including modifications to documents and document arrays.
Introduction to indexing concepts, creation of indices, types of indices (unique, sparse, geospatial), and how they improve query performance.
Explains Map/Reduce as a programming model for processing big data, highlighting its operation through the word count example.
Introduction to the Aggregation Framework for efficient data processing, highlighting features and the pipeline concept.
Detailed analysis of pipeline operators such as $match, $sort, $project, and $group for effective data aggregation.
Discusses the importance of data replication in MongoDB for high availability and outlines the processes of setting up replica sets.
Focuses on data consistency levels in replication, different write concerns, and mechanisms for ensuring data reliability in replicas.
Explains sharding as a method for processing large datasets by partitioning data across multiple servers to enhance performance.
Details on shard key selection, query routing in sharding, and considerations for managing data distribution across shards.
Describes handling both exact and distributed queries in a sharded environment and the mechanisms to manage them effectively.
Encouragement to further engage with MongoDB education resources for deeper knowledge.
About me
Big Data Nerd
Hadoop Trainer, MongoDB Author
Photography Enthusiast
Travelpirate
About us
codecentric is a bunch of…
Big Data Nerds
Agile Ninjas
Continuous Delivery Gurus
Enterprise Java Specialists
Performance Geeks
Join us!
Agenda I
1. Introduction to NoSQL & MongoDB
2. Data manipulation: Learn how to CRUD with MongoDB
3. Indexing: Speed up your queries with MongoDB
4. MapReduce: Data aggregation with MongoDB
Agenda II
5. Aggregation Framework: Data aggregation done the MongoDB way
6. Replication: High availability with MongoDB
7. Sharding: Scaling with MongoDB
The CAP Theorem
Consistency: all nodes see the same data at the same time
Availability: a guarantee that every request receives a response
Partition Tolerance: failure of single nodes doesn't affect the overall system
Overview of NoSQL systems
(Diagram: NoSQL systems positioned on the CAP triangle according to which two of Consistency, Availability, and Partition Tolerance they favor.)
ACID vs. BASE
ACID:
- Strong consistency
- Isolation & Transactions
- Two-Phase Commit
- Complex development
- More reliable
BASE:
- Eventual consistency
- Highly available
- "Fire-and-forget"
- Eases development
- Faster
Open Source Database
• MongoDB is an open source project
• Available on GitHub
  – https://github.com/mongodb/mongo
• Uses the AGPL license
• Started and sponsored by MongoDB Inc. (prior: 10gen)
• Commercial version and support available
• Join the crowd!
  – https://jira.mongodb.org
Scalability
Auto Sharding
• Increase capacity as you go
• Commodity and cloud architectures
• Improved operational simplicity and cost visibility
High Availability
• Automated replication and failover
• Multi-data center support
• Improved operational simplicity (e.g., HW swaps)
• Data durability and consistency
Driver & Shell
Drivers are available for almost all popular programming languages and frameworks: Java, JavaScript, Python, Ruby, Perl, Haskell, …
A shell to interact with the database is included.
> db.collection.insert({ product: "MongoDB",
                         type: "Document Database" })
>
> db.collection.findOne()
{
  "_id" : ObjectId("5106c1c2fc629bfe52792e86"),
  "product" : "MongoDB",
  "type" : "Document Database"
}
NoSQL Trends
(Charts: MongoDB leads its competitors in Google Search volume, LinkedIn job skills, the Jaspersoft Big Data Index of direct real-time downloads, and Indeed.com job trends.)
Top Job Trends: 1. HTML 5, 2. MongoDB, 3. iOS, 4. Android, 5. Mobile Apps, 6. Puppet, 7. Hadoop, 8. jQuery, 9. PaaS, 10. Social Media
Create a database
// Show all databases
> show dbs
digg   0.078125GB
enron  1.49951171875GB
// Switch to a database
> use blog
// Show all databases again - the new database is not listed yet
> show dbs
digg   0.078125GB
enron  1.49951171875GB
Create a collection I
// Show all collections
> show collections
// Insert a user
> db.user.insert(
    { name : "Sheldon",
      mail : "sheldon@bigbang.com" }
  )
No feedback about the result of the insert; use:
db.runCommand( { getLastError: 1 } )
Create a collection II
// Show all collections
> show collections
system.indexes
user
// Show all databases
> show dbs
blog   0.0625GB
digg   0.078125GB
enron  1.49951171875GB
Databases and collections are automatically created during the first insert operation!
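Collections can also be created explicitly, e.g. to pass options such as a size limit for a capped collection. A minimal sketch (collection names and size are illustrative):
// Create a collection explicitly
> db.createCollection("log")
// Create a capped collection of at most 100000 bytes
> db.createCollection("recent_events", { capped : true, size : 100000 })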
Read from a collection
// Show the first document
> db.user.findOne()
{
  "_id" : ObjectId("516684a32f391f3c2fcb80ed"),
  "name" : "Sheldon",
  "mail" : "sheldon@bigbang.com"
}
// Show all documents of a collection
> db.user.find()
{
  "_id" : ObjectId("516684a32f391f3c2fcb80ed"),
  "name" : "Sheldon",
  "mail" : "sheldon@bigbang.com"
}
Find documents
// Find a specific document
> db.user.find( { name : "Penny" } )
{
  "_id" : ObjectId("5166a9dc2f391f3c2fcb80f1"),
  "name" : "Penny",
  "mail" : "penny@bigbang.com"
}
// Show only certain fields of the document
> db.user.find( { name : "Penny" },
                { _id : 0, mail : 1 } )
{ "mail" : "penny@bigbang.com" }
_id
• _id is the primary key in MongoDB
• _id is created automatically
• If not specified differently, its type is ObjectId
• _id can be specified by the user during the insert of documents, but needs to be unique (and can not be edited afterwards)
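For example, a user-defined _id must be unique; a second insert with the same value is rejected (a sketch, error output abbreviated):
// Insert a document with a custom _id
> db.user.insert( { _id : "sheldon", name : "Sheldon" } )
// A second insert with the same _id fails
> db.user.insert( { _id : "sheldon", name : "Impostor" } )
E11000 duplicate key error ...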
ObjectId
• An ObjectId is a special 12-byte value
• Its uniqueness in the whole cluster is guaranteed as follows:
ObjectId("50804d0bd94ccab2da652599")
          |--ts--||-mac-||pid||-inc-|
(timestamp, machine id, process id, counter)
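The embedded timestamp can be read back in the shell (output shown for this example ObjectId):
// Extract the creation time from an ObjectId
> ObjectId("50804d0bd94ccab2da652599").getTimestamp()
ISODate("2012-10-18T18:40:11Z")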
Cursor
// Use a cursor with find()
> var myCursor = db.user.find()
// Get the next document
> var myDocument = myCursor.hasNext() ? myCursor.next() : null;
> if (myDocument) { printjson(myDocument.mail); }
// Show all remaining documents
> myCursor.forEach(printjson);
By default the shell displays 20 documents.
Logical operators
// Find documents using OR
> db.user.find(
    { $or : [ { name : "Sheldon" },
              { mail : "amy@bigbang.com" }
            ]
    })
// Find documents using AND
> db.user.find(
    { $and : [ { name : "Sheldon" },
               { mail : "amy@bigbang.com" }
             ]
    })
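Comparison operators follow the same pattern. A sketch (the age field is illustrative and not part of the example data set):
// Match one of several values
> db.user.find( { name : { $in : [ "Sheldon", "Penny" ] } } )
// Range query
> db.user.find( { age : { $gte : 21, $lt : 65 } } )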
Manipulating results
// Sort documents
> db.user.find().sort( { name : 1 } )  // Ascending
> db.user.find().sort( { name : -1 } ) // Descending
// Limit the number of documents
> db.user.find().limit(3)
// Skip documents
> db.user.find().skip(2)
// Combination of both methods
> db.user.find().skip(2).limit(3)
Updating documents I
// Updating only the mail address (How not to do it…)
> db.user.update( { name : "Sheldon" },
    { mail : "sheldon@howimetyourmother.com" }
  )
// Result of the update operation: the whole document has been replaced!
> db.user.findOne()
{
  "_id" : ObjectId("516684a32f391f3c2fcb80ed"),
  "mail" : "sheldon@howimetyourmother.com"
}
Be careful when updating documents!
Deleting documents
// Deleting a document
> db.user.remove(
    { mail : "sheldon@howimetyourmother.com" }
  )
// Deleting all documents in a collection
> db.user.remove()
// Use a condition to delete documents
> db.user.remove( { mail : /.*mother.com$/ } )
// Delete only the first document matching a condition
> db.user.remove( { mail : /.*.com$/ }, true )
Updating documents II
// Updating only the mail address (This time for real)
> db.user.update( { name : "Sheldon" },
    { $set : {
        mail : "sheldon@howimetyourmother.com"
    }})
// Show the result of the update operation
> db.user.find( { name : "Sheldon" } )
{
  "_id" : ObjectId("5166ba122f391f3c2fcb80f5"),
  "mail" : "sheldon@howimetyourmother.com",
  "name" : "Sheldon"
}
Adding to arrays
// Adding an array
> db.user.update( { name : "Sheldon" },
    { $set : { enemies :
        [ { name : "Wil Wheaton" },
          { name : "Barry Kripke" }
        ]
    }})
// Adding a value to the array
> db.user.update( { name : "Sheldon" },
    { $push : { enemies :
        { name : "Leslie Winkle" }
    }})
Deleting from arrays
// Deleting a value from an array
> db.user.update( { name : "Sheldon" },
    { $pull : { enemies :
        { name : "Barry Kripke" }
    }})
// Deleting a complete array
> db.user.update( { name : "Sheldon" },
    { $unset : { enemies : 1 } }
  )
Querying subdocuments
// Finding out the name of the mother
> db.user.find( { name : "Sheldon" },
                { "mother.name" : 1 } )
{
  "_id" : ObjectId("5166cf162f391f3c2fcb80f7"),
  "mother" : {
    "name" : "Mary Cooper"
  }
}
Dotted field names need to be enclosed in "…"!
Overview of all update operators
For fields: $inc, $rename, $set, $unset
Bitwise operations: $bit
Isolation: $isolated
For arrays: $addToSet, $pop, $pullAll, $pull, $pushAll, $push, $each (modifier), $slice (modifier), $sort (modifier)
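The array modifiers are combined with $push. A sketch (the added values are illustrative):
// Add several enemies at once, keep the array sorted by name,
// and retain only the last 5 entries
> db.user.update( { name : "Sheldon" },
    { $push : { enemies : {
        $each  : [ { name : "Leonard Hofstadter" }, { name : "Howard Wolowitz" } ],
        $sort  : { name : 1 },
        $slice : -5
    }}})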
How do I create an index?
// Create a non-existing index for a field
> db.recipes.createIndex({ main_ingredient: 1 })
// Make sure there is an index on the field
> db.recipes.ensureIndex({ main_ingredient: 1 })
* 1 for ascending, -1 for descending
What can be indexed?
// Subdocuments
{
  name : 'Apple Pie',
  contributor : {
    name : 'Joe American',
    id : 'joea123'
  }
}
> db.recipes.ensureIndex({ 'contributor.id': 1 })
> db.recipes.ensureIndex({ 'contributor': 1 })
How to maintain indices?
// List all indices of a collection
> db.recipes.getIndexes()
> db.recipes.getIndexKeys()
// Drop an index
> db.recipes.dropIndex({ ingredients: 1 })
// Drop and recreate all indices of a collection
> db.recipes.reIndex()
More options
• Unique Index
  – Allows only unique values in the indexed field(s)
• Sparse Index
  – For fields that are not available in all documents
• Geospatial Index
  – For modelling 2D and 3D geospatial indices
• TTL Collections
  – Documents are automatically deleted after x seconds
Unique Index
// Make sure the name of a recipe is unique
> db.recipes.ensureIndex( { name: 1 }, { unique: true } )
// Force an index on a collection with non-unique values
// Duplicates will be deleted more or less randomly!
> db.recipes.ensureIndex(
    { name: 1 },
    { unique: true, dropDups: true }
  )
* dropDups should be used only with caution!
Sparse Index
// Only documents with the field calories will be indexed
> db.recipes.ensureIndex(
    { calories: -1 },
    { sparse: true }
  )
// Combination with a unique index is possible
> db.recipes.ensureIndex(
    { name: 1, calories: -1 },
    { unique: true, sparse: true }
  )
* Missing fields will be saved as null in the index!
Geospatial Index
// Add longitude and latitude
{
  name : 'codecentric Frankfurt',
  loc : [ 50.11678, 8.67206 ]
}
// Index the 2D coordinates
> db.locations.ensureIndex( { loc : '2d' } )
// Find locations near codecentric Frankfurt
> db.locations.find({
    loc : { $near : [ 50.1, 8.7 ] }
  })
TTL Collections
// Documents need a field of type BSON UTC datetime
{ 'submitted_date' : ISODate('2012-10-12T05:24:07.211Z'), … }
// Documents will be deleted automatically by a daemon process
// after 'expireAfterSeconds'
> db.recipes.ensureIndex(
    { submitted_date: 1 },
    { expireAfterSeconds: 3600 }
  )
Limitations of indices
• Collections can't have more than 64 indices
• Index keys are not allowed to be larger than 1024 bytes
• The name of an index (including namespace) must be less than 128 characters
• Queries can only make use of one index
  – Exception: Queries using $or
• MongoDB tries to keep indices in memory
• Indices slow down the writing of data
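To illustrate the $or exception, a sketch: each $or clause can be resolved with its own index.
> db.things.ensureIndex({ a: 1 })
> db.things.ensureIndex({ b: 1 })
// Each clause of the $or may use a different index
> db.things.find({ $or : [ { a: 1 }, { b: 2 } ] })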
Best practice
1. Identify slow queries
2. Find out more about the slow queries using explain()
3. Create appropriate indices on the fields being queried
4. Optimize the query taking the available indices into account
1. Identify slow queries
// Set the profiling level (slowms defaults to 100 ms)
> db.setProfilingLevel( n, slowms )
n=0: Profiler off
n=1: Log all operations slower than slowms
n=2: Log all operations
> db.system.profile.find()
* system.profile is a capped collection with a limited number of entries
2. Metrics of the execution plan I
• cursor
  – The type of the cursor: BasicCursor means no index has been used
• n
  – The number of matched documents
• nscannedObjects
  – The number of scanned documents
• nscanned
  – The number of scanned entries (index entries or documents)
2. Metrics of the execution plan II
• millis
  – Execution time of the query
• Complete reference can be found here:
  – http://docs.mongodb.org/manual/reference/explain
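A sketch of what these metrics look like in practice (illustrative output, abbreviated):
> db.user.find({ name: "Sheldon" }).explain()
{
  "cursor" : "BasicCursor",   // no index has been used
  "n" : 1,                    // 1 document matched...
  "nscannedObjects" : 4,      // ...but 4 documents were scanned
  "nscanned" : 4,
  "millis" : 0
}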
Optimize for: nscanned / n = 1
(ideally, every scanned entry is a matched document)
4. Optimize queries taking the available indices into account
// Using the following index…
> db.collection.ensureIndex({ a:1, b:1, c:1, d:1 })
// … these queries and sorts can make use of the index
> db.collection.find().sort({ a:1 })
> db.collection.find().sort({ a:1, b:1 })
> db.collection.find({ a:4 }).sort({ a:1, b:1 })
> db.collection.find({ b:5 }).sort({ a:1, b:1 })
4. Optimize queries taking the available indices into account
// Using the following index…
> db.collection.ensureIndex({ a:1, b:1, c:1, d:1 })
// … these queries can not make use of it
> db.collection.find().sort({ b: 1 })
> db.collection.find({ b: 5 }).sort({ b: 1 })
4. Optimize queries taking the available indices into account
// Using the following index…
> db.recipes.ensureIndex({ main_ingredient: 1, name: 1 })
// … this query can be completely satisfied using the index!
> db.recipes.find(
    { main_ingredient: 'chicken' },
    { _id: 0, name: 1 }
  )
// The metric indexOnly using explain() verifies this:
> db.recipes.find(
    { main_ingredient: 'chicken' },
    { _id: 0, name: 1 }
  ).explain()
{
  "indexOnly": true,
}
Use specific indices
// Tell MongoDB explicitly which index to use
> db.recipes.find(
    { calories: { $lt: 1000 } }
  ).hint({ _id: 1 })
// Switch off the usage of indices completely (e.g. for performance
// measurements)
> db.recipes.find(
    { calories: { $lt: 1000 } }
  ).hint({ $natural: 1 })
Using multiple indices
// MongoDB can only use one index per query!
> db.collection.ensureIndex({ a: 1 })
> db.collection.ensureIndex({ b: 1 })
// For this query only one of those two indices can be used
> db.collection.find({ a: 3, b: 4 })
Compound indices
// Compound indices are often very efficient!
> db.collection.ensureIndex({ a: 1, b: 1, c: 1 })
// But only if the query is a prefix of the index...
// This query can not make use of the index
> db.collection.find({ c: 2 })
// …but this query can
> db.collection.find({ a: 3, b: 5 })
Indices with low selectivity
// The following field has only few distinct values
> db.collection.distinct('status')
[ 'new', 'processed' ]
// An index on this field is not the best idea…
> db.collection.ensureIndex({ status: 1 })
> db.collection.find({ status: 'new' })
// Better: use an adequate compound index with other fields
> db.collection.ensureIndex({ status: 1, created_at: -1 })
> db.collection.find(
    { status: 'new' }
  ).sort({ created_at: -1 })
Regular expressions & Indices
> db.users.ensureIndex({ username: 1 })
// Left-anchored regular expressions can make use of this index
> db.users.find({ username: /^joe smith/ })
// But not queries with regular expressions in general…
> db.users.find({ username: /smith/ })
// Also not case-insensitive queries…
> db.users.find({ username: /^Joe/i })
Negations & Indices
// Negations can not make use of indices
> db.things.ensureIndex({ x: 1 })
// e.g. queries using not equal
> db.things.find({ x: { $ne: 3 } })
// …or queries with not in
> db.things.find({ x: { $nin: [ 2, 3, 4 ] } })
// …or queries with the $not operator
> db.people.find({ name: { $not: /^John Doe$/ } })
What is Map/Reduce?
• Programming model coming from functional languages
• Framework for
  – parallel processing
  – of big volume data
  – using distributed systems
• Made popular by Google
  – Invented to calculate the inverted search index mapping keywords to web sites (and to compute PageRank)
  – http://research.google.com/archive/mapreduce.html
Basics
• Not something special about MongoDB; other implementations include:
  – Hadoop
  – Disco
  – Amazon Elastic MapReduce
  – …
• Based on key-value pairs
• Prior to version 2.4 and the introduction of the V8 JavaScript engine, only one thread per shard
Word Count: Problem
Problem: How often does each word appear across all documents?
(Diagram: the input documents {MongoDB uses MapReduce}, {There is a map phase}, and {There is a reduce phase} flow through mapper → group/sort → reducer, producing the output: a: 2, is: 2, map: 1, mapreduce: 1, mongodb: 1, phase: 2, reduce: 1, there: 2, uses: 1.)
Word Count: Tweets
// Example: Twitter database with tweets
> db.tweets.findOne()
{
  "_id" : ObjectId("4fb9fb91d066d657de8d6f38"),
  "text" : "RT @RevRunWisdom: The bravest thing that men do is love women #love",
  "created_at" : "Thu Sep 02 18:11:24 +0000 2010",
  …
  "user" : {
    "friends_count" : 0,
    "profile_sidebar_fill_color" : "252429",
    "screen_name" : "RevRunWisdom",
    "name" : "Rev Run",
  },
  …
Word Count: map()
// Map function with simple data cleansing
map = function() {
  this.text.split(' ').forEach(function(word) {
    // Remove whitespace
    word = word.replace(/\s/g, "");
    // Remove all non-word-characters
    word = word.replace(/\W/gm, "");
    // Finally emit the cleaned up word
    if (word != "") {
      emit(word, 1);
    }
  });
};
Word Count: reduce()
// Reduce function; values may contain partial counts from
// previous reduce steps, so sum them up instead of counting them
reduce = function(key, values) {
  return Array.sum(values);
};
Word Count: Call
// Show the results using the console
> db.tweets.mapReduce(map, reduce, { out : { inline : 1 } });
// Save the results to a collection
> db.tweets.mapReduce(map, reduce, { out : "tweets_word_count" });
{
  "result" : "tweets_word_count",
  "timeMillis" : 19026,
  "counts" : {
    "input" : 53641,
    "emit" : 559217,
    "reduce" : 102057,
    "output" : 131003
  },
  "ok" : 1,
}
Typical use cases
• Counting, aggregating & summing up
  – Analyzing log entries & generating log reports
  – Generating an inverted index
  – Substituting existing ETL processes
• Counting unique values
  – Counting the number of unique visitors of a website
• Filtering, parsing & validation
  – Filtering of user data
  – Consolidation of user-generated data
• Sorting
  – Data analysis using complex sorting
Summary
• The Map/Reduce framework is very versatile & powerful
• Is implemented in JavaScript
  – Necessity to write your own map() and reduce() functions in JavaScript
  – Difficult to debug
  – Performance is highly influenced by the JavaScript engine
• Can be used for complex data analytics
• Lots of overhead for simple aggregation tasks
  – Summing up of data
  – Averaging of data
  – Grouping of data
That's why!
SELECT customer_id, SUM(price)   -- calculation of fields
FROM orders
WHERE active = true
GROUP BY customer_id             -- grouping of data
The Aggregation Framework
Has been introduced to allow 90% of real-world aggregation use cases without using the "big hammer" Map/Reduce
• Framework of methods & operators
  – Declarative
  – No own JavaScript code needed
  – Fixed set of methods and operators (but constantly under development by MongoDB Inc.)
• Implemented in C++
  – Limitations of the JavaScript engine are avoided
  – Better performance
The Aggregation Pipeline
• Processes a stream of documents
  – Input is a complete collection
  – Output is a document containing the results
• Succession of pipeline operators
  – Each tier filters or transforms the documents
  – Input documents of a tier are the output documents of the previous tier
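A sketch of a complete pipeline on the tweets collection from above (stage order and field names follow the later examples):
// Tweets per language, most active language first
> db.tweets.aggregate(
    { $match : { "user.lang" : { $ne : null } } },
    { $group : { _id : "$user.lang", count : { $sum : 1 } } },
    { $sort : { count : -1 } }
  );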
$skip
// Get the No. 4 Twitterer according to number of friends
> db.tweets.aggregate(
    { $sort : { "user.friends_count" : -1 } },
    { $skip : 3 },
    { $limit : 1 }
  );
> Skips documents
> Equivalent to .skip()
$project I
// Limit the result documents to only one field
> db.tweets.aggregate(
    { $project : { text : 1 } }
  );
// Remove _id
> db.tweets.aggregate(
    { $project : { _id : 0, text : 1 } }
  );
> Limits the fields in resulting documents
$project II
// Rename a field
> db.tweets.aggregate(
    { $project : { _id : 0, content_of_tweet : "$text" } }
  );
// Add a calculated field
> db.tweets.aggregate(
    { $project : { _id : 0, content_of_tweet : "$text",
                   number_of_friends : { $add : [ "$user.friends_count", 10 ] } } }
  );
$project III
// Add a subdocument
> db.tweets.aggregate(
    { $project : { _id : 0,
        content_of_tweet : "$text",
        user : {
          name : "$user.name",
          number_of_friends : { $add : [ "$user.friends_count", 10 ] }
        }
    }});
$group I
// Grouping by a single field
> db.tweets.aggregate(
    { $group : {
        _id : "$user.lang",
        anzahl_tweets : { $sum : 1 } }
    }
  );
> Groups documents
> Equivalent to GROUP BY in SQL
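$group supports further accumulators besides $sum. A sketch (result field names are illustrative):
// Count tweets and average the friends count per language
> db.tweets.aggregate(
    { $group : {
        _id : "$user.lang",
        count : { $sum : 1 },
        avg_friends : { $avg : "$user.friends_count" } }
    }
  );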
$unwind I
// Unwind an array
> db.tweets.aggregate(
    { $project : { _id : 0, content_of_tweet : "$text",
                   mentioned_users : "$entities.user_mentions.name" } },
    { $skip : 18 },
    { $limit : 1 },
    { $unwind : "$mentioned_users" }
  );
> Unwinds arrays and creates one document per value in the array
$unwind II
// Resulting document without $unwind
{
  "content_of_tweet" : "RT @Philanthropy: How should nonprofit groups measure their social-media efforts? A new podcast from @afine http://ht.ly/2yFlS",
  "mentioned_users" : [
    "Philanthropy",
    "Allison Fine"
  ]
}
$unwind III
// Resulting documents with $unwind
{
  "content_of_tweet" : "RT @Philanthropy: How should nonprofit groups measure their social-media efforts? A new podcast from @afine http://ht.ly/2yFlS",
  "mentioned_users" : "Philanthropy"
},
{
  "content_of_tweet" : "RT @Philanthropy: How should nonprofit groups measure their social-media efforts? A new podcast from @afine http://ht.ly/2yFlS",
  "mentioned_users" : "Allison Fine"
}
Best Practice #1
Place $match at the beginning of the pipeline to reduce the number of documents as soon as possible!
Best Practice #2
Use $project to remove fields that are not needed in the documents as soon as possible!
Best Practice #3
When placed at the beginning of the pipeline, these operators can make use of indices:
$match, $sort, $limit, $skip
The above operators can equally use indices when placed before these operators:
$project, $unwind, $group
Count all orders
SQL:
SELECT COUNT(*) AS count FROM orders
MongoDB Aggregation:
db.orders.aggregate( [
  { $group: { _id: null,
              count: { $sum: 1 } } }
] )
Total order price per customer
SQL:
SELECT cust_id, SUM(price) AS total
FROM orders
GROUP BY cust_id
ORDER BY total
MongoDB Aggregation:
db.orders.aggregate( [
  { $group: { _id: "$cust_id",
              total: { $sum: "$price" } } },
  { $sort: { total: 1 } }
] )
Sum up all orders over 250$
SQL:
SELECT cust_id, SUM(price) AS total
FROM orders
WHERE status = 'purchased'
GROUP BY cust_id
HAVING total > 250
MongoDB Aggregation:
db.orders.aggregate( [
  { $match: { status: 'purchased' } },
  { $group: { _id: "$cust_id",
              total: { $sum: "$price" } } },
  { $match: { total: { $gt: 250 } } }
] )
Why do we need replication?
• Hardware is unreliable and is doomed to fail!
• Do you want to be the person being called at night to do a manual failover?
• How about network latency?
• Different use cases for your data
  – "Regular" processing
  – Data for analysis
  – Data for backup
Tagging while writing data
• Available since 2.0
• Allows for fine-granular control
• Each node can have multiple tags
  – tags: { dc: "ny" }
  – tags: { dc: "ny", subnet: "192.168", rack: "row3rk7" }
• Allows for creating write concern rules (per replica set)
• Tags can be adapted without code changes and restarts
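A sketch of such a write concern rule, defined via getLastErrorModes in the replica set configuration (the rule name "multiDC" is illustrative; in practice, merge into the existing settings instead of replacing them):
// Define a rule: a write must be acknowledged in 2 different data centers
> var conf = rs.conf()
> conf.settings = { getLastErrorModes : { multiDC : { dc : 2 } } }
> rs.reconfig(conf)
// Use the rule as a write concern
> db.runCommand({ getLastError : 1, w : "multiDC" })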
Read Preferences
• Only primary (primary)
• Primary preferred (primaryPreferred)
• Only secondaries (secondary)
• Secondaries preferred (secondaryPreferred)
• Nearest node (nearest)
General: If more than one node is available, the nearest node will be chosen (all modes except primary)
Tagging while reading data
• Allows for more fine-granular control over where data will be read from
  – e.g. { "disk": "ssd", "use": "reporting" }
• Can be combined with other read modes
  – Except for mode "Only primary"
Configure the Read Preference
// Only primary
> cursor.setReadPref( "primary" )
// Primary preferred
> cursor.setReadPref( "primaryPreferred" )
…
// Only secondaries with tagging
> cursor.setReadPref( "secondary", [ { rack : 2 } ] )
The read preference must be configured before using the cursor to read data!
Maintenance & Upgrades
• Zero downtime
• Rolling upgrades and maintenance
  – Start with all secondaries
  – Step down the current primary
  – Upgrade the primary as the last one
  – Restore the previous primary (if needed)
• Commands:
  – rs.stepDown(<secs>)
  – db.version()
  – db.serverBuildInfo()
Replica set – 1 data center
• One
  – Data center
  – Switch
  – Power supply
• Possible errors:
  – Failure of 2 nodes
  – Power supply
  – Network
  – Data center
• Automatic recovery
Replica set – 2 data centers
• Additional node for data recovery
• No writing to both data centers since there is only one node in data center no. 2
Replica set – 3 data centers
• Can recover from a complete data center failure
• Allows for usage of w = { dc : 2 } to guarantee writing to 2 data centers (via tagging)
Commands
• Administration of the nodes
  – rs.conf()
  – rs.initiate(<conf>) & rs.reconfig(<conf>)
  – rs.add(host:<port>) & rs.addArb(host:<port>)
  – rs.status()
  – rs.stepDown(<secs>)
• Reconfiguration if a minority of the nodes is not available
  – rs.reconfig( cfg, { force: true } )
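A minimal sketch of initiating a 3-node replica set (host names are illustrative):
// Run on one of the nodes
> rs.initiate({
    _id : "rs0",
    members : [
      { _id : 0, host : "node1:27017" },
      { _id : 1, host : "node2:27017" },
      { _id : 2, host : "node3:27017" }
    ]
  })
// Check the state of the replica set
> rs.status()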
Best Practices
• Use an odd number of nodes
• Adapt the write concern to your use case
• Read from the primary except for
  – Geographical distribution
  – Data analytics
• Use logical names and not IP addresses for configuration
• Monitor the lag of the secondaries (e.g. with MMS)
Partitioning of data
• The user needs to define a shard key
• The shard key defines the distribution of data across the shards
Partitioning of data into chunks
• Initially all data is in one chunk
• Maximum chunk size: 64 MB
• MongoDB divides and distributes chunks automatically once the maximum size is exceeded
Chunks & Shards
• A shard is one node in the cluster
• A shard can be a single mongod or a replica set
Metadata Management
• Config server
  – Stores the value ranges of the chunks and their location
  – Number of config servers is 1 or 3 (production: 3)
  – Two-phase commit
Balancing & Routing Service
• mongos balances the data in the cluster
• mongos distributes data to new nodes
• mongos routes queries to the correct shard or collects results if data is spread over multiple shards
• No local data
Splitting of a chunk
• Once a chunk hits the maximum size, it will be split
• Splitting is only a logical operation; no data needs to be moved
• If the splitting of a chunk results in an imbalance of data, automatic rebalancing will be started
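The balancer that performs this rebalancing can be inspected and, e.g. for maintenance windows, paused from the mongos shell:
// Check whether the balancer is enabled
> sh.getBalancerState()
true
// Disable and re-enable the balancer
> sh.setBalancerState(false)
> sh.setBalancerState(true)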
MongoDB Auto Sharding
• Minimal effort
  – Usage of the same interfaces for mongod and mongos
• Easy configuration
  – Enable sharding for a database
    • sh.enableSharding("<database>")
  – Shard a collection in a database
    • sh.shardCollection("<database>.<collection>", shard-key-pattern)
Example of a very simple cluster
• Never use this in production!
  – Only one config server (no fault tolerance)
  – Shard is no replica set (no high availability)
  – Only one mongos and one shard (no performance improvement)
Start the config server
// Start the config server (default port 27019)
> mongod --configsvr
Start the mongos routing service
// Start the mongos router (default port 27017)
> mongos --configdb <hostname>:27019
// When using 3 config servers
> mongos --configdb <host1>:<port1>,<host2>:<port2>,<host3>:<port3>
Start the shard
// Start a shard with one mongod (default port 27018)
> mongod --shardsvr
// The shard is not yet added to the cluster!
Add the shard
// Connect to mongos and add the shard
> mongo
> sh.addShard('<host>:27018')
// When adding a replica set, you only need to add one of its nodes!
Check the configuration
// Check if the shard has been added
> db.runCommand({ listShards: 1 })
{ "shards" :
    [ { "_id" : "shard0000", "host" : "<hostname>:27018" } ],
  "ok" : 1
}
Configure sharding
// Enable sharding for a database
> sh.enableSharding("<dbname>")
// Shard a collection using a shard key
> sh.shardCollection("<dbname>.user", { "name" : 1 })
// Use a compound shard key
> sh.shardCollection("<dbname>.cars", { "year" : 1, "uniqueid" : 1 })
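The resulting distribution of databases, shards, and chunks can be reviewed at any time:
// Print a summary of the sharding configuration
> sh.status()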
Shard Key
• The shard key can not be changed
• The values of a shard key can not be changed
• The shard key needs to be indexed
• The uniqueness of the field _id is only guaranteed within a shard
• The size of a shard key is limited to 512 bytes
Considerations for the shard key
• Cardinality of data
  – The value range needs to be rather large. For example, sharding on the field loglevel with the 3 values error, warning, and info doesn't make sense.
• Distribution of data
  – Always strive for an equal distribution of data throughout all shards!
• Patterns during reading and writing
  – For example, for log data using the timestamp as a shard key can be useful if chronologically close data needs to be read or written together.
Choices for the shard key
• Single field
  – If the value range is big enough and data is distributed almost equally
• Compound fields
  – Use this if a single field is not enough in respect to value range and equal distribution
• Hash based
  – In general a random shard key is a good choice for equal distribution of data
  – For performance the shard key should be part of the queries
  – Only available since 2.4
    • sh.shardCollection("user.name", { a: "hashed" })
Example: User
{
  _id: 346,
  username: "sheldinator",
  password: "238b8be8bd133b86d1e2ba191a94f549",
  first_name: "Sheldon",
  last_name: "Cooper",
  created_on: "Mon Apr 15 15:30:32 +0000 2013",
  modified_on: "Thu Apr 18 08:11:23 +0000 2013"
}
Which shard key would you choose and why?
Example: Log data
{
  log_type: "error",  // Possible values: "error", "warn", "info"
  application: "JBoss v. 4.2.3",
  message: "Fatal error. Application will quit.",
  created_on: "Mon Apr 15 15:38:05 +0000 2013"
}
Which shard key would you choose and why?
Possible types of queries
• Exact queries
  – Data is exactly on one shard
• Distributed queries
  – Data is distributed on different shards
• Distributed queries with sorting
  – Data is distributed on different shards and needs to be sorted