MongoDB Aggregation Framework

Quick Overview of
Document-oriented
Schemaless
JSON-style documents
Rich Queries
Scales Horizontally
db.users.find({
last_name: 'Smith',
age: {$gt : 10}
});
SELECT * FROM users WHERE
last_name=‘Smith’ AND age > 10;

Computing Aggregations in
Databases
SQL-based
RDBMS
JOIN
GROUP BY
AVG(),
COUNT(),
SUM(), FIRST(),
LAST(),
etc.
MongoDB 2.0
MapReduce
MongoDB 2.2+
MapReduce
Aggregation Framework

MapReduce
var map = function()
{
...
emit(key, val);
}
var reduce = function(key, vals)
{
...
return resultVal;
}
Data
Map()
emit(k,v)
Sort(k)
Group(k)
Reduce(k,values)
k,v
Finalize(k,v)
k,v
MongoDB
map iterates on
documents
Document is $this
1 at time per shard
Input matches output
Can run multiple times

What’s wrong with just using
MapReduce?
Map/Reduce is very
powerful, but often overkill
Lots of users relying on it
for simple aggregation tasks
•
•

What’s wrong with just using
MapReduce?
Easy to screw up JavaScript
Debugging a M/R job sucks
Writing more JS for simple tasks should not be necessary
•
•
•
(ಠ︿ಠ)

Aggregation
Framework
Declarative (no need to write JS)
Implemented directly in C++
Expression Evaluation
Return computed values
Framework: We can extend it with new
ops
•
•
•
•
•

Input
Data
(collection)
Filter
Project
Unwind
Group
Sort
Limit
Result
(document)

db.article.aggregate(
{ $project : {author : 1,tags : 1}},
{ $unwind : "$tags" },
{ $group : {_id : “$tags”,
authors:{ $addToSet:"$author"}}
}
);
An aggregation command looks like:

{ $project : {author : 1, tags : 1}},
{ $group : {
_id : “$tags”,
authors : { $addToSet:"$author"}
}}
);
New Helper
Method:
.aggregate()
Operator
pipeline
db.runCommand({
aggregate : "article",
pipeline : [ {$op1, $op2, ...} ]
}

{
"result" : [
{ "_id" : "art", "authors" : [ "bill", "bob" ] },
{ "_id" : "sports", "authors" : [ "jane", "bob" ] },
{ "_id" : "food", "authors" : [ "jane", "bob" ] },
{ "_id" : "science", "authors" : [ "jane", "bill", "bob" ] }
],
"ok" : 1
}
Output Document Looks like this:
result: array of pipeline
output
ok: 1 for success, 0
otherwise

Pipeline
Input to the start of the pipeline is a collection
Series of operators - each one filters or transforms its
input
Passes output data to next operator in the pipeline
Output of the pipeline is the result document
•
•
•
•
ps -ax | tee processes.txt | more
Kind of like UNIX:

Let’s do:
1. Tour of the pipeline
operators
2. A couple examples based on
common SQL aggregation tasks
$match
$unwind
$group
$project
$skip $limit $sort

filters documents from pipeline with a query predicate
filtered with:
{$match: {author:”bob”}}
$match
{author: "bob", pageViews:5, title:"Lorem Ipsum..."}
{author: "bill", pageViews:3, title:"dolor sit amet..."}
{author: "joe", pageViews:52, title:"consectetur adipi..."}
{author: "jane", pageViews:51, title:"sed diam..."}
{author: "bob", pageViews:14, title:"magna aliquam..."}
{author: "bob", pageViews:53, title:"claritas est..."}
filtered with:
{$match: {pageViews:{$gt:50}}
{author:"bob",pageViews:5,title:"Lorem Ipsum..."}
{author:"bob",pageViews:14,title:"magna aliquam..."}
{author:"bob",pageViews:53,title:"claritas est..."}
{author: "joe", pageViews:52, title:"consectetur adipiscing..."}
{author: "jane", pageViews:51, title:"sed diam..."}
{author: "bob", pageViews:53, title:"claritas est..."}
Input:

$unwind
{
"_id" : ObjectId("4f...146"),
"author" : "bob",
"tags" :[ "fun","good","awesome"]
}
explode the “tags” array with:
{ $unwind : ”$tags” }
{ _id : ObjectId("4f...146"), author : "bob", tags:"fun"},
{ _id : ObjectId("4f...146"), author : "bob", tags:"good"},
{ _id : ObjectId("4f...146"), author : "bob", tags:"awesome"}
produces output:
Produce a new document for
each value in an input array

Bucket a subset of docs together,
calculate an aggregated output doc from the bucket
$sum
$max, $min
$avg
$first, $last
$addToSet
$push
{ $group : {
_id : "$author",
viewsPerAuthor : { $sum :
"$pageViews" }
}
}
);
$group
Output
Calculation
Operators:

{ $group : {
_id : "$author",
viewsPerAuthor : { $sum : "$pageViews" }
}
}
);
_id: selects a field to use as
bucket key for grouping
Output field name Operation used to calculate the
output value
($sum, $max, $avg, etc.)
$group (cont’d)
dot notation (nested fields)
a constant
a multi-key expression inside
{...}
•
•
•
also allowed here:

An example with $match and $group
SELECT SUM(price) FROM orders
WHERE customer_id = 4;
MongoDB:
SQL:
db.orders.aggregate(
{$match : {“$customer_id” : 4}},
{$group : { _id : null,
total: {$sum : “price”}})
English: Find the sum of all prices of the
orders placed by customer #4

An example with $unwind and $group
MongoDB:
SQL:
English:
db.posts.aggregate(
{ $group : {
_id : “$tags”,
authors : { $addToSet : "$author" }
}}
);
For all tags used in blog posts, produce a list of
authors that have posted under each tag
SELECT tag, author FROM post_tags LEFT
JOIN posts ON post_tags.post_id =
posts.id GROUP BY tag, author;

More operators - Controlling Pipeline Input
$skip
$limit
$sort
Similar to:
.skip()
.limit()
.sort()
in a regular Mongo query

$sort
specified the same way as index keys:
{ $sort : { name : 1, age: -1 } }
Must be used in order to take
advantage of $first/$last with
$group.
order input documents

$limit
limit the number of input documents
{$limit : 5}
$skip
skips over documents
{$skip : 5}

$project
Use for:
Add, Remove, Pull up, Push down, Rename
Fields
Building computed fields
Reshape a document

$project
(cont’d)
Include or exclude fields
{$project :
{ title : 1,
author : 1} }
Only pass on fields
“title” and “author”
{$project : { comments : 0}
Exclude
“comments” field,
keep everything
else

Moving + Renaming fields
{$project :
{ page_views : “$pageViews”,
catName : “$category.name”,
info : {
published : “$ctime”,
update : “$mtime”
}
}
}
Rename page_views to pageViews
Take nested field
“category.name”, move
it into top-level field
called “catName”
Populate a new
sub-document
into the output
$project
(cont’d)

{ $project : {
name : 1,
age_fixed : { $add:["$age", 2] }
}}
);
Building a Computed Field
Output
(computed field) Operands
Expression
$project
(cont’d)

Lots of Available
Expressions
$project
(cont’d)
Numeric $add $sub $mod $divide $multiply
Logical $eq $lte/$lt $gte/$gt $and $not $or $eq
Dates
$dayOfMonth $dayOfYear $dayOfWeek $second $minute
$hour $week $month $isoDate
Strings $substr $add $toLower $toUpper $strcasecmp

Example: $sort → $limit → $project→
$group
MongoDB:
SQL:
English: Of the most recent 1000 blog posts, how many
were posted within each calendar year?
SELECT YEAR(pub_time) as pub_year,
COUNT(*) FROM
(SELECT pub_time FROM posts ORDER BY
pub_time desc)
GROUP BY pub_year;
db.test.aggregate(
{$sort : {pub_time: -1}},
{$limit : 1000},
{$project:{pub_year:{$year:["$pub_time"]}}},
{$group: {_id:"$pub_year", num_year:{$sum:1}}}
)

Some Usage Notes
In BSON, order matters - so computed
fields always show up after regular fields
We use $ in front of field names to
distinguish fields from string literals
in expressions “$name”
“name”
vs.

Some Usage Notes
Use a $match,$sort and $limit
first in pipeline if possible
Cumulative Operators $group:
be aware of memory usage
Use $project to discard unneeded fields
Remember the 16MB output limit

Aggregation vs.
MapReduce
Framework is geared towards counting/accumulating
If you need something more exotic, use
MapReduce
No 16MB constraint on output size with
MapReduce
JS in M/R is not limited to any fixed set of expressions
•
•
•
•

thanks! ✌(-‿-)✌
questions?
$$$ BTW: we are hiring!
http://10gen.com/jobs $$$
@mpobrien
github.com/mpobrien
hit me up:

MongoDB Aggregation Framework

More Related Content

What's hot

Viewers also liked

Similar to MongoDB Aggregation Framework

More from Caserta

Recently uploaded

In this document

MongoDB Aggregation Framework