The MongoDB Aggregation Pipeline in 5 Minutes

The aggregation pipeline is used to process and aggregate data from collections. Using the aggregation pipeline, you can process multiple documents to return a computed result.

While Mongo also supports map-reduce and single-purpose aggregation methods, the aggregation pipeline is the preferred method for data aggregation in MongoDB.

A Quick Example

For example, let's say you have a collection "posts"...

{
   "title" : "my first post",
   "author" : "Jim",
   "likes" : 5
},
{
   "title" : "my second post",
   "author" : "Jim",
   "likes" : 2
},
{
   "title" : "hello world",
   "author" : "Joe",
   "likes" : 3
}

and you want to see the total number of likes for each author...

db.posts.aggregate([
   {$group: {_id: "$author", total_likes: { $sum: "$likes"}}}
])

This aggregation query will return...

{
   "_id" : "Joe",
   "total_likes" : 3
},
{
   "_id" : "Jim",
   "total_likes" : 7
}

Pipeline Stages

The aggregation pipeline moves documents through different stages. Each stage transforms the documents as they pass through the pipeline.

Our first example defined a single stage using $group. Let's look at an example with multiple stages:

db.posts.aggregate([
   {$match: { author: "Joe"}},
   {$group: {_id: "$author", total_likes: { $sum: "$likes"}}}
])

This query returns

{
   "_id" : "Joe",
   "total_likes" : 3
}

Notice how the aggregate() method takes an array of documents. Each document in the array defines a stage in the pipeline.

The first stage uses $match to keep only the documents whose author is Joe.

The second stage uses $group to take the results from the first stage and group them by author name. Notice how the accumulator $sum is used to populate a total_likes field for the computed result.

It's important to remember that order matters. Each stage in the pipeline is an in-memory transformation of documents, and each stage sees only the output produced by the stage before it.
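To build intuition for that, a pipeline can be sketched in plain JavaScript as a chain of pure functions, where each stage receives only the previous stage's output. This is an illustrative model, not MongoDB's actual implementation; the helper names (match, group, runPipeline) are made up for the sketch:

```javascript
// Each stage is a pure function: documents in, documents out.
const match = (pred) => (docs) => docs.filter(pred);

const group = (keyFn, accFn) => (docs) => {
  const buckets = new Map();
  for (const doc of docs) {
    const key = keyFn(doc);
    buckets.set(key, accFn(buckets.get(key), doc));
  }
  return [...buckets.entries()].map(([_id, acc]) => ({ _id, ...acc }));
};

// Running a pipeline is just left-to-right function composition.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

const posts = [
  { title: "my first post", author: "Jim", likes: 5 },
  { title: "my second post", author: "Jim", likes: 2 },
  { title: "hello world", author: "Joe", likes: 3 },
];

const result = runPipeline(posts, [
  match((d) => d.author === "Joe"),
  group(
    (d) => d.author,
    (acc, d) => ({ total_likes: (acc?.total_likes ?? 0) + d.likes })
  ),
]);
// result: [{ _id: "Joe", total_likes: 3 }]
```

Swapping the order of the two stages would still work here, but the group stage would then have to process every document instead of just Joe's, which is exactly why stage order matters for performance.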

Pipeline Optimization

It's generally a good idea to perform filtering and sorting early in the pipeline. This reduces the number of records that ultimately get processed. The $match and $sort operators can leverage indexes when placed at the beginning of the pipeline.

The aggregation pipeline has built-in optimizations for certain stage sequences. When possible, Mongo automatically reorders or coalesces stages to increase performance.

For example, this...

{ $sort: { likes : -1 } },
{ $match: { author: 'Joe' } }

becomes this...

{ $match: { author: 'Joe' } },
{ $sort: { likes : -1 } }

This makes sense: moving $match ahead of $sort means the sort only runs on the documents that survive the filter, which improves performance.

Remember that Mongo automatically optimizes the aggregation pipeline for you. Other optimizations include:

  • $project + $skip
  • $redact + $match
  • $sort + $limit
  • $limit + $limit
  • $skip + $skip
  • $match + $match

These optimizations are specific to the version of Mongo you're using and occur automatically. You can include the explain option to see how the optimizer transforms your queries under the hood.
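For instance, you can ask the shell for the optimized plan like this (a mongosh sketch using the earlier example's fields; the exact plan output varies by version):

```
db.posts.explain().aggregate([
   { $sort: { likes: -1 } },
   { $match: { author: "Joe" } }
])
```

In the plan output you should see the $match reflected in the initial query stage, confirming the reordering described above.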

MongoDB Aggregation Pipeline Examples

The following examples are based on this test "posts" collection:

{
   "title" : "my first post",
   "author" : "Jim",
   "likes" : 5,
   "tags" : ["angular", "react", "javascript"]
},
{
   "title" : "my second post",
   "author" : "Jim",
   "likes" : 2,
   "tags" : ["javascript", "react", "vue"]
},
{
   "title" : "hello world",
   "author" : "Joe",
   "likes" : 3,
   "tags" : ["vue", "react"]
}

$group

db.posts.aggregate([
    { $group: { _id:"$author", titles: { $push:"$title"}} }
])

Result...

{
    "_id" : "Joe",
    "titles" : [
        "hello world"
    ]
},
{
    "_id" : "Jim",
    "titles" : [
        "my first post",
        "my second post"
    ]
}

$group groups documents by a mandatory _id key, which acts as the distinct group key. In this example, we use the $push accumulator to build an array of all the titles for each author.

$match

db.posts.aggregate([
    { $match: { author:"Jim"} }
])

Result...

{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c652e"),
    "title" : "my first post",
    "author" : "Jim",
    "likes" : 5.0,
    "tags" : [
        "angular",
        "react",
        "javascript"
    ]
},
{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c652f"),
    "title" : "my second post",
    "author" : "Jim",
    "likes" : 2.0,
    "tags" : [
        "javascript",
        "react",
        "vue"
    ]
}

$match filters documents using a regular Mongo query expression. In this example, we keep only the posts whose author is Jim.

$project

db.posts.aggregate([
    { $project: { author:1, abbrev:{ $substr: [ "$title", 0, 3 ] } } }
])

Result...

{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c652e"),
    "author" : "Jim",
    "abbrev" : "my "
},
{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c652f"),
    "author" : "Jim",
    "abbrev" : "my "
},
{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c6530"),
    "author" : "Joe",
    "abbrev" : "hel"
}

$project controls which fields are included or excluded, and can also create new fields or reshape existing ones. In this example, author: 1 specifies the inclusion of the author field, and a new abbrev field is created from a substring of the title field.

$sum

db.posts.aggregate([
   { $group: { _id: "$author", total_likes: { $sum: "$likes" } } }
])

Result...

{
   "_id" : "Joe",
   "total_likes" : 3
},
{
   "_id" : "Jim",
   "total_likes" : 7
}

$sum is an accumulator expression: it maintains state as documents pass through the group. In this example, $group uses $sum to compute the total count of likes per author.
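The running total that $sum maintains per group is equivalent to a reduce over the documents. A minimal JavaScript sketch of the same computation (illustrative only, not MongoDB code):

```javascript
const posts = [
  { author: "Jim", likes: 5 },
  { author: "Jim", likes: 2 },
  { author: "Joe", likes: 3 },
];

// Group by author and keep a running total of likes,
// like { $group: { _id: "$author", total_likes: { $sum: "$likes" } } }.
const totals = posts.reduce((acc, post) => {
  acc[post.author] = (acc[post.author] ?? 0) + post.likes;
  return acc;
}, {});

const result = Object.entries(totals).map(([_id, total_likes]) => ({
  _id,
  total_likes,
}));
// result: [{ _id: "Jim", total_likes: 7 }, { _id: "Joe", total_likes: 3 }]
```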

$unwind

db.posts.aggregate([
   { $unwind: "$tags" }
])

Result...

{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c652e"),
    "title" : "my first post",
    "author" : "Jim",
    "likes" : 5.0,
    "tags" : "angular"
},
{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c652e"),
    "title" : "my first post",
    "author" : "Jim",
    "likes" : 5.0,
    "tags" : "react"
},
{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c652e"),
    "title" : "my first post",
    "author" : "Jim",
    "likes" : 5.0,
    "tags" : "javascript"
},
{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c652f"),
    "title" : "my second post",
    "author" : "Jim",
    "likes" : 2.0,
    "tags" : "javascript"
},
{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c652f"),
    "title" : "my second post",
    "author" : "Jim",
    "likes" : 2.0,
    "tags" : "react"
},
{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c652f"),
    "title" : "my second post",
    "author" : "Jim",
    "likes" : 2.0,
    "tags" : "vue"
},
{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c6530"),
    "title" : "hello world",
    "author" : "Joe",
    "likes" : 3.0,
    "tags" : "vue"
},
{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c6530"),
    "title" : "hello world",
    "author" : "Joe",
    "likes" : 3.0,
    "tags" : "react"
}

$unwind deconstructs an array field, generating one output document per array element. The tags field on each post is an array of strings; $unwind emits a copy of the parent document for each tag, with tags replaced by that single value.
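Conceptually, $unwind behaves like a flatMap over the array field. A small JavaScript sketch of the idea (illustrative, not MongoDB internals):

```javascript
const posts = [
  { title: "hello world", author: "Joe", likes: 3, tags: ["vue", "react"] },
];

// For each element of tags, emit a copy of the document with tags
// replaced by that single element -- this mirrors what $unwind does.
const unwound = posts.flatMap((post) =>
  post.tags.map((tag) => ({ ...post, tags: tag }))
);
// unwound: [
//   { title: "hello world", author: "Joe", likes: 3, tags: "vue" },
//   { title: "hello world", author: "Joe", likes: 3, tags: "react" }
// ]
```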

Performance & Limitations

The aggregation pipeline either returns results in a cursor or stores them in a collection. Each document in the result set is subject to the usual 16-megabyte BSON document size limit. Additionally, each stage is limited to 100 megabytes of RAM; if a stage exceeds this limit, Mongo throws an error unless you set the allowDiskUse option, which lets stages spill temporary data to disk.
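Passing the option looks like this (a mongosh sketch reusing the earlier $group example):

```
db.posts.aggregate(
   [ { $group: { _id: "$author", total_likes: { $sum: "$likes" } } } ],
   { allowDiskUse: true }
)
```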

Conclusion

The aggregation pipeline is the preferred method for aggregating data in MongoDB. With built-in optimization, the pipeline makes it easy to move documents through a series of stages and compute results with native Mongo functionality.

Ready for more? Check out these advanced aggregation examples with $lookup.
