MongoDB Aggregation Framework


The MongoDB aggregation framework provides a means to calculate aggregate values without having to use Map-Reduce. Map-Reduce is very powerful, but it is also harder than necessary for simpler things such as totaling or averaging field values. For those familiar with SQL, the aggregation framework can be used to do the kind of thing that SQL does with group-by and distinct, as well as some simple forms of self-joins.

The aggregation framework also provides projection facilities that can be used to reshape data. This includes the ability to add computed fields, to create new virtual sub-objects, and to extract sub-fields and bring them to the top-level of results.
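For example, here is a minimal sketch of a $project stage that does all three of these things; the output field names (doctoredPageViews, stats, otherFoo) are illustrative, and the input fields match the sample article documents shown later in this page:

{ $project : {
    title : 1 ,
    doctoredPageViews : { $add : [ "$pageViews", 10 ] } ,  // computed field
    stats : { views : "$pageViews" } ,                     // new virtual sub-object
    otherFoo : "$other.foo"                                // sub-field brought to the top level
}}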

An introductory presentation from MongoSV 2011 is available here.

Using the Aggregation Framework

The aggregation framework relies on two key concepts: pipelines and expressions.

Pipelines

A pipeline is a sequence of operations that is applied to a stream of documents. For those familiar with Linux command-line shells, this is very similar to a pipe. In a Linux shell, a pipeline is a series of programs, each of which processes a stream of characters. The MongoDB aggregation pipeline is a series of pipeline operators, each of which processes a stream of documents. Logically, the pipeline behaves as if a collection is being scanned, and each document found is passed into the top of the pipeline. Each operator in the pipeline can transform each document as it passes through, until the documents emerge at the end of the pipeline. Pipeline operators need not produce one output document for every input document: operators may also generate new documents, or filter out documents so that they do not go any further in the pipeline.
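As a rough sketch of the analogy (field names here assume the sample article documents shown later in this page), a three-stage aggregation pipeline plays the same role as a shell pipeline such as grep | sort | head:

[
    { $match : { tags : "fun" } },    // keep only matching documents (like grep)
    { $sort : { pageViews : -1 } },   // order the document stream (like sort)
    { $limit : 5 }                    // pass only the first five documents on (like head)
]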

Pipeline Operators

Each entry in the list below links to a detailed description of that operator.

  • $project - select columns or sub-columns, create computed values or sub-objects
  • $match - filter documents out from the document stream
  • $limit - limit the number of documents that pass through the document stream
  • $skip - skip over a number of documents that pass through the document stream
  • $unwind - unwind an array, substituting each value in the array for the array within the same document
  • $group - group documents by key and calculate aggregate values for the group
  • $sort - sort documents by key
  • [$out] - save documents to a collection and pass them on like a tee

Pipeline operators appear in an array. Documents pass through these operators and come out at the other end.

Expressions

Expressions are used to calculate values. In keeping with MongoDB's JSON heritage, expressions are defined in a prefix format using JSON.

Expressions are usually stateless, and are just evaluated when they are seen. These can do things such as adding the values of two fields together, or extracting the year from a date.
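A sketch of both of those stateless expressions inside a $project stage; the output names (totalViews, postedYear) are illustrative, and the input fields assume the sample article documents shown later:

{ $project : {
    totalViews : { $add : [ "$pageViews", "$other.foo" ] },  // add the values of two fields
    postedYear : { $year : "$posted" }                       // extract the year from a date
}}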

It is also possible to use accumulator expressions, which retain state. These types of expressions are used in the $group operator to maintain counts, totals, maxima, etc., as documents go through the pipeline.
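A sketch of accumulator expressions in a $group stage, again assuming the sample article documents shown later (the output field names are illustrative):

{ $group : {
    _id : "$author",
    docCount : { $sum : 1 },              // maintain a count per author
    totalViews : { $sum : "$pageViews" }, // maintain a running total
    maxViews : { $max : "$pageViews" }    // maintain a maximum
}}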

There are many examples in the operator documentation found above. For the complete list, see the Aggregation Framework - Expression Reference.

Invocation

Aggregation is invoked as a command with two operands:

  • aggregate - provide the name of the collection to use at the head of the pipeline
  • pipeline - an array of pipeline operators, each with its own operands; see the examples in the pipeline operator references above

As a command, invocation of the aggregation pipeline is the same in all drivers. Use your host programming language to build a document with the fields above, and then submit it as a command.

Here are some examples of pipelines that can be issued from the mongo shell. For the examples that follow, imagine an article collection made up of documents that look like this:

{
    title : "this is my title" ,
    author : "bob" ,
    posted : new Date() ,
    pageViews : 5 ,
    tags : [ "fun" , "good" , "fun" ] ,
    comments : [
        { author : "joe" , text : "this is cool" } ,
        { author : "sam" , text : "this is bad" }
    ],
    other : { foo : 5 }
}

The following example pivots data to create a set of author names grouped by tags applied to an article:

var g5 = db.runCommand({ aggregate : "article", pipeline : [
    { $project : {
        author : 1,
        tags : 1
    }},
    { $unwind : "$tags" },
    { $group : {
        _id : { tags : "$tags" },
        authors : { $addToSet : "$author" }
    }}
]});
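Run against the sample document above, each result document would be shaped roughly like { _id : { tags : "fun" }, authors : [ "bob" ] }: one document per distinct tag, holding the set of authors who used it. Note that $addToSet removes the duplicate "fun" entry produced by $unwind.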

Result Format

The result of a successful aggregation command is a document with two fields:

  • result - an array of documents that came out of the pipeline
  • ok - a field with the value 1, indicating success, or another value if there was an error

As a document, the result is subject to the current BSON Document size limit; see Maximum Document Size. If you expect a large result, use the $out pipeline operator to write it to a collection.
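A sketch of such a pipeline, assuming $out is available in your build; the collection name "tag_counts" is illustrative:

db.runCommand({ aggregate : "article", pipeline : [
    { $unwind : "$tags" },
    { $group : { _id : "$tags", count : { $sum : 1 } } },
    { $out : "tag_counts" }  // write the results to a collection instead of returning them inline
]});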

Optimizing Performance

Early Filtering

Logically, operation proceeds as if a collection scan is done to feed documents into the pipeline. For some pipelines, this may not be optimal.

If your aggregation operation does not require all of the data in a collection, you are likely to use a $match to filter out items you do not want to include. The aggregation framework recognizes matches, and will attempt to find a suitable index to use to access the matching elements of the collection. In the simplest case, where a $match appears first in the pipeline, the pipeline will be fed with the result of a query (see Querying and Advanced Queries).

In order to take advantage of this, before execution, an optimization phase will try to re-arrange the pipeline so that any $match operators are moved as far towards the beginning as possible. As a simple example, if a pipeline begins with a $project that just renames fields, followed by a $match, the $match can be pushed in front of the $project by renaming the fields appropriately.
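As a sketch, assuming an index exists on the author field, placing the $match first lets the pipeline be fed by an indexed query rather than a full collection scan:

db.runCommand({ aggregate : "article", pipeline : [
    { $match : { author : "bob" } },  // filter first, ideally using an index
    { $group : { _id : "$author", totalViews : { $sum : "$pageViews" } } }
]});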

Over time, we expect to apply more of these kinds of optimizations, but in the initial release, put $match operators at the start of your pipeline whenever possible.

Memory for Cumulative Operators

Certain pipeline operators need to see their entire input set before they can produce any output. For example, $sort must see all of its input before producing its first output document. The current implementation does not go to disk in such cases, and all of the input must fit in memory to be sorted.

$group has similar characteristics, and must also see all of its input before anything can be produced. However, this usually doesn't require as much memory as sorting, because only one record needs to be kept for each unique key in the grouping specification.

The current implementation will log a warning if a cumulative operator consumes 5% or more of the physical memory on the host. Cumulative operators will signal an error if they consume 10% or more of the physical memory on the host.

Sharded Operation

The aggregation framework can be used on sharded collections.

When the source collection is sharded, the aggregation pipeline will be split into two parts. All of the operators up to and including the first $group or $sort are pushed to each shard. (If an early $match can exclude shards through the use of the shard key in the predicate, then these operators are only pushed to the relevant shards.) A second pipeline, consisting of the first $group or $sort and any remaining pipeline operators, is executed in mongos, using the results received from the shards.

For $sort, the results are merged. For $group, any "sub-totals" are brought in and combined; in some cases these may be structures. For example, the $avg expression maintains a total and count for each shard; these are combined in mongos and then divided.
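As an illustration of that arithmetic (the shard values here are invented):

// Each shard contributes a partial { total, count } for $avg;
// mongos combines the sub-totals and then divides.
var shardA = { total : 10, count : 2 };
var shardB = { total : 20, count : 3 };
var combinedAvg = (shardA.total + shardB.total) / (shardA.count + shardB.count);  // (10 + 20) / (2 + 3) = 6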

Original article: http://www.mongodb.org/display/DOCS/Aggregation+Framework
