Let's say I have a few million documents in a MongoDB collection that look like this:
[
{
"timestamp": "2024-01-01T00:00:00Z",
"schema": "1.0.0",
"value": 3,
},
{
"timestamp": "2024-01-01T01:00:00Z",
"schema": "1.2.0",
"value": -10,
},
...
]
Now, I want to do the following, using an aggregation pipeline:
value
(keeping the original sign), plus adding the original document's timestamp
and schema
.So, the desired output is something like:
[
// January bucket
{
"bucket": "2024-01-01T00:00:00Z",
"value": {
"timestamp": "2024-01-01T01:00:00Z",
"schema": "1.2.0",
"absMax": -10
}
}
]
Obviously, the default $max
accumulator does not work, as it has two problems:
timestamp
and schema
, but only outputs the numeric valueSo, I tackled the problem trying two different ways:
$group
stage, I use $push
to push all raw documents into a $raw
document, which I then go through with $reduce
. I need the $raw
document to have timestamp
and schema
always available.$group
stage, I use a custom accumulator function (see https://www.mongodb.com/docs/manual/reference/operator/aggregation/accumulator/) that reduces each document and keeps the original timestamp
and schema
next to the absolute maximum in its state.Now, I encounter the following issues:
I updated my question with MongoDB playgrounds:
$push
): https://mongoplayground.net/p/01e5Oa58VfVAnd for the sake of completeness using only $min
and $max
, but losing timestamp
and schema
: https://mongoplayground.net/p/UegNExWo2np
Solution 2 is about twice as fast as solution 1 on big data sets.
Am I overlooking something?
You can group easier with $dateTrunc. For min and max value you can use $bottom or $first
db.collection.aggregate([
{ $set: { absValue: { $abs: "$value" } } },
{
$group: {
_id: {
$dateTrunc: {
date: "$timestamp",
unit: "month",
timezone: "Europe/Zurich"
}
},
maxValue: {
$bottom: {
sortBy: { absValue: 1 },
output: {
timestamp: "$timestamp",
schema: "$schema",
value: "$value"
}
}
}
}
}
])
For the minValue
use either $top
or change the sort order to sortBy: { absValue: -1 }
If you like to use $first
and $last
then you need to sort the entire collection, i.e.
db.collection.aggregate([
{ $set: { absValue: { $abs: "$value" } } },
{ $sort: { absValue: 1 } },
{
"$group": {
"_id": {
$dateTrunc: {
date: "$timestamp",
unit: "month",
timezone: "Europe/Zurich"
}
},
value: {
$last: {
timestamp: "$timestamp",
schema: "$schema",
value: "$value"
}
}
}
}
])
But I think this will be slower.
The rest is just some cosmetic.