Tags: mongodb, aggregation-framework, pymongo

MongoDB aggregation pipeline to avoid multiple queries


I have post documents like this:

post = {
  "_id": "uid1",
  "userID": "user1",
  "likeNum": 30,
}

My current pipeline looks like this, where user_ids is an array of user IDs to avoid and seen_ids is an array of post IDs to avoid:

fake_pipeline = [
    {'$match': {'userID': {'$nin': user_ids}, '_id': {'$nin': seen_ids}}},
    {'$group': {'_id': '$userID', 'posts': {'$push': '$likeNum'}}},
    {'$project': {'posts': {'$slice': [{'$sortArray': {'input': '$posts', 'sortBy': {'likeNum': -1}}}, 2]}}},
    {'$limit': 10}
]

I would like the aggregation to return a list of 10 posts ordered by likeNum descending, but with at most 2 posts per userID.

Example:

post_list_in_db = [
  {"_id": "uid1", "userID": "user1", "likeNum": 29},
  {"_id": "uid2", "userID": "user1", "likeNum": 2},
  {"_id": "uid3", "userID": "user1", "likeNum": 13},
  {"_id": "uid4", "userID": "user2", "likeNum": 21},
  {"_id": "uid5", "userID": "user2", "likeNum": 19},
  {"_id": "uid6", "userID": "user3", "likeNum": 1},
  {"_id": "uid7", "userID": "user3", "likeNum": 8},
  {"_id": "uid8", "userID": "user3", "likeNum": 14},
  {"_id": "uid9", "userID": "user3", "likeNum": 4},
  {"_id": "uid10", "userID": "user4", "likeNum": 20},
  {"_id": "uid11", "userID": "user4", "likeNum": 9},
  {"_id": "uid12", "userID": "user4", "likeNum": 11},
]

The expected output is:

[
  {"_id": "uid1", "userID": "user1", "likeNum": 29},
  {"_id": "uid4", "userID": "user2", "likeNum": 21},
  {"_id": "uid10", "userID": "user4", "likeNum": 20},
  {"_id": "uid5", "userID": "user2", "likeNum": 19},
  {"_id": "uid8", "userID": "user3", "likeNum": 14},
  {"_id": "uid3", "userID": "user1", "likeNum": 13},
  {"_id": "uid12", "userID": "user4", "likeNum": 11},
  {"_id": "uid7", "userID": "user3", "likeNum": 8},
]

Note: for performance reasons, I would like to keep the aggregation itself as small as possible. There is no need to format the data exactly like the example output unless doing so has no impact on performance; otherwise I prefer to reorder / transform the objects myself in code, as long as the aggregation returns the correct list of posts (not necessarily formatted or ordered), as in the sketch below.
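
For example, reordering and trimming in code would be a one-liner (a sketch; raw_posts is a placeholder for whatever list of post dicts the aggregation returns):

# Sort the aggregation output by likeNum descending and keep the top 10.
# raw_posts is a placeholder name, not part of the pipeline.
top_posts = sorted(raw_posts, key=lambda p: p["likeNum"], reverse=True)[:10]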


Solution

You're quite close to what you need.

1. Sort on likeNum right after the $match stage.

2. Then, when you $group on userID, use the $firstN aggregation accumulator to keep just the first 2 posts per user. These will be the 2 highest-likeNum posts per userID, since the documents were sorted before this stage. Also, accumulate "$$ROOT" so you get the actual post documents rather than just the likeNum.

3. Then $unwind the accumulated "$posts", promote them back to top-level documents with $replaceWith, and sort again by likeNum.

(The added $sort stages, the $firstN accumulator with $$ROOT, and the $unwind + $replaceWith are the differences from your pipeline.)

db.post.aggregate([
  { "$match": { "userID": { "$nin": user_ids }, "_id": { "$nin": seen_ids } } },
  { "$sort": { "likeNum": -1 } },
  {
    "$group": {
      "_id": "$userID",
      "posts": { "$firstN": { "input": "$$ROOT", "n": 2 } }
    }
  },
  { "$unwind": "$posts" },
  { "$replaceWith": "$posts" },
  { "$sort": { "likeNum": -1 } },
  { "$limit": 10 }
])
    

Mongo Playground. (No changes to the first $match stage; I've just put ["user3"] and ["uid9"] as the values in the playground example.)
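
Since the question uses PyMongo, the same pipeline can also be written as a Python list of stage dicts. This is a sketch: the database/collection names ("mydb", "post") and the example exclusion lists are assumptions, and the $firstN accumulator requires MongoDB 5.2 or newer.

from pymongo import MongoClient

client = MongoClient()                 # assumes a reachable MongoDB instance
posts_coll = client["mydb"]["post"]    # database / collection names are assumptions

user_ids = ["user3"]   # users to exclude (example values)
seen_ids = ["uid9"]    # posts to exclude (example values)

pipeline = [
    {"$match": {"userID": {"$nin": user_ids}, "_id": {"$nin": seen_ids}}},
    {"$sort": {"likeNum": -1}},
    {"$group": {
        "_id": "$userID",
        # the input is already sorted, so the first 2 docs per user
        # are that user's 2 highest-likeNum posts
        "posts": {"$firstN": {"input": "$$ROOT", "n": 2}},
    }},
    {"$unwind": "$posts"},
    {"$replaceWith": "$posts"},
    {"$sort": {"likeNum": -1}},
    {"$limit": 10},
]

results = list(posts_coll.aggregate(pipeline))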

    Wrt "i prefer reorder / transform objects myself in the code" - The first sort cannot/should not be done in code, since it's for all posts after the match. The second $sort could be done in code but it's likely to be slower than letting MongoDB sort and then select just 10 results. Otherwise, you need to fetch all results and then sort & limit yourself. And if you plan to have pagination, the skip + limit step have to be done by MongoDB after the sort is also done by MongoDB.