Tags: mongodb, mongoid

Mongoid aggregation: $group partially after $project, or remove duplicates in a projected output


Let's say I have tagged conversations, and conversations have messages which have created_at and read_at (nil while unread). For all conversations in a tag, I want their unread messages plus the last message, regardless of whether it's unread. I have this so far, but the last message is duplicated whenever it is itself unread. I'm looking to remove the duplicated message.


    match = { '$match': {
      'tag_ids': {
        '$in': [
          BSON::ObjectId('658db0d00f5b4f4471bdb3b6'), # inbox
        ],
      }
    } }

    unread_messages = { '$lookup': {
      'from':         'wco_email_message',
      'localField':   '_id',
      'foreignField': 'conversation_id',
      'pipeline': [
        { '$match': { 'read_at': nil } },
        { '$sort': { 'created_at': 1 } },
        { '$project': { 'created_at': 1 } },
      ],
      'as':           'unread_messages',
    } }

    last_message = { '$lookup': {
      'from':         'wco_email_message',
      'localField':   '_id',
      'foreignField': 'conversation_id',
      'pipeline': [
        { '$sort': { 'created_at': -1 } },
        { '$limit': 1 },
        { '$project': { 'created_at': 1 } },
      ],
      'as':           'last_message',
    } }

    project = { '$project': {
      "subject": 1,
      "all_messages": { '$concatArrays': [ '$unread_messages', '$last_message' ] },
    } }

    group = { '$group': {
      '_id':          '$subject',
      'all_messages': { '$addToSet': '$all_messages' },
    } }


    outs = WcoEmail::Conversation.collection.aggregate([
      match,
      unread_messages,
      last_message,
      project,
      # group,
    ]).to_a

Output:

[{"_id"=>BSON::ObjectId('65985e070f5b4f7476a2b1c8'),
  "subject"=>"Re: JavaScript Developer _ Austin, TX.",
  "all_messages"=>
   [{"_id"=>BSON::ObjectId('65985ba10f5b4f52505dbe1a'), "created_at"=>2024-01-05 19:42:25.238 UTC},
    {"_id"=>BSON::ObjectId('65985e070f5b4f7476a2b1c9'), "created_at"=>2024-01-05 19:52:39.506 UTC},
    {"_id"=>BSON::ObjectId('65985eda0f5b4f760e1f4265'), "created_at"=>2024-01-05 19:56:10.674 UTC},
    {"_id"=>BSON::ObjectId('65985eda0f5b4f760e1f4265'), "created_at"=>2024-01-05 19:56:10.674 UTC}]},
 {"_id"=>BSON::ObjectId('659f36550f5b4f114ea648ac'),
  "subject"=>"You've been chosen!",
  "all_messages"=>
   [{"_id"=>BSON::ObjectId('659f3bdc0f5b4f1a31928d7c'), "created_at"=>2024-01-11 00:52:44.109 UTC},
    {"_id"=>BSON::ObjectId('659f3bdc0f5b4f1a31928d7c'), "created_at"=>2024-01-11 00:52:44.109 UTC}]},
 {"_id"=>BSON::ObjectId('659f3c970f5b4f1b3ee43a78'),
  "subject"=>"Re: Wasya Co Inquiry",
  "all_messages"=>[{"_id"=>BSON::ObjectId('659f3c970f5b4f1b3ee43a79'), "created_at"=>2024-01-11 00:55:51.927 UTC}]},
 {"_id"=>BSON::ObjectId('65bb350b0f5b4f1e126bb4ee'),
  "subject"=>"a Deploy Script",
  "all_messages"=>
   [{"_id"=>BSON::ObjectId('65bb37ea0f5b4f347256f121'), "created_at"=>2024-02-01 06:19:22.991 UTC},
    {"_id"=>BSON::ObjectId('65bb37ea0f5b4f347256f121'), "created_at"=>2024-02-01 06:19:22.991 UTC}]},
 {"_id"=>BSON::ObjectId('65d524d8767ccd3c1cf87d2b'),
  "subject"=>"Resend of AWS Message",
  "all_messages"=>
   [{"_id"=>BSON::ObjectId('65d524d8767ccd3c1cf87d2c'), "created_at"=>2024-02-20 22:16:56.068 UTC},
    {"_id"=>BSON::ObjectId('65d5258a767ccd3c1cf87d2d'), "created_at"=>2024-02-20 22:19:54.454 UTC},
    {"_id"=>BSON::ObjectId('65d5258a767ccd3c1cf87d2d'), "created_at"=>2024-02-20 22:19:54.454 UTC}]}]

You can see from the timestamps that the last two messages are duplicates in every conversation except the one that has no unread messages.


Solution

  • Instead of using $concatArrays to combine the arrays, you can use $setUnion, which treats the arrays as sets and removes duplicate elements.

    You can use this mongoplayground to test; I have replicated the relevant parts of your sample (without the $group stage at the end, and with the extra field read_at so that you can tell read messages from unread ones).
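    In your pipeline, that amounts to a one-word change in the $project stage (a sketch, using the field names from your snippet):

    ```ruby
    # Same $project stage as before, but with $setUnion instead of
    # $concatArrays: when the last message is also unread, it appears
    # in both input arrays, and $setUnion keeps only one copy.
    project = { '$project': {
      'subject':      1,
      'all_messages': { '$setUnion': [ '$unread_messages', '$last_message' ] },
    } }
    ```

    One caveat: $setUnion does not guarantee the order of elements in the output array, so if you rely on the created_at ordering from the $lookup sub-pipelines, you may need to sort the array again afterwards (e.g. with $sortArray on MongoDB 5.2+).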