mongodbself-join

Cooccurrence in Mongodb


I am familiar with relational database such as MySQL. Recently I'm working on project that use MongoDb to stock 100,000 Documents. The structure of these Docs is as such:

    {
      "_id" : "1",
      "abstract" : "In this book, we present ....",
      "doi" : "xxxx/xxxxxx",
      "authors" : [ 
                   "Davis, a", 
                   "louis, X", 
                   "CUI, Li", 
                   "FANG, Y"
                    ]
      }

I would like to extract the cooccurrence matrix or cooccurrence value of all possible combination of authors (only pairs). the expected output is:

{ [auth1 , auth2 : 5 ] [auth1, auth3 : 1] [auth2, auth8 : 9]....}

which means that auth1 and auth2 collaborate 5 times (in 5 books) auth2 and auth8 collaborate 9 times...

In relation database, possible solution could be: in a table auth-book for example:

    INSERT INTO auth-book (id_book, id_auth) VALUES
              (1, 'auth1'),
              (1, 'auth2'),
              (1, 'auth3'),
              (2, 'auth1'),
              (2, 'auth5'),
              (2, 'auth87'),
              (2, 'auth2')...

the query that calculate cooccurrence or collaboration of authors is:

     SELECT   a.id_auth a, b.id_auth b, COUNT(*) cnt
     FROM     auth-book a JOIN auth-booke b ON b.id_book= a.id_book AND   .id_auth > a.id_auth
     GROUP BY a.id_auth, b.id_auth

output: [auth1 auth2 => 2][auth1 auth3 => 1][auth2 auth3 => 1]... etc.

I have no idea how to implement such a query in mongodb


Solution

  • This is a wee bit scary but it seems to work. Basically, we $lookup against ourselves and leverage the fact that lookup of a string into an array of strings is an implicit $in-like operation.

     db.foo.aggregate([
     // Create single deduped list of authors:
     {$unwind: "$authors"}
     ,{$group: {_id:null, uu: {$addToSet: "$authors"} }}
    
     // Now unwind and go lookup into the authors array in the same collection.  If $uu
     // appears anywhere in the authors array, it is a match.
     ,{$unwind: "$uu"}
     ,{$lookup: { from: "foo", localField: "uu", foreignField: "authors", as: "X"}}
    
     // OK, lots of material in the pipe here.  Brace for a double unwind -- but
     // in this data domain it is probably OK because not every book has the same
     // set of authors, which will attenuate the result $lookup above.
     ,{$unwind: "$X"}
     ,{$unwind: "$X.authors"}
    
     // The data is starting to be clear here.  But we see two things: 
     // self-matching (author1->author1) and pairs of inverses
     // (author1->author2 / author2->author1).  We don't need self matches and we
     // only need "one side" of the inverses.   So... let's do a strcasecmp
     // on them.  0 means the same so that's out.  So pick 1 or -1 to get one side
     // of the pair; doesn't really matter!  This is why the OP has b.id_auth > a.id_auth
     // in the SQL equivalent: to eliminate self match and one side of inverses.
     ,{$addFields: {q: {$strcasecmp: ["$uu", "$X.authors"] }} }
     ,{$match: {q: 1}}  
    
     // And now group!
     ,{$group: {_id: {a: "$uu", b: "$X.authors"}, n: {$sum:1}}}
                ]);