I am familiar with relational database such as MySQL. Recently I'm working on project that use MongoDb to stock 100,000 Documents. The structure of these Docs is as such:
{
"_id" : "1",
"abstract" : "In this book, we present ....",
"doi" : "xxxx/xxxxxx",
"authors" : [
"Davis, a",
"louis, X",
"CUI, Li",
"FANG, Y"
]
}
I would like to extract the cooccurrence matrix or cooccurrence value of all possible combination of authors (only pairs). the expected output is:
{ [auth1 , auth2 : 5 ] [auth1, auth3 : 1] [auth2, auth8 : 9]....}
which means that auth1 and auth2 collaborate 5 times (in 5 books) auth2 and auth8 collaborate 9 times...
In relation database, possible solution could be: in a table auth-book for example:
INSERT INTO auth-book (id_book, id_auth) VALUES
(1, 'auth1'),
(1, 'auth2'),
(1, 'auth3'),
(2, 'auth1'),
(2, 'auth5'),
(2, 'auth87'),
(2, 'auth2')...
the query that calculate cooccurrence or collaboration of authors is:
SELECT a.id_auth a, b.id_auth b, COUNT(*) cnt
FROM auth-book a JOIN auth-booke b ON b.id_book= a.id_book AND .id_auth > a.id_auth
GROUP BY a.id_auth, b.id_auth
output: [auth1 auth2 => 2][auth1 auth3 => 1][auth2 auth3 => 1]... etc.
I have no idea how to implement such a query in mongodb
This is a wee bit scary but it seems to work. Basically, we $lookup
against ourselves and leverage the fact that lookup of a string into an array of strings is an implicit $in
-like operation.
db.foo.aggregate([
// Create single deduped list of authors:
{$unwind: "$authors"}
,{$group: {_id:null, uu: {$addToSet: "$authors"} }}
// Now unwind and go lookup into the authors array in the same collection. If $uu
// appears anywhere in the authors array, it is a match.
,{$unwind: "$uu"}
,{$lookup: { from: "foo", localField: "uu", foreignField: "authors", as: "X"}}
// OK, lots of material in the pipe here. Brace for a double unwind -- but
// in this data domain it is probably OK because not every book has the same
// set of authors, which will attenuate the result $lookup above.
,{$unwind: "$X"}
,{$unwind: "$X.authors"}
// The data is starting to be clear here. But we see two things:
// self-matching (author1->author1) and pairs of inverses
// (author1->author2 / author2->author1). We don't need self matches and we
// only need "one side" of the inverses. So... let's do a strcasecmp
// on them. 0 means the same so that's out. So pick 1 or -1 to get one side
// of the pair; doesn't really matter! This is why the OP has b.id_auth > a.id_auth
// in the SQL equivalent: to eliminate self match and one side of inverses.
,{$addFields: {q: {$strcasecmp: ["$uu", "$X.authors"] }} }
,{$match: {q: 1}}
// And now group!
,{$group: {_id: {a: "$uu", b: "$X.authors"}, n: {$sum:1}}}
]);