I'm using RethinkDB (2.3.5~0trusty) and python (2.7.6) and python-rethinkdb bindings (2.3.0.post6), and trying to query a pre-existing database. (ie please assume I can't change the schema that I'm working with)
This is a simplified example of the problem I have.
Names of tables/etc have been moved into a domain people are more used to, but the problem I have is structurally equivalent (and more compact).
Borrowers
Schema: (borrower_id:key (primary), name :string, favourite_authors : list of author_id )
Sample data:
{ "borrower_id": "91a15585-f084-41b3-9df1-1a3b16a8daed",
"name": "Jo",
"favourite_authors" : [ "b9503702-8832-43c8-a3f0-34691635419a",
"3bae9a66-2de6-4c64-ae95-c5f7caad86bb",
]
}
{ "borrower_id": "23a8a193-c32e-4332-a40b-2ba56d158205",
"name": "Bob",
"favourite_authors" : [ "41305d3b-2819-4af5-be62-3c7999c4d747",
"d270f08d-aab1-4644-8dea-8f4fdd2d80b4"
]
}
{ "borrower_id": "01031fb0-35de-4324-af47-611fec9ca7ad",
"name": "Sam",
"favourite_authors" : [ "b9503702-8832-43c8-a3f0-34691635419a"
]
}
Authors
Schema: (author_id:key (primary), name :string, books_written : list of book_id )
Sample data:
{ "author_id": "b9503702-8832-43c8-a3f0-34691635419a",
"name": "Joanna Smith",
"books_written" : [ "c1a48e2e-a831-4f5b-95b2-9b429dcf34e5",
"8f0e89b6-78e8-45ec-b7db-9cf3e00e0a8d",
]
}
{ "author_id": "3bae9a66-2de6-4c64-ae95-c5f7caad86bb",
"name": "John Smith",
"books_written" : [ "8f0e89b6-78e8-45ec-b7db-9cf3e00e0a8d",
"b9fb4de0-e3bd-4df1-b192-c9a0ae7fb2e1",
]
}
{ "author_id": "41305d3b-2819-4af5-be62-3c7999c4d747",
"name": "Jo Smith",
"books_written" : [ "b9fb4de0-e3bd-4df1-b192-c9a0ae7fb2e1",
"37b6eb03-e8ea-43dc-b3e4-ffc0bbfb1154",
]
}
{ "author_id": "d270f08d-aab1-4644-8dea-8f4fdd2d80b4",
"name": "Jim Smith",
"books_written" : [ "8f0e89b6-78e8-45ec-b7db-9cf3e00e0a8d",
"37b6eb03-e8ea-43dc-b3e4-ffc0bbfb1154",
]
}
Books
Schema: (book_id:key (primary), name:string, book_info: object, may contain a data
dict, that has a list of
repeatable metadata options...)
Sample data:
{ "book_id": "c1a48e2e-a831-4f5b-95b2-9b429dcf34e5",
"name": "",
"book_info" : {
"data" : [
{ "tag": "sf },
{ "period" : "past"}
]
}
}
{ "book_id": "8f0e89b6-78e8-45ec-b7db-9cf3e00e0a8d",
"name": "",
"book_info" : {
"data" : [
{ "tag": "romance },
{ "period" : "present"}
]
}
}
{ "book_id": "89b68f0e-78e8-45ec-b7db-9cf3e00e0a8d",
"name": "",
"book_info" : {
"data" : [
{ "period" : "present"}
]
}
}
{ "book_id": "b9fb4de0-e3bd-4df1-b192-c9a0ae7fb2e1",
"name": "",
"book_info" : {
"data" : [
{ "tag": "sf },
{ "tag": "romance},
{ "period" : "present"}
]
}
}
...
{ "book_id": "37b6eb03-e8ea-43dc-b3e4-ffc0bbfb1154",
"name": "",
"book_info" : {
"data" : [
{ "tag": "sf },
{ "period" : "future"}
]
}
}
Now, the common query I want to perform is equivalent to this:
Note: Not all books have a tags option in the data section of book info...
I'm trying to figure out a combination of joins and filters using RethinkDB's ReQL query interface - which I'm assured can do this - but I can't see an obvious way of ding it.
My starting point was to looking at RethinkDB's various join options, but I can't see any obvious way of performing a join using an attribute that contains a list of foreign keys rather than just an atomic key. (I would normally either put the foreign keys on the field that gets repeated or have a relation table, but as I say, I can't change the structure I've got)
I'd prefer a python oriented solution, but javascript (or any other lang) would be handy, since I can then convert :-)
Any suggestions welcome.
This code (Javascript) will do what:
r.db("test").table("Borrowers").filter(function(borrower){
return borrower("favourite_authors").setIntersection(r.db("test").table("Authors").filter(function(author){
return author("books_written").setIntersection(r.db("test").table("Books").filter(function(book){
return book("book_info")("data").contains({"tag": "sf"}); })("book_id").coerceTo("array")).isEmpty().not();})("author_id").coerceTo("array")).isEmpty().not();})
But is already very slow, on a db containing just the sample data (35-70ms on my server)
It is basically a conjuction of 3 subqueries:
1:
r.db("test").table("Books").filter(function(book){
return book("book_info")("data").contains({"tag": "sf"}); })("book_id").coerceTo("array")
This is the most inner one. It retrieves an array containing the id's of all books tagged as sf. This array is put into the following subquery:
r.db("test").table("Authors").filter(function(author){
return author("books_written").setIntersection(<book ids go here>).isEmpty().not();})("author_id").coerceTo("array")
Which retrieves an array of all author id's that participated in one or more of the given books. It filters by non emptiness of the intersection of the authors books and the array of sf books. (If the intersection is non empty, at least one book of the author is tagged as sf)
r.db("test").table("Borrowers").filter(function(borrower){
return borrower("favourite_authors").setIntersection().isEmpty().not();})
The last part is based on the same principle as the second one and finally returns the borrowers who favour an author who has written a sf book.