Tags: rethinkdb, rethinkdb-python, reql

RethinkDB update field inside nested array->obj


Here is what my row looks like:

{"id": x ,

"data": [
  {
  "someId": 1 ,
  "url": https://twitter.com/HillaryClinton?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor, »
  } ,

  {
  "someId": 2 ,
  "url": http://nymag.com/daily/intelligencer/2016/05/hillary-clinton-candidacy.html, »
  } ,
]}

I created a secondary index on data.url, so retrieving the document is easy, but how do I most efficiently update just that specific, nested object?

I might be adding new keys to it or just updating existing ones (newField, anotherField in the example below).

The end result should look like this:

  "data": [
  {
  "someId": 1 ,
  "url": https://twitter.com/HillaryClinton?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor,

  "newField": 5,
  "anotherField": 12
  } 
...

Edit: I made it work like this (Python):

a = r.db("lovli").table("KeyO").get_all(
    "https://www.hillaryclinton.com/", index="url"
).update(
    lambda doc: {
        "data": doc["data"].map(
            lambda singleData: r.branch(
                singleData["url"] == "https://www.hillaryclinton.com/",
                singleData.merge({"status_tweet": 3, "pda": 1}),
                singleData
            )
        )
    }
).run(conn)

Can this be improved? Also, I am going to be updating a lot of URLs at the same time... Any way to further improve performance by doing this in bulk?


Solution

  • Just as a general piece of advice: if you plan on doing many operations on embedded documents inside an array, you might want to unwind and flatten your data model.

    It's both awkward and difficult to update a single document in an array. Here are the steps in one possible approach (a rough sketch follows the list):

    1. Get the value of the original embedded document you want to edit
    2. Find the position (index) of that embedded document in the array
    3. Modify the embedded document
    4. Replace the original embedded document in the array with the modified version
    5. Update the whole outer document.
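
    For illustration, here is a rough ReQL sketch of those steps in Python. This is only a sketch, reusing the table, index, and field names from the question and assuming `r` and `conn` are set up as in the question; `offsets_of` and `change_at` are the array operations doing the work:

    # Sketch of the "find position, modify, replace" approach.
    target_url = "https://www.hillaryclinton.com/"

    r.db("lovli").table("KeyO").get_all(target_url, index="url").update(
        lambda doc: {
            "data": doc["data"].change_at(
                # steps 1-2: locate the matching embedded document and its position
                doc["data"].offsets_of(lambda d: d["url"] == target_url)[0],
                # steps 3-4: modify it and put the modified version back in place
                doc["data"].filter(lambda d: d["url"] == target_url)[0].merge(
                    {"newField": 5, "anotherField": 12}
                )
            )
        }  # step 5: update() writes the whole outer document back
    ).run(conn)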

    Or, what you've done is:

    1. Narrow down documents to a certain domain, in this case "hillaryclinton.com"
    2. Map over the N tweets in the data array, check whether each one matches, and update it if it does.

    In this case, in the worst scenario you're going to touch M matching documents times N embedded documents each. You might want to check out this other answer I gave to someone who had a similar design question. But I think the following would result in superior performance.

    If you instead stored your data like this:

    {
      "secondary_id": x,
      "data": {
        "someId": 1,
        "url": "https://twitter.com/HillaryClinton?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor"
      }
    },
    {
      "secondary_id": x,
      "data": {
        "someId": 2,
        "url": "http://nymag.com/daily/intelligencer/2016/05/hillary-clinton-candidacy.html"
      }
    }
    

    Then you can create an index on data.url (or a compound index combining whatever your url index is based on with the value of data.url), possibly reducing your operations significantly.
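
    As a rough sketch of what that could look like with the Python driver (the index names "url" and "sid_url" are illustrative assumptions, and `r`/`conn` come from the question's setup):

    # Secondary index on the nested url (each flattened row holds one entry)
    r.db("lovli").table("KeyO").index_create(
        "url", lambda doc: doc["data"]["url"]
    ).run(conn)

    # Optional compound index: the grouping key plus the nested url
    r.db("lovli").table("KeyO").index_create(
        "sid_url", lambda doc: [doc["secondary_id"], doc["data"]["url"]]
    ).run(conn)

    r.db("lovli").table("KeyO").index_wait().run(conn)

    # Updating a url now touches only the rows for that url; update() merges
    # the new fields into the nested "data" object, with no per-array map needed.
    r.db("lovli").table("KeyO").get_all(
        "https://www.hillaryclinton.com/", index="url"
    ).update({"data": {"status_tweet": 3, "pda": 1}}).run(conn)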

    Without knowing more about what you will access frequently in your dataset, it's hard to give more pointers, but I think this should work better. When you want to recreate the original data model you have, it would look like this:

    r.db("lovli").table("Key0").get_all( SEARCH_URL, index="url").group("secondary_id")
    

    And you would get something like this back. x is used as an example here:

    {
      "group": x,
      "reduction": [
        {
          "secondary_id": x,
          "data": {
            "someId": 1,
            "url": "https://twitter.com/HillaryClinton?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor"
          }
        },
        {
          "secondary_id": x,
          "data": {
            "someId": 2,
            "url": "http://nymag.com/daily/intelligencer/2016/05/hillary-clinton-candidacy.html"
          }
        }
      ]
    }