python, mongodb, orm, batch-processing, mongoengine

Most efficient way to bulk update Documents using MongoEngine


So, I have a Collection of documents (e.g. Person) structured in this way:

class Person(Document):
    name = StringField(max_length=200, required=True)
    nationality = StringField(max_length=200, required=True)
    earning = ListField(IntField())

When I save the document I only input the name and nationality fields, because that is the only information available at that point.

Then, every now and then, I want to update the earning of each person of a particular nationality. Let's imagine that there is some formula that allows me to compute the earning field (e.g. I query some magical API called EarningAPI that returns the earning of a person given its name).

To update them I would do something like:

japanese_people = Person.objects(Q(nationality='Japanese')).all()
for japanese_person in japanese_people:
    japanese_person.earning.append(EarningAPI(japanese_person.name))

Person.objects.insert(japanese_people, load_bulk=False) 

The EarningAPI can also work in batches, so that I can give it a list of names and it returns a list of earnings (one for each name). This method is far faster and less expensive.

Is the one-by-one way correct? What is the best way to take advantage of the batches?

Thanks


Solution

  • Using method from Mongoengine bulk update without objects.update():

    from pymongo import UpdateOne
    from mongoengine import Document, Q, StringField, ListField, IntField
    
    class Person(Document):
        name = StringField(max_length=200, required=True)
        nationality = StringField(max_length=200, required=True)
        earning = ListField(IntField())
    
    japanese_people = Person.objects(Q(nationality='Japanese')).all()
    
    japanese_ids = [person.id for person in japanese_people]
    earnings = EarningAPI(japanese_ids) 
    # I'm assuming it takes a list of id's as input and returns a list of earnings. 
    
    bulk_operations = [
        UpdateOne(
            {'_id': j_id},
            {'$set': {'earning': earn}},
            upsert=True
        )
        for j_id, earn in zip(japanese_ids, earnings)
    ]
    
    result = Person._get_collection().bulk_write(bulk_operations, ordered=False)
    

    I can't be certain whether this is faster than the one-by-one method, because I don't have access to your magic API to benchmark, but this is the way to do it in batch. One caveat: `$set` replaces the whole `earning` list; if you want to append a new value, as your loop does, use the `$push` operator instead of `$set`.
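To actually exploit the batch endpoint, you can chunk the names so you make one API call per batch instead of one per person. Here is a minimal sketch; `EarningAPI` is replaced by a dummy stub since the real API isn't available, and `chunked` / `fetch_earnings` are illustrative helper names, not part of MongoEngine:

```python
# Hypothetical stand-in for the real EarningAPI batch endpoint,
# which is assumed to take a list of names and return a list of earnings:
def EarningAPI(names):
    return [len(name) for name in names]  # dummy earnings for the sketch

def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def fetch_earnings(names, batch_size=100):
    """One API call per batch of names, flattened into a single list."""
    earnings = []
    for batch in chunked(names, batch_size):
        earnings.extend(EarningAPI(batch))  # one call per batch
    return earnings
```

The flattened result lines up index-for-index with the input names, so it can be zipped with `japanese_ids` exactly as in the `bulk_operations` comprehension above.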