Tags: python, firebase, google-cloud-firestore

What can speed up this Firestore batch commit?


Summary

I'm using Firestore query listeners on a web client to pass Python worker results back to the browser as a JSON string. The results include a lot of Plotly graphs and total roughly 3 MB. To respect the Firestore document size limit I chunk the JSON into ~300 KB documents in a collection, with an ascending key for reassembly on the client side (sketched below). The Python worker writes this collection as a single batch to limit client updates, but I found the batch commit time to be roughly the sum of the individual document write times. This persists even when all indexing is turned off for the collection, as suggested by the Firebase team. I've since moved to Firebase Storage, which seems a better fit, but I'd still like to know whether this is a fundamental limitation of Firestore or something that can be improved on.
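
For reference, a minimal sketch of the chunking scheme described above (the function name is illustrative, not from my actual code): split the JSON string into 300,000-character pieces and tag each one with an ascending chunk_num so the browser can reassemble them in order.

def chunk_json(json_str: str, chunk_size: int = 300_000) -> list[dict]:
    return [
        {"chunk_num": i, "json_chunk": json_str[start:start + chunk_size]}
        for i, start in enumerate(range(0, len(json_str), chunk_size))
    ]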

Expected Results

Once the Python Firestore client is instantiated, I would expect a batch commit to incur a fairly constant connection + latency overhead, and then the commit time to scale with two factors:

  • the size of the payload relative to the available network bandwidth, and
  • the server-side work to write (and index) the documents.

In this context, passing the ~3.6 MB of data between Heroku (the live environment, where I've observed the same times as in my local Jupyter experiment) and Google should take about 0.1 seconds, and with all indexing turned off I wouldn't expect the write itself to take much longer on the Firestore server side. So perhaps <0.5 s all-in.
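
The 0.1 s figure is only a back-of-envelope transfer estimate; the throughput assumed below is my guess, not a measurement:

# Rough transfer-time estimate; the ~36 MB/s (~290 Mbit/s) throughput is an assumption.
payload_mb = 3.6
assumed_throughput_mb_per_s = 36
print(f"transfer time ~ {payload_mb / assumed_throughput_mb_per_s:.2f} s")  # ~0.10 s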

What I've tried

Experimental setup:

Code

You'll need a Firebase Admin SDK environment to test this, but the building blocks are below:

Make a single-field exemption for all field indexes (*) in the Firestore console for the 'test_batch_multi' collection. Confirm by exporting the indexes with the Firebase CLI (firebase firestore:indexes); you should see this in the exported JSON:

"fieldOverrides": [
    {
      "collectionGroup": "test_batch_multi",
      "fieldPath": "*",
      "ttl": false,
      "indexes": []
    }
  ]

Python notebook (switch %%time to logging/timing if you're not in a notebook)

# Imports needed to run this standalone
import json
import os
import random
import string

import firebase_admin
from firebase_admin import firestore
from firebase_admin.credentials import Certificate
from google.cloud.firestore import Client as FirestoreClient  # alias assumed for the annotation below

# Instantiate Firebase objects per your own setup, for example:
fb_admin_dict = json.loads(os.getenv('FIREBASE_ADMIN_CRED'))
fb_app = firebase_admin.initialize_app(Certificate(fb_admin_dict))
fb_db: FirestoreClient = firestore.client(fb_app)

# The large payload is a long JSON string chunked into 300,000-character strings,
# so just make up a single string to test with.
# json_chunk = 'abc'  # use this to get a minimum-time baseline
json_chunk = ''.join(random.choices(string.ascii_letters + string.digits, k=300000))

single_coll = fb_db.collection('test_batch')
batch = fb_db.batch()
batch.set(single_coll.document(), {"chunk_num": 1, "json_chunk": json_chunk})

Isolate the commit:

%%time
batch.commit()

Then test the multi-document batch commit time:

multi_coll = fb_db.collection('test_batch_multi')
batch = fb_db.batch()
for i in range(12):
    batch.set(multi_coll.document(), {"chunk_num": i, "json_chunk": json_chunk})

Isolate the commit:

%%time
batch.commit()
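
For comparison with the batch, a sketch that writes the same 12 chunks as individual sequential document writes and times them, to check whether the batch commit really takes roughly their sum:

import time

start = time.perf_counter()
for i in range(12):
    multi_coll.document().set({"chunk_num": i, "json_chunk": json_chunk})
print(f"12 sequential single-doc writes: {time.perf_counter() - start:.2f} s")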

Solution

  • I found the batch commit time to be around the sum of individual doc write times

    That fits with what I'd expect. Using batch writes does not have any inherent performance benefits, and may in fact even be (slightly) slower than the underlying individual document writes. Did you see "What is the fastest way to write a lot of documents to Firestore?"
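
    If the cost really is the per-document round trips rather than the batch itself, one thing you can try from Python is issuing the individual writes concurrently; a rough sketch with a thread pool (max_workers=12 is an arbitrary choice here, not a recommendation):

    from concurrent.futures import ThreadPoolExecutor

    # Write the 12 chunks as individual documents, but issue the calls concurrently
    # so the network round trips overlap instead of running back to back.
    def write_chunk(i: int) -> None:
        multi_coll.document().set({"chunk_num": i, "json_chunk": json_chunk})

    with ThreadPoolExecutor(max_workers=12) as pool:
        list(pool.map(write_chunk, range(12)))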

    Outside of what I documented there, the fastest way I've found to write is with a BulkWriter object. I don't think that exists in the Python SDK, but if you can use Node.js, I have pretty much exclusively switched to using BulkWriter for bulk inserts.
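
    That said, recent versions of the google-cloud-firestore Python library (2.x) do appear to ship a BulkWriter as well, exposed via the client's bulk_writer() method; a minimal sketch, assuming your installed version has it:

    # Sketch using the Python client's BulkWriter; check that your installed
    # google-cloud-firestore version (2.x+) provides bulk_writer().
    bulk = fb_db.bulk_writer()
    for i in range(12):
        bulk.set(multi_coll.document(), {"chunk_num": i, "json_chunk": json_chunk})
    bulk.close()  # flush any buffered writes and wait for them to finish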