python-2.7, google-app-engine, google-cloud-datastore, app-engine-ndb, google-app-engine-python

TransactionFailedError (too much contention...) when reading (cross-group) entities from datastore


I’m once again investigating unexpected occurrences of TransactionFailedError (too much contention on these datastore entities...) in cases where the code only reads the entity groups that are blamed for the contention.

Setup

GAE standard environment, Python 2.7 with NDB (SDK 1.9.51). I managed to observe the error in an isolated app (only me as user) where the same request handler is executed in a task queue and read/write access to the entity-groups mentioned below is only done by this handler.

The handler is executed a few times per second and basically is a migration / copy task to move existing OriginChild entities out of a huge group into individual groups as new Target entities. It is one task per OriginChild entity.

Inside a cross group transactional function, ndb.transaction(lambda: main_activity(), xg=True), each request handler...:

So, these are read-only entities in the transaction:

The code doesn't make any changes: these entities are neither deleted nor written back to the datastore. Moreover, no other requests are running that try to write into these entity groups (there have been no write ops at all in these groups for months).

The only entity that is put to the datastore is Key(Target, Foo) where the ID is unique per request.

Errors

Approximately 60-70% of the requests run without errors.

When the TransactionFailedError occurs, it is raised inside the transactional function. The logging shows something like this:

suspended generator get(context.py:758) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "e~my-test-app" name_space: "test" path < Element { type: "OriginGroup" id: 1 } > )

In ~80% of the failed requests, the error relates to Key(OriginGroup, 1) (although the entire group is only read).

In ~10% of the failed requests the error will show Key(TargetConfig, 1) (read-only, too).

In the remaining ~10% it blames the new entity, e.g. Key(Target, Foo), or whichever TargetChild ID the request is migrating. This seems to happen only during the put(), not during the preceding get() attempt.

Theories

My understanding of transactions and entity groups is that NDB uses optimistic concurrency control, so a massive number of reads from the same entity group is possible (hence the scalability). For technical reasons, only transactional writes are limited, to ~1 write op per entity group per second and no more than 25 entity groups per transaction.
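To make the 25-group limit concrete, here is a small sketch that counts the entity groups one migration task touches. It does not use NDB: keys are modelled as flat tuples, and the names `entity_group` and `groups_touched` are hypothetical. It also assumes that OriginChild entities are children of Key(OriginGroup, 1), and the child ID 42 is made up for illustration.

```python
MAX_GROUPS_PER_XG_TXN = 25  # limit as stated in the text above


def entity_group(key_path):
    """Return the root (kind, id) pair of a flat key path.

    An entity belongs to the group of its root ancestor, so only the
    first (kind, id) pair of the path matters here.
    """
    return tuple(key_path[:2])


def groups_touched(key_paths):
    """Return the set of distinct entity groups for the given key paths."""
    return {entity_group(path) for path in key_paths}


# Keys used by one migration task (the child ID 42 is an assumption).
paths = [
    ("OriginGroup", 1),
    ("OriginGroup", 1, "OriginChild", 42),
    ("TargetConfig", 1),
    ("Target", "Foo"),
]

touched = groups_touched(paths)
within_limit = len(touched) <= MAX_GROUPS_PER_XG_TXN
```

Under these assumptions a single task touches three groups: OriginGroup 1, TargetConfig 1, and the new Target, so the 25-group limit is not what is being hit.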

However, my observations suggest that reading ops can also cause too much contention errors. But this idea also baffles me, because it would make GAE with Datastore much less scalable if you are aiming for strong consistency. So maybe there is something else going on here.

I have found this comment on SO, which supports the idea that reads can cause contention errors:

"Note: The first read of an entity group in an XG transaction may throw a TransactionFailedError exception if there is a conflict with other transactions accessing that same entity group. This means that even an XG transaction that performs only reads can fail with a concurrency exception."

Source: Contention problems in Google App Engine

I was able to find the quote in the new docs, now under Superseded Storage Solutions > DB Client Library for Cloud Datastore > Overview.

Questions

Is the quoted statement still true for NDB (or only for DB and/or for version conflicts)?

If it is true: What pattern would be recommended to avoid the contention error with transactional reads across entity groups?


Solution

  • In a transaction where there is at least one write, in this case Key(Target, Foo), Cloud Datastore will write no-op markers to the entity groups that are read but not modified. This is to ensure serializability.

    Since Key(OriginGroup, 1) is read in every transaction and you are running XG transactions faster than 1 per second over an extended period, this is the source of your contention.

    One alternative to consider is a batching strategy that writes 23 Target entities (like Key(Target, Foo)) at a time rather than one. Key(OriginGroup, 1) and Key(TargetConfig, 1) take the other 2 entity-group slots.
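The batching idea could be sketched roughly as follows. This is a minimal sketch under assumptions, not the answerer's code: `chunked`, `migrate_in_batches`, `migrate_batch` and `run_in_transaction` are hypothetical names. The transaction runner is passed in so the sketch runs without the SDK. On App Engine it would presumably be something like `lambda fn: ndb.transaction(fn, xg=True)`.

```python
MAX_GROUPS_PER_XG_TXN = 25  # limit quoted in the question
SHARED_READ_GROUPS = 2      # OriginGroup and TargetConfig
BATCH_SIZE = MAX_GROUPS_PER_XG_TXN - SHARED_READ_GROUPS  # 23 Target groups


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def migrate_in_batches(child_ids, migrate_batch, run_in_transaction):
    """Run one cross-group transaction per batch of up to 23 children.

    migrate_batch(batch) is a hypothetical function that reads the shared
    entities and puts one Target entity per ID in `batch`.
    run_in_transaction(fn) is a hypothetical wrapper that executes `fn`
    inside a cross-group transaction.
    Returns the number of transactions that were started.
    """
    transactions = 0
    for batch in chunked(child_ids, BATCH_SIZE):
        # Bind the batch as a default argument so each callback keeps its own list.
        run_in_transaction(lambda b=batch: migrate_batch(b))
        transactions += 1
    return transactions
```

For example, 50 child IDs would be migrated in three transactions (23, 23 and 4 entities) instead of 50. Each transaction still touches Key(OriginGroup, 1) once, so for the same throughput the rate of operations on that group drops by up to a factor of 23.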