I try to understand the impact of multi-partitioned transactions in VoltDB 9.x. I know it is designed for single-partioned transactions, but I want to know what it will cost me if I can't avoid it. In summary, my question is whether it is still the case that multi-partitioned transactions in VoltDB always lock the entire cluster and how are the different kinds of multi-partitioned transactions are related to each other regarding to their execution behaviour?
From H-Store-FAQ:
[...] this allows H-Store to support additional optimizations, such as speculative execution and arbitrary multi-partition transactions. For example, in VoltDB every transaction is either single-partition or all-partition. That is, any transaction that needs to touch multiple partitions will cause the VoltDB’s transaction coordinator to lock all partitions in the cluster, even if the transaction only needs to touch data at two partitions. [...] It is likely VoltDB will support these features in the future [...]
The papers The VoltDB Main Memory DBMS and How VoltDB does Transactions claim that it exists at least one split of multi-partitioned transactions in VoltDB: One-Shot-Reads and General-2PC-Transactions.
In the class MpTransactionTaskQueue there is a distinction, whether a transaction will be routed to the multi-partitioned site (count 1) or a pool of read-only sites (default count up to 20) of the MPI and they can't be executed interleaved.
So these are my sub questions:
To make things easier for developers, procedures are called the same way regardless of what type they are. Internally there are different types of multi-partition procedures as they provide some optimizations, although there is more to be done and some H-Store projects have done research in these areas.
MP transactions still ultimately involve sending tasks to be done on all the partitions. The one exception you noticed is a special two-partition transaction that is only used in rebalancing data during elastic add or shrink.
Partitions consist of one or more sites (on separate servers) depending on kfactor. These sites stay in sync without a 2PC by requiring deterministic procedures. The partitions work through the backlog in a queue as fast as the process time (or local execution time) allows. All sites handle both reads and writes.
MP tasks sent to those partition queues have to wait on all the pending items to finish. That is why there is a pool of 20 (by default) threads for MP reads. This allows 20 tasks to be sent out at once, so that the next MP read usually doesn't have to wait for 2 networks hops + the max queue wait time + processing time before it can even get queued.
MP reads that are not "single-shot" would be Java procedures with multiple voltExecuteSQL() calls, such as a procedure where subsequent SQL queries depend on the results of prior queries. When these transactions send tasks to the partitions, the partitions have to wait for the max queue wait time + processing time + 2 network hops before they can do the next part of the transaction.
MP writes can also have multiple voltExecuteSQL() calls, plus they have to wait for a final commit signal, so this all delays the progress on the partitions.
There are certainly examples of MP transactions that shouldn't need to involve all of the partitions and could benefit from future optimizations, but it's not as easy as it may seem on a database that has to support durability to disk, k-safety, elastic add and shrink, multi-cluster active-active replication, and many of the other features that have been added to VoltDB over the years since it grew out of the H-Store project.
Disclosure: I work at VoltDB