I've been fiddling with jOOQ and its MULTISET feature, but was a bit disappointed by the performance of the queries. I assume I'm doing something suboptimal, or perhaps the JSON deserialisation is more expensive than expected.
Given the popular Sakila DB and these two queries:
val result: Result<Record3<CustomerId, Int, Int>> = dslContext
    .select(
        CUSTOMER.CUSTOMER_ID,
        PAYMENT.PAYMENT_ID,
        RENTAL.RENTAL_ID,
    )
    .from(CUSTOMER)
    .join(PAYMENT).on(PAYMENT.CUSTOMER_ID.eq(CUSTOMER.CUSTOMER_ID))
    .join(RENTAL).on(RENTAL.CUSTOMER_ID.eq(CUSTOMER.CUSTOMER_ID))
    .orderBy(CUSTOMER.EMAIL)
    .fetch()
and
val result: Result<Record3<CustomerId, Result<Record1<Int>>, Result<Record1<Int>>>> = dslContext
    .select(
        CUSTOMER.CUSTOMER_ID,
        multiset(
            DSL.select(PAYMENT.PAYMENT_ID)
                .from(PAYMENT)
                .where(PAYMENT.CUSTOMER_ID.eq(CUSTOMER.CUSTOMER_ID))
        ),
        multiset(
            DSL.select(RENTAL.RENTAL_ID)
                .from(RENTAL)
                .where(RENTAL.CUSTOMER_ID.eq(CUSTOMER.CUSTOMER_ID))
        ),
    )
    .from(CUSTOMER)
    .fetch()
Measuring the number of rows and the time it takes leads to the following:
445483 records via JOIN in 466.389250ms
599 records via MULTISET in 747.627541ms
Which is rather disappointing... even if I select more data from the CUSTOMER table, resulting in more duplicate data transfer in the JOIN case, it doesn't change much in MULTISET's favor.
Am I missing something?
As illustrated in this blog post, MULTISET and the underlying JSON aggregation techniques are well suited for small nested collections, not large ones, for which there are better-optimised approaches.
Your example query doesn't really represent a real-world use-case. You're fetching all the data of the parent table as well as all the data of the child tables, which is hardly ever done this way, and if it is, you probably don't have this nesting use-case at all, but are just flat-exporting your data into a CSV or similar. In real-world use-cases, MULTISET can often outperform equivalent JOIN queries, but you'll have to consider execution plans (as always with SQL, not just with MULTISET!). Why not measure a real-world use-case instead? E.g. 5 rows in the parent table, and the top 10 rows per child.
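Such a "few parents, top N children" measurement could be sketched in jOOQ's Kotlin DSL along these lines (the limits and the `PAYMENT_DATE` ordering column are illustrative choices for the Sakila schema, not taken from your question, and this fragment needs a live `dslContext` and generated schema classes to run):

```kotlin
val result = dslContext
    .select(
        CUSTOMER.CUSTOMER_ID,
        multiset(
            DSL.select(PAYMENT.PAYMENT_ID)
                .from(PAYMENT)
                .where(PAYMENT.CUSTOMER_ID.eq(CUSTOMER.CUSTOMER_ID))
                .orderBy(PAYMENT.PAYMENT_DATE.desc())
                .limit(10)   // top 10 payments per customer
        ),
    )
    .from(CUSTOMER)
    .orderBy(CUSTOMER.CUSTOMER_ID)
    .limit(5)                // only 5 parent rows
    .fetch()
```

With a bounded result like this, the duplication cost of the JOIN and the nesting cost of the MULTISET are at least compared on a workload that resembles what an application would actually fetch.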
If you had done further measurements and profiling, you would have seen that the overhead of JSON serialisation / deserialisation, while certainly present, is still marginal compared to the effect of the inferior execution plan. Most RDBMS still produce nested loops of sorts when collecting data from subqueries into JSON documents, whereas joins can profit from hash joins or merge joins, which have better algorithmic complexity. Just as nested loop joins are inferior to hash joins for large data sets, nesting collections can be inferior as well.
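The complexity difference can be illustrated outside the database with plain Kotlin (a sketch, not jOOQ code; all names here are made up): collecting children per parent with a scan per parent is O(N × M), while building a hash table of children first is O(N + M), which is essentially what a hash join buys you.

```kotlin
data class Payment(val customerId: Int, val paymentId: Int)

// Nested-loop style: for each customer, scan the whole payments list.
// This is O(customers × payments), like a correlated subquery per row.
fun collectNestedLoop(customers: List<Int>, payments: List<Payment>): Map<Int, List<Int>> =
    customers.associateWith { c ->
        payments.filter { it.customerId == c }.map { it.paymentId }
    }

// Hash-join style: bucket all payments by customer once, then look up.
// This is O(customers + payments).
fun collectHashed(customers: List<Int>, payments: List<Payment>): Map<Int, List<Int>> {
    val byCustomer = payments.groupBy({ it.customerId }, { it.paymentId })
    return customers.associateWith { byCustomer[it] ?: emptyList() }
}
```

Both functions produce the same nested result; only the work done to produce it differs, which is exactly the situation with a MULTISET subquery plan versus a hash join plan.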
As always with SQL, don't try to establish set-in-stone rules about which approach is better than the other. Instead, try to understand why things are the way they are, and frequently verify your assumptions, as optimisers tend to get better, and there's no reason for RDBMS not to eventually find ways to transform nested collection loops like these into some sort of hashed collection. jOOQ recommends benchmarking approaches like these for measuring various equivalent alternative queries.
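A minimal shape for such a benchmark harness might look like this (an illustrative sketch, not jOOQ's own benchmarking utility): run each query variant several times and report the median, so a one-off JIT or cache effect doesn't dominate a single measurement.

```kotlin
// Runs the given block `repetitions` times and returns the median
// duration in nanoseconds, printing a human-readable summary.
fun benchmark(label: String, repetitions: Int = 5, block: () -> Unit): Long {
    val timings = (1..repetitions).map {
        val start = System.nanoTime()
        block()
        System.nanoTime() - start
    }.sorted()
    val median = timings[timings.size / 2]
    println("$label: ${median / 1_000_000.0} ms (median of $repetitions runs)")
    return median
}
```

You would then call `benchmark("JOIN") { joinQuery.fetch() }` and `benchmark("MULTISET") { multisetQuery.fetch() }` against the same warmed-up connection, and also compare the execution plans, not just the wall-clock times.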