[SOLVED] How to compose affinities with sql queries in apache ignite 2.x

How to compose affinities with sql queries in apache ignite 2.x

I am trying to implement affinity for sql queries with Apache Ignite. Let's imagine I have the following classes (the code is available at https://github.com/hostettler/distibuted-queries)

//I do not put here all the configuration, but it is available in my small test if interested
class EntityAKey {
   @AffinityKeyMapped
   String id;
}

class EntityBKey {
   String id;
   @AffinityKeyMapped
   String entityAId;
}

...
String joinSql = "select count(1) from \"EntityA\".EntityA as a inner join \"EntityB\".EntityB as b on a.id = b.entityA ";
SqlFieldsQuery q1 = new SqlFieldsQuery(joinSql);
q1.setDistributedJoins(false);
FieldsQueryCursor<List<?>> c = cacheA.query(q1);
List<List<?>> l = c.getAll();

In this case (A <--- B), everything is fine I can even instruct the sql engine to work per group of partitions and I can set setDistributedJoins(false) to instruct the engine that it does not have to look at other partitions.

Now, that works as long as ALL the joins of the query are relying on exactly the SAME affinity. So assuming I am adding the classes C and D such as (C<---D) and A has a relationship to C such as

class EntityAKey {
   @AffinityKeyMapped
   String id;
}
class EntityA {
   String id;
   String entityC; //Id of entityC but THIS IS NOT PART OF THE KEY 
}

class EntityBKey {
   String id;
   @AffinityKeyMapped
   String entityAId;
}

class EntityCKey {
   @AffinityKeyMapped
   String id;
}

class EntityDKey {
   String id;
   @AffinityKeyMapped
   String entityCId;
}

String joinSql = "select count(1) from \"EntityA\".EntityA as a " 
+ " inner join \"EntityB\".EntityB as b on a.id = b.entityA" 
+ " inner join \"EntityC\".EntityC as c on a.entityC = c.id"
+ " inner join \"EntityD\".EntityD as d on d.entityC = a.entityC";
SqlFieldsQuery q1 = new SqlFieldsQuery(joinSql);
q1.setDistributedJoins(false); //only works if setDistributedJoins(true)
q1.setPageSize(100);
FieldsQueryCursor<List<?>> c = cacheA.query(q1);
List<List<?>> l = c.getAll();

In that case the result incorrect (as mentionned in the documentation) simply because we have two different affinities -- on A(id) and on C(id) -- thus the join between A and B works fine and same between C and D but obviously not the whole query.

All that long introduction for the actual questions:

I can come up with the following solutions but I do not like any of them. I would the advise of the community about them:

a) set setDistributedJoins(true) works just fine but my understanding is that in that case the query is not done on multiple threads (within a node) as it cannot rely on affinity to prune partitions. Is that understanding correct?

b) split the query in two different queries one from A->B and C->D with setDistributedJoins(false) and then reconcile the 2 queries with a third query with setDistributed(true). This is not practical for real life queries having 10-15 joins on potentially 3-4 different type of affinities

c) Changing the model so that there is the affinity everywhere by artificially having say entityA id in entityC. This is not practical because it means that cache.get will become much more cumbersome as I do not know the entityA id

Would you agree that the Apache Ignite optimizer should do that pruning for me and should apply affinity where possible?
Will that be solved by the adoption of Calcite?

Thanks a lot for your help.

Solution

Using distributed joins is the symptom, not the cause. The underlying cause is that your data is not co-located, so it needs to shuffle data over the network. For an in-memory database, having to hit the network is a Bad Thing
It does, but it can't "apply affinity" if there is none. Your join between entity A and entity C is not colocated
No, because this is about the location of the data. The Calcite engine in Ignite is still in beta, so I'm not sure you can draw any conclusions about the performance. It should allow better performance in the future

In short, you can:

Co-locate your data
Use distributed joins
Rewrite your query. It may be possible to use a subselect or CTE to avoid the "shuffle" phase of a distributed join