I recently tried to execute a query in Apache Calcite using three CSV files as tables
This is the query which is executed:
EXPLAIN PLAN FOR SELECT COUNT(*) as NUM
FROM TTLA_ONE A
INNER JOIN TTLR_ONE B1 ON A.X = B1.X
INNER JOIN TTLR_ONE B2 ON B2.X = B1.X
INNER JOIN EMPTY_T C1 ON C1.X = B2.Y
INNER JOIN EMPTY_T C2 ON C2.X = C2.X
The result of the query is always zero because we are joining with an empty table. The plan obtained is:
EnumerableAggregate(group=[{}], NUM=[COUNT()])
EnumerableJoin(condition=[=($1, $4)], joinType=[inner])
EnumerableJoin(condition=[=($0, $1)], joinType=[inner])
EnumerableInterpreter
BindableTableScan(table=[[STYPES, TTLA_ONE]])
EnumerableCalc(expr#0..1=[{inputs}], X=[$t0])
EnumerableInterpreter
BindableTableScan(table=[[STYPES, TTLR_ONE]])
EnumerableJoin(condition=[=($1, $3)], joinType=[inner])
EnumerableJoin(condition=[true], joinType=[inner])
EnumerableCalc(expr#0=[{inputs}], expr#1=[IS NOT NULL($t0)], X=[$t0], $condition=[$t1])
EnumerableInterpreter
BindableTableScan(table=[[STYPES, EMPTY_T]])
EnumerableInterpreter
BindableTableScan(table=[[STYPES, EMPTY_T]])
EnumerableInterpreter
BindableTableScan(table=[[STYPES, TTLR_ONE]])
It is possible to note that the empty tables are used in the plan at the very end.
I add an example on this test code.
I dig more into the code and I switched on the log to debug and I have seen that all the tables rows are estimated as 100, but this is not true.
Below, It is possible to find the plan estimation with the logs set in debug mode:
EnumerableJoin(condition=[=($1, $4)], joinType=[inner]): rowcount = 3.0375E7, cumulative cost = {3.075002214917643E7 rows, 950.0 cpu, 0.0 io}, id = 26284
EnumerableJoin(condition=[=($0, $1)], joinType=[inner]): rowcount = 1500.0, cumulative cost = {2260.517018598809 rows, 400.0 cpu, 0.0 io}, id = 26267
EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26260
BindableTableScan(table=[[STYPES, TTLA_ONE]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7789
EnumerableCalc(expr#0..1=[{inputs}], X=[$t0]): rowcount = 100.0, cumulative cost = {150.0 rows, 350.0 cpu, 0.0 io}, id = 26290
EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26263
BindableTableScan(table=[[STYPES, TTLR_ONE]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7791
EnumerableJoin(condition=[=($1, $3)], joinType=[inner]): rowcount = 135000.0, cumulative cost = {226790.8015771949 rows, 550.0 cpu, 0.0 io}, id = 26282
EnumerableJoin(condition=[true], joinType=[inner]): rowcount = 9000.0, cumulative cost = {9695.982870329724 rows, 500.0 cpu, 0.0 io}, id = 26277
EnumerableCalc(expr#0=[{inputs}], expr#1=[IS NOT NULL($t0)], X=[$t0], $condition=[$t1]): rowcount = 90.0, cumulative cost = {140.0 rows, 450.0 cpu, 0.0 io}, id = 26288
EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26270
BindableTableScan(table=[[STYPES, EMPTY_T]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7787
EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26275
BindableTableScan(table=[[STYPES, EMPTY_T]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7787
EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26280
BindableTableScan(table=[[STYPES, TTLR_ONE]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7791
We can definitely see that for each table the estimation is always 100 rowcount = 100.0
.
The query is executed correctly but the plan is not optimised. Does anyone know why the table statistics are not correctly evaluated?
The problem is that in the class CsvTable
it is necessary to override the getStatistic
property method, by doing something like this:
private Statistic statistic;
// todo: assign statistics
@Override
public Statistic getStatistic() {
return statistic;
}
maybe pass these statistics from the constructor or inject some object which generates them.
At the moment it returns just Statistics.UNKNOWN
which is in the superclass implementation AbstractTable`.
Of course without statistics, the estimated cost for the plan is not correct.