I have a parsed representation of an SQL query in the form of an abstract syntax tree. I'd like to convert it into an intermediate logical tree representation, which can in turn be expanded into candidate physical operation trees by the optimizer.
Following various resources discussing the topic (CMU lecture, Database Management System, Dayal's 1987 paper on nested subqueries), I managed to translate most of the query into an intermediate algebraic form (e.g. π_a(R⋈S)
) but I get stuck at translating common table expressions and lateral joins.
SELECT *
FROM
(SELECT 1 id) A,
(
(SELECT 2) B
JOIN LATERAL (SELECT A.id) C ON TRUE
) D;
NAT. JOIN
├── A
└── LATERAL JOIN (D)
├── B
└── C <- should be able to reference A
Can these constraints be expressed in terms of algebraic operations or do they typically need bespoke data structures? For example, separate subtrees for CTEs instead of a single tree for the whole logical plan - but then, how can the optimizer optimize the query globally instead of each subtree individually?
More generally, are there better intermediate representations than algebraic trees for the purpose of query optimization?
While algebraic trees are quite common as intermediate representations, they may not capture all the complexities of modern SQL queries. Using more flexible representations such as DAGs, augmenting trees with annotations, and maintaining logical/physical separation can help address the challenges posed e.g. by CTEs and lateral joins.
For instance:
At to inlining CTEs vs. using subtrees: inlining CTEs into the algebraic tree can lead to redundancy and inefficiency, as you've rightly pointed out. However, in some cases, inlining might be the simplest approach for optimization. For more complex queries or for optimization across multiple queries, maintaining a separate representation for CTEs might be beneficial. This could involve storing the CTEs separately and referencing them as needed in the main query plan.
For further reading on the topic and in addition to the sources you already mentioned, I would recommend "System R: Relational Approach to Database Management" by Chamberlin, D.D., et al. which discusses the System R project and its internals, "Volcano: An Extensible and Parallel Query Evaluation System" by Graefe, G., as well as Postgres' parser and optimiser implementations.