sqlsql-serverquery-optimizationset-theory

Is it possible to rewrite any SQL query that uses temp tables, as a set-based query?


Often, when working with complicated queries in SQL, I use one or more temp tables, to break the problem down into smaller pieces: I would create a new temp table or (table variable), persist some data to the temp table, use that temp table to create another temp table, and so forth. Until finally, I select from the last temp table as my output. The reason for this, is that with a very complex problem, a single set-based operation to get the final answer, is not always obvious.

I know that, generally speaking, temp tables are considered a sub-optimal solution, and can often be optimized by rather writing a purely set-based query (JOINs, WHERE clause, subquery etc.)

Can you re-write any SQL query that uses temp tables, as a purely set-based query?


Solution

  • From the theoretical standpoint you can always write a query that is based in intermediate results, as a single query using CTEs (Common Table Expressions).

    For example, if you have three "real tables" X, Y, and Z, and you have 4 temporary tables to hold intermediate results, the process could look like:

    create table a as select ... from x join y join z;
    create table b as select ... from a join x; -- dependency on a
    create table c as select ... from y;
    create table d as select ... from a, b, c; -- dependency on a, b, and c
    select from a join b join c join d; -- final query
    

    The assembled query using CTES, can look like:

    with
    a as (select ... from x join y join z),
    b as (select ... from a join x),
    c as (select ... from y),
    d as (select ... from a, b, c)
    select from a join b join c join d;
    

    From the practical standpoint, having materialized intermediate tables does have some benefits. For one, you can inspect and debug more easily. Moreover, sometimes it can lead to increased performance if you index those materialized results appropriately.

    The downside is that you are consuming [a lot of] tablespace. Also, when combined as CTEs, some engines are smart enough to "untangle" and fully reorganize (the technical term is "rephrasing the query") the set of queries in different ways that may be faster to execute, that may have not even though about; some engines are really smart in this sense. They can also pipeline intermediate result sets, in a way that you cannot do if you materialize them.

    Overall, run your horses.

    If you have bandwidth, try both and see which one is better for you. Sometimes the ability to debug intermediate result is paramount. Other times, performance is the driving factor. All depends on your use case.