sqlalgorithmcachingquery-cache

Implementation of QueryCache


Aside from doing a direct match on something like a whitespace normalized hash of a query, what might be a useful (but-not-necessarily-perfect) way to handle query cache in a partial manner? For example, let's take the following basic case:

SELECT
    Product,  # VARCHAR
    Revenue   # DOUBLE
FROM
    Sales
WHERE
    Country='US'

This potentially could be used as a 'base-cache' upon which a further query could be executed to potentially improve performance:

SELECT
    Product,  # VARCHAR
    Revenue   # DOUBLE
FROM
    Sales
WHERE
    Country='US' AND State='CA'

So, assuming the data in the from table(s) don't change, the following might serve as a starting point for determining cache:

However, this becomes quite tricky when we think about something like the following case:

SELECT
    ProductGroup AS Product,  # Would produce a Product:VARCHAR hash
    Revenue   
FROM
    Sales
WHERE
    Country='US'

What might be a realistic starting point for how a partial- query cache could be implemented.


Use case: writing SQL to query data in a non-DBMS-managed source, such as a CSV file which will take ~20s or so to issue any query and we cannot create indexes on the file. https://en.wikipedia.org/wiki/SQL/MED or Spark-like.


Solution

  • I think the following might be a good starting place for a basic cache implementation that allows the usage of a cache that can be further queried for refinements:

    1. Start by substituting any udf's or cte's. The query itself needs to be self-contained.
    2. Normalize whitespaces and capitalization.
    3. Hash the entire query. This will be our starting place.
    4. Remove the select fields and hash the rest of the query. Now store a hash of all the individual items in the select list.
    5. For partial cache, generate a hash minus select fields, where, sort, and limit+offset. Hash the where's list (separated by AND), making sure no filter is contained in the cache that is not contained in the current query, the orderby, seeing if the data needs to be re-sorted, and the limit+offset number, making sure the limit+offset in the initial query is null or greater than the current query.

    Here would be an example of how the data might look saved:

    Hash 673c0185c6a580d51266e78608e8e9b2
    HashMinusFields 41257d239fb19ec0ccf34c36eba1948e
    HashOfFields [dc99e4006c8a77025c0407c1fdebeed3, …]
    HashMinusFieldsWhereOrderLimit d50961b6ca0afe05120a0196a93726f5
    HashOfWheres [0519669bae709d2efdc4dc8db2d171aa, ...]
    HashOfOrder 81961d1ff6063ed9d7515a3cefb0c2a5
    LimitOffset null

    Now let's try a few examples, I will use human-readable hashes for easier readability:

    SELECT Name, Age FROM Sales WHERE id=2
    -- fullHash:   selectname,agefromsaleswhereid=2
    -- selectless: fromsaleswhereid=2
    -- hashoffields: [name, age]
    -- minusfieldswhereorderlimit: null
    -- hashofwheres:  [id=2, ]
    -- hashororder: null
    -- limitoffset: null
    
    -- query1
    select age FROM sales where id=2
    -- selectless: fromsaleswhereid=2
    -- fields: [age] OK, all fields contained in initial fields
    
    -- query2
    select age FROM sales where id=2 and country='us' order by id limit 100 
    -- minusfieldswhereorderlimit: null
    -- hashofwheres: [id=2, country=us] OK initial query does not contain any additional filters
    -- limitoffset: 100 OK initial limitoffset is null (infinity)
    -- hashorder: orderbyid
    --> Can grab partial cache, need to apply one filter and re-sort/limit:
        --> SELECT * FROM <cache> WHERE country='us' order by id limit 100
    

    Does the above seem like a valid initial implementation?