Aside from doing a direct match on something like a whitespace normalized hash of a query, what might be a useful (but-not-necessarily-perfect) way to handle query cache in a partial manner? For example, let's take the following basic case:
SELECT
Product, # VARCHAR
Revenue # DOUBLE
FROM
Sales
WHERE
Country='US'
This potentially could be used as a 'base-cache' upon which a further query could be executed to potentially improve performance:
SELECT
Product, # VARCHAR
Revenue # DOUBLE
FROM
Sales
WHERE
Country='US' AND State='CA'
So, assuming the data in the from
table(s) don't change, the following might serve as a starting point for determining cache:
However, this becomes quite tricky when we think about something like the following case:
SELECT
ProductGroup AS Product, # Would produce a Product:VARCHAR hash
Revenue
FROM
Sales
WHERE
Country='US'
What might be a realistic starting point for how a partial- query cache could be implemented.
Use case: writing SQL to query data in a non-DBMS-managed source, such as a CSV file which will take ~20s or so to issue any query and we cannot create indexes on the file. https://en.wikipedia.org/wiki/SQL/MED or Spark-like.
I think the following might be a good starting place for a basic cache implementation that allows the usage of a cache that can be further queried for refinements:
Here would be an example of how the data might look saved:
Hash | 673c0185c6a580d51266e78608e8e9b2 |
---|---|
HashMinusFields | 41257d239fb19ec0ccf34c36eba1948e |
HashOfFields | [dc99e4006c8a77025c0407c1fdebeed3, …] |
HashMinusFieldsWhereOrderLimit | d50961b6ca0afe05120a0196a93726f5 |
HashOfWheres | [0519669bae709d2efdc4dc8db2d171aa, ...] |
HashOfOrder | 81961d1ff6063ed9d7515a3cefb0c2a5 |
LimitOffset | null |
Now let's try a few examples, I will use human-readable hashes for easier readability:
SELECT Name, Age FROM Sales WHERE id=2
-- fullHash: selectname,agefromsaleswhereid=2
-- selectless: fromsaleswhereid=2
-- hashoffields: [name, age]
-- minusfieldswhereorderlimit: null
-- hashofwheres: [id=2, ]
-- hashororder: null
-- limitoffset: null
-- query1
select age FROM sales where id=2
-- selectless: fromsaleswhereid=2
-- fields: [age] OK, all fields contained in initial fields
-- query2
select age FROM sales where id=2 and country='us' order by id limit 100
-- minusfieldswhereorderlimit: null
-- hashofwheres: [id=2, country=us] OK initial query does not contain any additional filters
-- limitoffset: 100 OK initial limitoffset is null (infinity)
-- hashorder: orderbyid
--> Can grab partial cache, need to apply one filter and re-sort/limit:
--> SELECT * FROM <cache> WHERE country='us' order by id limit 100
Does the above seem like a valid initial implementation?