sqlsqliteindexing

How should I index for combinations of WHERE and ORDER BY clauses?


My function creates a query from dictionary input (fixed ordered keys):

input = {
    "A": ["string"],
    "B": [
          "1234",
          "4567"
    ],
    "C": ["string"],
    "D": ["string"]
}

Keys represent column names of an SQLite database table and value are the values to find. There is another column idx for primary key. I initially created indexes for all combinations of columns, but it was too much. Resulting queries:

SELECT * FROM result_table WHERE idx >= 1900 AND (B in (1234)) LIMIT 100
SELECT * FROM result_table WHERE idx >= 1900 AND (B in (1234,5678)) LIMIT 100
SELECT * FROM result_table WHERE idx >= 1900 AND (B in (1234) and D in ('somestring')) LIMIT 100
SELECT * FROM result_table WHERE idx >= 1900 AND (B in (1234,5678)) ORDER BY A ASC, ORDER BY D DESC LIMIT 100
SELECT * FROM result_table WHERE idx >= 1900 AND (B in (1234,5678) and C in ('somestring2')) ORDER BY A ASC, ORDER BY C DESC LIMIT 100

It's a mix of WHERE and ORDER BY clauses. I understood when I should create indexes but when I ORDER BY, SQLite does not use indexes anymore. The indexes I created:

(idx,) A, B, C, D
(idx,) A, C, D
(idx,) A, D
(idx,) B, C, D
(idx,) B, D
(idx,) C, D
(idx,) D

Which indexes should I create for queries with a combination of WHERE and ORDER BY clauses?


Solution

  • Queries 1 and 2 can use indexes B,C,D or B,D. Since idx is constrained by an inequality, it can be used only if it is immediately at the right of B in the index, not before. So B,idx would be a better choice, but not idx,B.

    Best index for query 3 would be B,D,idx (or D,B,idx). B,D should be the one chosen among the indexes you created.

    To execute query 4, sqlite can use the same index used in queries 1 or 2 to speed up the WHERE clause, or it could use index A,D to speed up the ORDER BY clause. Sqlite will make the choice based of what it thinks is the better plan and this choice will be made based on how many rows are in the table, how much selective is the WHERE BY clause (what percentage of rows will have a B value of 1234 or 5678) and how much selective is the LIMIT clause (are 100 rows a lot less than the total rows?). To correctly make these assumption, sqlite must have gathered statistics on the content of the table using ANALYZE.

    Similarly, query 5 will use index B,C,D (B,C,idx would be better) for the WHERE or index A,C,D for the ORDER BY

    Citing sqlite docs:

    SQLite uses a cost-based query planner. When there are two or more ways of solving the same query, SQLite tries to estimate the total amount of time needed to run the query using each plan, and then uses the plan with the lowest estimated cost. A cost is computed mostly from the estimated time, and so this case could go either way depending on the table size and what WHERE clause constraints were available, and so forth. But generally speaking, the indexed sort would probably be chosen, if for no other reason, because it does not need to accumulate the entire result set in temporary storage before sorting and thus uses much less temporary storage.

    As for your question, the best index is the one that can be used to satisfy both WHERE and ORDER BY clause.

    If there are much more than 100 records with a value of 1234 or 5678 for B, for query 4 the best one could be (A,D,B,idx). Otherwise, (B,A,D,idx) could be a better choice.

    For query 5 the best one could be (B,C,A,idx), (C,B,A,idx), (A,C,B,idx) or some other strange combination of these columns.

    My suggestion is to download latest sqlite.exe (the command line interface) and then use the .expert --sample 100 command followed by the SQL query on a separate line. The ".expert" command will propose indexes that might assist with those specific queries, were they present in the database (Reference: https://www.sqlite.org/cli.html#index_recommendations_sqlite_expert_).