sqlsqlitefull-text-searchfts4

Why does `ON (fts.text MATCH 'word' AND fts.id = item.id)` and `...WHERE fts.text MATCH 'word'` have the same query plan?


I have the typical small sqlite database with 3 tables, one for items (which are manga, and have ids), another for tags (|id|name|), and another with the associations between them (|tag_id| manga_id|) So now I need a way to search for items by title and get them with the tags. Kinda like this:

| title     | author          | tags                    |
|-----------+-----------------+-------------------------|
| Mushishi  | Shuichi Shigeno | supernatural, fantasy   |
| Initial D | Yuki Urushibara | racing, sports, deja vu |

So I also decided to use sqlite built in fts virtual table. Everything it contains are manga titles and their ids.

I actually managed to come up with a query for this, but I'm wary of it:

SELECT manga.title, GROUP_CONCAT(tag.name) tags FROM manga
JOIN mangafts fts ON fts.manga_id = manga.id
JOIN manga_tag_association ass ON ass.manga_id = manga.id
JOIN tag ON tag.id = ass.tag_id
WHERE fts.title MATCH 'mushishi' GROUP BY manga.id;

Because I would expect it to look up in the fts table first, then join based on the found ids, but the query plan is as follows:

QUERY PLAN
|--SCAN manga
|--SEARCH ass USING AUTOMATIC COVERING INDEX (manga_id=?)
|--SEARCH tag USING INTEGER PRIMARY KEY (rowid=?)
`--SCAN fts VIRTUAL TABLE INDEX 3:

I tried changing the query to this

SELECT manga.title, GROUP_CONCAT(tag.name) tags FROM manga
JOIN mangafts fts
  ON (fts.title MATCH 'mushishi' AND fts.manga_id = manga.id)
JOIN manga_tag_association ass ON ass.manga_id = manga.id
JOIN tag ON tag.id = ass.tag_id
GROUP BY manga.id;

Yet the query plan is exactly the same.

I actually have several questions:

  1. Why is it scanning the manga table, and why first?
  2. Why isn't it scanning the fts table first? The entire point I have it is to speed up my searches.
  3. Am I doing something wrong, based on what I need?

Edit: although it doesn't affect the plan, I've realized I'm supposed to write the match search as fts_table_name MATCH 'column: text to search' instead, not as I did above.

Edit 2: Ok I don't know why the previous snippets have such a plan, but I rewrote it from scratch because I realized there could be manga items that don't have associated tags and they wouldn't show up with those joins. I'm leaving this info here just in case someone else finds it useful or is learning like me :)

SELECT manga.id, manga.title, GROUP_CONCAT(tag.name) AS tags FROM manga
LEFT JOIN manga_tag_association ass ON ass.manga_id = manga.id
LEFT JOIN tag ON tag.id = ass.tag_id
JOIN mangafts ON mangafts.manga_id = manga.id
WHERE mangafts MATCH 'title: mushishi' GROUP BY manga.id;

and now the plan is this:

QUERY PLAN
|--SCAN mangafts VIRTUAL TABLE INDEX 4:
|--SEARCH manga USING INTEGER PRIMARY KEY (rowid=?)
|--SCAN ass LEFT-JOIN
|--SEARCH tag USING INTEGER PRIMARY KEY (rowid=?) LEFT-JOIN
`--USE TEMP B-TREE FOR GROUP BY

Solution

  • Finding the best query plan is not easy, and sometimes we (humans) make assumptions based on informations that are not available to the query optimizer. What for you is a manga, a tag or a text index, for the query planner are just tables A, B, C, D, joined by some fields.

    You assume that sqlite should be scanning the FTS table first, because YOU KNOW that it will filter out most of the resulting rows. But a search in an FTS virtual table is more complex than scanning a normal "real" table, so sqlite will probably try to search it the lowest possible.

    Also, GROUP BY (and ORDER BY) are expensive operations, the more expensive the more rows have to be sorted/grouped, so if a plan can avoid doing those operations, sqlite will try to avoid them.

    Since you are grouping by manga.id, scanning that table will avoid the need for a separate sort (I suppose id is an integer primary key, so the table is already sorted by id). YOU KNOW that every manga has tag associations, and is present in mangafts, but sqlite doesn't know it and it could presume that the JOINs will reduce the number of total rows, thus needing less searches of mangafts. Also, it could decide that searching mangafts by id is faster that searching by title.

    When you change your query and add some LEFT JOINS, now sqlite knows that those LEFT JOINs will not reduce the number of rows, and can presume that it is better to search mangafts by title, then get the corrisponding rows from manga, even if this means that it has to later sort them for the group by.

    All these evaluations can change based on the number of rows in each table and the selectivity of the join conditions. ANALYZE will get some data from the table, that the query planner can use to calculate the faster plan, but sometimes even those numbers cannot tell the query planner what YOU know about your data.

    Maybe Sqlite is choosing this plan because you only have a thousand of manga in your db, and this plan is fast as any other for such a low number of records. But if you had a million manga in it, sqlite could calculate a better plan even with your original query.

    As NickW noted in the comments, it is pointless to try to optimize a query plan if there is no performance problem, because if and when the problem should arise, the conditions and the query plan could be very different from what you are seeing now.