python-3.xsqlitefull-text-searchfts5

How to enable FTS5 search a string with ".", "_" and "0-9"?


I have a table holding 300K records of strings using alphanumeric, digit, dot, underscore and brackets [].

I use FTS5 extension to sqlite3 to enable fast search over that table. This is how i create the FTS virtual table:

database = sqlite3.connect("mydb.db")
db_cursor = database.cursor()
db_cursor.execute("create virtual table field_names USING fts5 (full_path)")

I am adding the ~300K records using the below code in a loop:

database.execute("insert into field_names(full_path) values (?)", (field_path,))

sample records:

a.extbootrecord.field_db0    
a.extbootrecord.field_db1    
a.extbootrecord.field_db8    
a.extbootrecord.field_db9    
a.extbootrecord.field_db10   
a.extbootrecord.field_db11   
a.extbootrecord.field_db12   
a.extbootrecord.field_db15   

Using the following query:

db_cursor.execute("select full_path from field_names where field_names = '\"%s\"'" % search_phrase)
return_list = list()

entries = db_cursor.fetchmany(100)
while entries:
    return_list.extend([entry[0] for entry in entries])
    entries = db_cursor.fetchmany(100)

with the following search_phrase yields the following:

  1. ext : nothing
  2. extbootrecord : all the records
  3. extbootrecrd. : all the records
  4. extbootrecord.fie : nothing
  5. extbootrecord.field : all the records
  6. extbootrecord.field_db1 : only a.extbootrecord.field_db1 record, I would expect field_db1, field_db10, field_db11... to be returned

Seems like I am missing some FTS configuration to consume ., _ and 0-9 as valid characters as part of a token.

I tried configurating the FTS tokenizer using tokenize = \"unicode61 tokenchars '_.'\" in the creation statement, but with no luck.

What am I missing?


Solution

  • Here's a full example to make sense how tokenchars works because I think the syntax is rather subtle and easy to get wrong.

    Let's start by creating a test environment in a shell:

    $ sqlite3 ":memory:"
    

    Now lets create an fts5 table that will allow the period:

    sqlite> CREATE VIRTUAL TABLE IF NOT EXISTS foo USING fts5(name UNINDEXED, keywords, tokenize="unicode61 tokenchars '.'");
    

    Notice how the tokenize value is setup, you wrap double quotes around the value and single quotes around the tokenchars you want to add (we're just adding the period here but you can add as many characters as you want).

    With our table ready to go we're ready to insert some values:

    sqlite> INSERT INTO foo (name, keywords) VALUES ('bar', '1.0');
    sqlite> INSERT INTO foo (name, keywords) VALUES ('che', '1.5');
    sqlite> INSERT INTO foo (name, keywords) VALUES ('baz', '2.5');
    

    And search for those values:

    sqlite> SELECT name from foo WHERE keywords MATCH '"1."*';
    bar
    che
    sqlite> SELECT name from foo WHERE keywords MATCH '"1.5"*';
    che
    sqlite> SELECT name from foo WHERE keywords MATCH '"2"*';
    baz 
    

    Notice how we have to search for the values if we have a period in our search string, we have to wrap our search string in double quotes (then add the star outside of those double quotes if we want to do a prefix search) and then wrap the whole string in single quotes as per usual.

    If we don't use double quotes:

    sqlite> SELECT name from foo WHERE keywords MATCH '1.*';
    Error: fts5: syntax error near "."
    

    Or we use double quotes incorrectly:

    sqlite> SELECT count(*) from foo WHERE keywords MATCH '1"."*';
    0
    

    Then we will get vague unhelpful errors and unexpected results, which will lead us to search the internet trying to figure out what we're doing wrong and stumble on this question :)