I have a table holding 300K records of strings using alphanumeric, digit, dot, underscore and brackets [].
I use FTS5 extension to sqlite3 to enable fast search over that table. This is how i create the FTS virtual table:
database = sqlite3.connect("mydb.db")
db_cursor = database.cursor()
db_cursor.execute("create virtual table field_names USING fts5 (full_path)")
I am adding the ~300K records using the below code in a loop:
database.execute("insert into field_names(full_path) values (?)", (field_path,))
sample records:
a.extbootrecord.field_db0
a.extbootrecord.field_db1
a.extbootrecord.field_db8
a.extbootrecord.field_db9
a.extbootrecord.field_db10
a.extbootrecord.field_db11
a.extbootrecord.field_db12
a.extbootrecord.field_db15
Using the following query:
db_cursor.execute("select full_path from field_names where field_names = '\"%s\"'" % search_phrase)
return_list = list()
entries = db_cursor.fetchmany(100)
while entries:
return_list.extend([entry[0] for entry in entries])
entries = db_cursor.fetchmany(100)
with the following search_phrase
yields the following:
ext
: nothingextbootrecord
: all the recordsextbootrecrd.
: all the records extbootrecord.fie
: nothingextbootrecord.field
: all the recordsextbootrecord.field_db1
: only a.extbootrecord.field_db1
record, I would expect field_db1, field_db10, field_db11... to be returnedSeems like I am missing some FTS configuration to consume .
, _
and 0-9
as valid characters as part of a token.
I tried configurating the FTS tokenizer using tokenize = \"unicode61 tokenchars '_.'\"
in the creation statement, but with no luck.
What am I missing?
Here's a full example to make sense how tokenchars
works because I think the syntax is rather subtle and easy to get wrong.
Let's start by creating a test environment in a shell:
$ sqlite3 ":memory:"
Now lets create an fts5 table that will allow the period:
sqlite> CREATE VIRTUAL TABLE IF NOT EXISTS foo USING fts5(name UNINDEXED, keywords, tokenize="unicode61 tokenchars '.'");
Notice how the tokenize
value is setup, you wrap double quotes around the value and single quotes around the tokenchars
you want to add (we're just adding the period here but you can add as many characters as you want).
With our table ready to go we're ready to insert some values:
sqlite> INSERT INTO foo (name, keywords) VALUES ('bar', '1.0');
sqlite> INSERT INTO foo (name, keywords) VALUES ('che', '1.5');
sqlite> INSERT INTO foo (name, keywords) VALUES ('baz', '2.5');
And search for those values:
sqlite> SELECT name from foo WHERE keywords MATCH '"1."*';
bar
che
sqlite> SELECT name from foo WHERE keywords MATCH '"1.5"*';
che
sqlite> SELECT name from foo WHERE keywords MATCH '"2"*';
baz
Notice how we have to search for the values if we have a period in our search string, we have to wrap our search string in double quotes (then add the star outside of those double quotes if we want to do a prefix search) and then wrap the whole string in single quotes as per usual.
If we don't use double quotes:
sqlite> SELECT name from foo WHERE keywords MATCH '1.*';
Error: fts5: syntax error near "."
Or we use double quotes incorrectly:
sqlite> SELECT count(*) from foo WHERE keywords MATCH '1"."*';
0
Then we will get vague unhelpful errors and unexpected results, which will lead us to search the internet trying to figure out what we're doing wrong and stumble on this question :)