pythonsqlitestring-comparisonsimilarity

String similarity with Python + Sqlite (Levenshtein distance / edit distance)


Is there a string similarity measure available in Python+Sqlite, for example with the sqlite3 module?

Example of use case:

import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')

This query should match the row with ID 1, but not the row with ID 2:

c.execute('SELECT * FROM mytable WHERE dist(description, "He lo wrold gyus") < 6')

How to do this in Sqlite+Python?

Notes about what I've found so far:


Solution

  • Here is a ready-to-use example test.py:

    import sqlite3
    db = sqlite3.connect(':memory:')
    db.enable_load_extension(True)
    db.load_extension('./spellfix')                 # for Linux
    #db.load_extension('./spellfix.dll')            # <-- UNCOMMENT HERE FOR WINDOWS
    db.enable_load_extension(False)
    c = db.cursor()
    c.execute('CREATE TABLE mytable (id integer, description text)')
    c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
    c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')
    c.execute('SELECT * FROM mytable WHERE editdist3(description, "hel o wrold guy") < 600')
    print c.fetchall()
    # Output: [(1, u'hello world, guys')]
    

    Important note: The distance editdist3 is normalized so that

    the value of 100 is used for insertion and deletion and 150 is used for substitution


    Here is what to do first on Windows:

    1. Download https://sqlite.org/2016/sqlite-src-3110100.zip, https://sqlite.org/2016/sqlite-amalgamation-3110100.zip and unzip them

    2. Replace C:\Python27\DLLs\sqlite3.dll by the new sqlite3.dll from here. If skipping this, you'd get a sqlite3.OperationalError: The specified procedure could not be found later

    3. Run:

      call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat"  
      

      or

      call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat" x64
      cl /I sqlite-amalgamation-3110100/ sqlite-src-3110100/ext/misc/spellfix.c /link /DLL /OUT:spellfix.dll
      python test.py
      

      (With MinGW, it would be: gcc -g -shared spellfix.c -I ~/sqlite-amalgation-3230100/ -o spellfix.dll)

    Here is how to do it on Linux Debian:

    (based on this answer)

    apt-get -y install unzip build-essential libsqlite3-dev
    wget https://sqlite.org/2016/sqlite-src-3110100.zip
    unzip sqlite-src-3110100.zip
    gcc -shared -fPIC -Wall -Isqlite-src-3110100 sqlite-src-3110100/ext/misc/spellfix.c -o spellfix.so
    python test.py
    

    Here is how to do it on Linux Debian with an older Python version:

    If your distribution's Python is a bit old, it will require another method. As sqlite3 module is built-in in Python, it seems not straightforward to upgrade it (pip install --upgrade pysqlite would only upgrade the pysqlite module, not the underlying SQLite library). Thus this method works for example if import sqlite3; print sqlite3.sqlite_version is 3.8.2:

    wget https://www.sqlite.org/src/tarball/27392118/SQLite-27392118.tar.gz
    tar xvfz SQLite-27392118.tar.gz
    cd SQLite-27392118 ; sh configure ; make sqlite3.c ; cd ..
    gcc -g -fPIC -shared SQLite-27392118/ext/misc/spellfix.c -I SQLite-27392118/src/ -o spellfix.so
    python test.py   # [(1, u'hello world, guys')]