sqlitetclhebrewunicode-normalization

Unicode normalization of Hebrew in Tcl, SQLite, or whatever else will work in Linux OS


I'm trying to perform joins in SQLite on Hebrew words including vowel points and cantillation marks and it appears that the sources being joined built the components in different orders, such that the final strings/words appear identical on the screen but fail to match when they should. I'm pretty sure all sources are UTF-8.

I don't see a built in method of unicode normalization in SQLite, which would be the easiest solution; but found this link of Tcl Unicode but it looks a bit old using Tcl 8.3 and Unicode 1.0. Is this the most up-to-date method of normalizing unicode in Tcl and is it appropriate for Hebrew?

If Tcl doesn't have a viable method for Hebrew, is there a preferred scripting language for handling Hebrew that could be used to generate normalized strings for joining? I'm using Manjaro Linux but am a bit of a novice at most of this.

I'm capable enough with JavaScript, browser extensions, and the SQLite C API to pass the data from C to the browser to be normalized and back again to be stored in the database; but I figured there is likely a better method. I refer to the browser because I assume that they area kept most up to date for obvious reasons.

Thank you for any guidance you may be able to provide.


I used the following code in attempt to make the procedure provided by @DonalFellows a SQLite function such that it was close to not bringing the data into Tcl. I'm not sure how SQLite functions really work in that respect but that is why I tried it. I used the foreach loop solely to print some indication that the query was running and progressing because it took about an hour to complete.

However, that's probably pretty good for my ten-year old machine and the fact that it ran on 1) the Hebrew with vowel points, 2) with vowel points and cantillation marks and 3) the Septuagint translation of the Hebrew for all thirty-nine books of the Old Testament, and then two different manuscripts of Koine Greek for all twenty-seven books of the New Testament in that hour.

I still have to run the normalization on the other two sources to know how effective this is overall; however, after running it on this one which is the most involved of the three, I ran the joins again and the number of matches nearly doubled.

proc normalize {string {form nfc}} {
    exec uconv -f utf-8 -t utf-8 -x "::$form;" << $string
}
# Arguments are: dbws function NAME ?SWITCHES? SCRIPT
dbws function normalize -returntype text -deterministic -directonly { normalize }

foreach { b } { 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 } {
  puts "Working on book $b"
  dbws eval { update src_original set uni_norm = normalize(original) where book_no=$b }
  puts "Completed book $b"
}

Solution

  • If you're not in a hurry, you can pass the data through uconv. You'll need to be careful when working with non-normalized data though; Tcl's pretty aggressive about format conversion on input and output. (Just… not about normalization; the normalization tables are huge and most code doesn't need to do it.)

    proc normalize {string {form nfc}} {
        exec uconv -f utf-8 -t utf-8 -x "::$form;" << $string
    }
    

    The code above only really works on systems where the system encoding is UTF-8… but that's almost everywhere that has uconv in the first place.

    Useful normalization forms are: nfc, nfd, nfkc and nfkd. Pick one and force all your text to be in it (ideally on ingestion into the database… but I've seen so many broken DBs in this regard that I suggest being careful).