UPDATE: In the end I used the Java 6 Normalizer to find out which characters are extensions of a-zA-Z. So now all weird characters get translated into those 50 ASCII letters. No noticeable slowdown when typing/autocompleting.
What algorithm does the GAE Search API use to process strings?
For optimization purposes (within the browser) I need to mimic whatever processing is applied to the "needle" string before it is matched against the indexes. Basically, that means translating "weird" characters into their "boring" (and lowercase) representations:
Is there some standardized (or at least "well known") translation table so I don't miss some characters?
In the end I hard-coded a Map where the key is the "plain" character and the value is a string concatenating all of that key's "weird" versions. (In Java, every "weird" character knows what its "plain" counterpart is.)
In Java you can make the translation like this:
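String dropAccents(String weirdCharacter) {
    // NFD splits each character into a base letter plus combining marks; the regex then removes the marks.
    return java.text.Normalizer.normalize(weirdCharacter, java.text.Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}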
...and you keep the characters whose result falls within code points 65..90 (upper case, A-Z) or 97..122 (lower case, a-z); the upper bounds 91 and 123 are exclusive.
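To make that step concrete, here is a minimal sketch of how such a table could be generated rather than typed by hand. The class name AccentTableBuilder, the helper isAsciiLetter, and the scanned range are my own assumptions for illustration, and the lambda syntax needs Java 8 or later (on Java 6 you would use a plain get/put instead of computeIfAbsent). It scans a range of BMP code points, runs each through the same normalize-and-strip step as dropAccents above, and groups every character that reduces to a single ASCII letter under that letter.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.TreeMap;

class AccentTableBuilder {

    // Same approach as dropAccents above: decompose to NFD, then strip combining marks.
    static String dropAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    // Scans the BMP code points in [from, to] and groups each character that
    // differs from, but reduces to, a single ASCII letter under that letter.
    static Map<Character, StringBuilder> buildTable(int from, int to) {
        Map<Character, StringBuilder> table = new TreeMap<>();
        for (int cp = from; cp <= to; cp++) {
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue; // lone surrogates are not characters
            }
            String candidate = String.valueOf((char) cp);
            String plain = dropAccents(candidate);
            if (plain.length() == 1
                    && isAsciiLetter(plain.charAt(0))
                    && !plain.equals(candidate)) {
                table.computeIfAbsent(plain.charAt(0), k -> new StringBuilder())
                     .append(candidate);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // The range is an assumption: it starts at Latin-1 Supplement and ends
        // just past the Letterlike Symbols block.
        buildTable(0x00C0, 0x214F).forEach((plain, weird) ->
                System.out.println("translationTable.put(\"" + plain + "\", \"" + weird + "\");"));
    }
}
```

Running main prints lines in the same shape as the hard-coded table; the exact contents depend on the range scanned and on the Unicode data shipped with the JVM, so the output may not match the list character for character.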
The JavaScript/Java code that initializes such a map has roughly 50 rather short lines.
translationTable.put("A", "ÀÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẦẨẪẬẮẰẲẴẶÅ");
translationTable.put("B", "ḂḄḆ");
translationTable.put("C", "ÇĆĈĊČḈ");
translationTable.put("D", "ĎḊḌḎḐḒ");
translationTable.put("E", "ÈÉÊËĒĔĖĘĚȄȆȨḔḖḘḚḜẸẺẼẾỀỂỄỆ");
translationTable.put("F", "Ḟ");
translationTable.put("G", "ĜĞĠĢǦǴḠ");
translationTable.put("H", "ĤȞḢḤḦḨḪ");
translationTable.put("I", "ÌÍÎÏĨĪĬĮİǏȈȊḬḮỈỊ");
translationTable.put("J", "Ĵ");
translationTable.put("K", "ĶǨḰḲḴK");
translationTable.put("L", "ĹĻĽḶḸḺḼ");
translationTable.put("M", "ḾṀṂ");
translationTable.put("N", "ÑŃŅŇǸṄṆṈṊ");
translationTable.put("O", "ÒÓÔÕÖŌŎŐƠǑǪǬȌȎȪȬȮȰṌṎṐṒỌỎỐỒỔỖỘỚỜỞỠỢ");
translationTable.put("P", "ṔṖ");
translationTable.put("R", "ŔŖŘȐȒṘṚṜṞ");
translationTable.put("S", "ŚŜŞŠȘṠṢṤṦṨ");
translationTable.put("T", "ŢŤȚṪṬṮṰ");
translationTable.put("U", "ÙÚÛÜŨŪŬŮŰŲƯǓǕǗǙǛȔȖṲṴṶṸṺỤỦỨỪỬỮỰ");
translationTable.put("V", "ṼṾ");
translationTable.put("W", "ŴẀẂẄẆẈ");
translationTable.put("X", "ẊẌ");
translationTable.put("Y", "ÝŶŸȲẎỲỴỶỸ");
translationTable.put("Z", "ŹŻŽẐẒẔ");
translationTable.put("a", "àáâãäåāăąǎǟǡǻȁȃȧḁạảấầẩẫậắằẳẵặ");
translationTable.put("b", "ḃḅḇ");
translationTable.put("c", "çćĉċčḉ");
translationTable.put("d", "ďḋḍḏḑḓ");
translationTable.put("e", "èéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ");
translationTable.put("f", "ḟ");
translationTable.put("g", "ĝğġģǧǵḡ");
translationTable.put("h", "ĥȟḣḥḧḩḫẖ");
translationTable.put("i", "ìíîïĩīĭįǐȉȋḭḯỉị");
translationTable.put("j", "ĵǰ");
translationTable.put("k", "ķǩḱḳḵ");
translationTable.put("l", "ĺļľḷḹḻḽ");
translationTable.put("m", "ḿṁṃ");
translationTable.put("n", "ñńņňǹṅṇṉṋ");
translationTable.put("o", "òóôõöōŏőơǒǫǭȍȏȫȭȯȱṍṏṑṓọỏốồổỗộớờởỡợ");
translationTable.put("p", "ṕṗ");
translationTable.put("r", "ŕŗřȑȓṙṛṝṟ");
translationTable.put("s", "śŝşšșṡṣṥṧṩ");
translationTable.put("t", "ţťțṫṭṯṱẗ");
translationTable.put("u", "ùúûüũūŭůűųưǔǖǘǚǜȕȗṳṵṷṹṻụủứừửữự");
translationTable.put("v", "ṽṿ");
translationTable.put("w", "ŵẁẃẅẇẉẘ");
translationTable.put("x", "ẋẍ");
translationTable.put("y", "ýÿŷȳẏẙỳỵỷỹ");
translationTable.put("z", "źżžẑẓẕ");
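For completeness, here is a rough sketch of how a table in this shape could be applied to a needle string. The class NeedleNormalizer and its method names are hypothetical, and the final lowercasing reflects what I want the needle to look like, not a confirmed description of what the Search API does internally. The table is inverted once into a weird-to-plain lookup so that each typed character costs a single map access.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class NeedleNormalizer {

    // Inverted lookup: each "weird" character maps to its "plain" letter.
    private final Map<Character, Character> weirdToPlain = new HashMap<>();

    // translationTable has the shape used above:
    // key = plain letter, value = all of its accented variants concatenated.
    public NeedleNormalizer(Map<String, String> translationTable) {
        for (Map.Entry<String, String> entry : translationTable.entrySet()) {
            char plain = entry.getKey().charAt(0);
            for (char weird : entry.getValue().toCharArray()) {
                weirdToPlain.put(weird, plain);
            }
        }
    }

    // Replaces every known accented character with its plain letter,
    // leaves everything else untouched, then lowercases the result.
    public String normalize(String needle) {
        StringBuilder out = new StringBuilder(needle.length());
        for (int i = 0; i < needle.length(); i++) {
            char c = needle.charAt(i);
            Character plain = weirdToPlain.get(c);
            out.append(plain != null ? plain.charValue() : c);
        }
        return out.toString().toLowerCase(Locale.ROOT);
    }
}
```

Characters missing from the table pass through unchanged, so the quality of the match depends entirely on how complete the table is.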