I have strings containing two similar-looking characters. Both appear as a small 'a' with an ogonek:
ą
ą
(Note: depending on the renderer, they are sometimes rendered identically, sometimes slightly differently.)
However, they are different:
Characteristics of the 1st character:
In PostgreSQL:
select ascii('ą');
ascii
-------
261
The UTF-8 encoding in hex is \xC4\x85,
so it is a precomposed character (https://en.wikipedia.org/wiki/Precomposed_character).
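This can be verified from within PostgreSQL with convert_to, which returns the string's raw bytes in the given encoding (shown here in the default bytea hex output):
select convert_to('ą', 'UTF8');
convert_to
------------
\xc485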
Characteristics of the 2nd character:
In PostgreSQL:
select ascii('ą');
ascii
-------
97
(same as character 'a')
That strongly suggests that the rendered character is composed of two characters. And indeed it is:
The UTF-8 encoding in hex is \x61\xCC\xA8.
So it is a combination of
a \x61
and a combining character (https://en.wikipedia.org/wiki/Combining_character), the separate ogonek:
̨ \xCC\xA8
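Again, this can be checked with convert_to; the escape string literal E'a\u0328' builds the combining sequence explicitly:
select convert_to(E'a\u0328', 'UTF8');
convert_to
------------
\x61cca8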
I want to use PostgreSQL's levenshtein function to determine the similarity of words, and so I want to treat both characters as the same (as is of course intended by people who write the name of a distinctive entity with either the 1st or the 2nd character).
I assumed I could use unaccent to get rid of the ogonek in both cases, but that does not work for the 2nd character:
1st character: expected result:
select levenshtein('ą', 'x');
levenshtein
-------------
1
1st character: expected result:
select levenshtein(unaccent('ą'), 'x');
levenshtein
-------------
1
2nd character: expected result:
select levenshtein('ą', 'x');
levenshtein
-------------
2
2nd character: unexpected result:
select levenshtein(unaccent('ą'), 'x');
levenshtein
-------------
2
So, when I compare both characters with levenshtein and unaccent, the result is 1:
select levenshtein(unaccent('ą'), unaccent('ą'));
levenshtein
-------------
1
instead of 0.
How can I "get rid of the ogonek" in the 2nd case?
(How) can I use the UTF-8 codes of the strings to achieve the desired result?
Edit: As @s-man suggested, adding the combining character to unaccent.rules would solve this particular problem (see the sketch below). But to solve the precomposed vs. combined character problem with unaccent in general, I would have to explicitly add or fix every missing or misconfigured combining sequence in the config.
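If I read the unaccent rules format correctly, the source column of a rule may be any whitespace-free string, so a line like the following should map this particular combination to a plain a (the left-hand side is the two-codepoint sequence a + U+0328, visually indistinguishable from the precomposed form):
ą	a
New sessions should then pick up the edited rules file, if I understand the dictionary loading correctly.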
Removing accents will give you a Levenshtein distance of 0, but it will also give you a distance of 0 between ą and a, which does not sound ideal.
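For example, with the precomposed character from your question, unaccent reduces ą to a plain a, so:
select levenshtein(unaccent('ą'), 'a');
levenshtein
-------------
0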
The better solution would be to normalise the Unicode strings, i.e. to convert the combining character sequence E'a\u0328' into the precomposed character E'\u0105' before comparing them.
Unfortunately, Postgres doesn't seem to have a built-in Unicode normalisation function, but you can easily access one via the PL/Perl or PL/Python language extensions.
For example:
create extension plpythonu;
-- Normalize to NFC so that combining sequences such as a + U+0328
-- collapse into their precomposed equivalents before comparison
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
# plpythonu is Python 2, so the text argument arrives as a byte string
# and has to be decoded to unicode before normalizing
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ language plpythonu;
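Note that plpythonu is Python 2, which newer PostgreSQL releases no longer ship (PL/Python is Python 3 only as of PostgreSQL 15, if I remember correctly). A minimal sketch of the same function under plpython3u, where the text argument already arrives as a Unicode string and needs no decoding:
create extension plpython3u;
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
# Python 3: str is already a Unicode string, so normalize directly
return unicodedata.normalize('NFC', str)
$$ language plpython3u;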
And then:
test=# select levenshtein(unicode_normalize(E'a\u0328'), unicode_normalize(E'\u0105'));
levenshtein
-------------
0
This also solves the issue in your previous question, where the combining character was contributing to the Levenshtein distance:
test=# select levenshtein(unicode_normalize(E'a\u0328'), 'x');
levenshtein
-------------
1
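One closing note: if I am not mistaken, PostgreSQL 13 added a built-in normalize() function (plus an IS NFC NORMALIZED predicate), so on current versions the same result is available without any procedural language:
select levenshtein(normalize(E'a\u0328', NFC), normalize(E'\u0105', NFC));
levenshtein
-------------
0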