Is there a more or less standard way to transliterate the Polish alphabet using only the original ASCII (US-ASCII) characters?
This question can be broken down into two related, more precise questions:
I can see that most Polish websites just remove the diacritics in their URLs. For example:
Świętosław Milczący → Swietoslaw Milczacy
Dzierżykraj Łaźniński → Dzierzykraj Lazninski
Józef Soćko → Jozef Socko
This is hardly reversible, but is it the most readable transliteration for Polish readers?
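For reference, the diacritic-stripping convention shown in the examples above can be sketched in a few lines of Python. This is only an illustration of that convention, not a standard; the names `STRIP_TABLE` and `strip_polish_diacritics` are made up here. An explicit table is used because ł has no Unicode decomposition, so approaches based purely on Unicode normalization would leave it unchanged.

```python
# Map each Polish letter with a diacritic to its base ASCII letter.
# Both strings must list the letters in the same order.
POLISH = 'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ'
ASCII = 'acelnoszzACELNOSZZ'
STRIP_TABLE = str.maketrans(POLISH, ASCII)


def strip_polish_diacritics(text):
    """Return text with Polish diacritics removed (lossy, not reversible)."""
    return text.translate(STRIP_TABLE)


for name in ('Świętosław Milczący', 'Dzierżykraj Łaźniński', 'Józef Soćko'):
    print(name, '->', strip_polish_diacritics(name))
```

Note that ź and ż both collapse to z, which is one reason this mapping cannot be reversed.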
In some other cases, a more complicated ad hoc transliteration might be used, like Wałęsa → Wawensa. Are there any standard rules for this latter kind of transformation?
P.S. Just to clarify, I'm interested in transliteration rules (like ł → w, ę → en), not the implementation. Something like this table.
You could encode the presence of diacritics as a ternary number (one digit per character: 0 for no diacritic, 1 or 2 for a diacritic variant) and store it next to the plain ASCII transliteration to make it reversible.
URLs often contain some additional IDs anyway, even this one: 48686148/how-to-transliterate-polish-alphabet-with-us-ascii
Here is an example implementation:
trans_table = {
'A': ('A', 0), 'a': ('a', 0),
'Ą': ('A', 1), 'ą': ('a', 1),
'B': ('B', 0), 'b': ('b', 0),
'C': ('C', 0), 'c': ('c', 0),
'Ć': ('C', 1), 'ć': ('c', 1),
'D': ('D', 0), 'd': ('d', 0),
'E': ('E', 0), 'e': ('e', 0),
'Ę': ('E', 1), 'ę': ('e', 1),
'F': ('F', 0), 'f': ('f', 0),
'G': ('G', 0), 'g': ('g', 0),
'H': ('H', 0), 'h': ('h', 0),
'I': ('I', 0), 'i': ('i', 0),
'J': ('J', 0), 'j': ('j', 0),
'K': ('K', 0), 'k': ('k', 0),
'L': ('L', 0), 'l': ('l', 0),
'Ł': ('L', 1), 'ł': ('l', 1),
'M': ('M', 0), 'm': ('m', 0),
'N': ('N', 0), 'n': ('n', 0),
'Ń': ('N', 1), 'ń': ('n', 1),
'O': ('O', 0), 'o': ('o', 0),
'Ó': ('O', 1), 'ó': ('o', 1),
'P': ('P', 0), 'p': ('p', 0),
'R': ('R', 0), 'r': ('r', 0),
'S': ('S', 0), 's': ('s', 0),
'Ś': ('S', 1), 'ś': ('s', 1),
'T': ('T', 0), 't': ('t', 0),
'U': ('U', 0), 'u': ('u', 0),
'W': ('W', 0), 'w': ('w', 0),
'Y': ('Y', 0), 'y': ('y', 0),
'Z': ('Z', 0), 'z': ('z', 0),
'Ź': ('Z', 1), 'ź': ('z', 1),
'Ż': ('Z', 2), 'ż': ('z', 2),
}
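def pol2ascii(text):
    plain = []
    diacritics = []
    for c in text:
        # Characters missing from the table pass through with digit 0.
        ascii_char, diacritic = trans_table.get(c, (c, 0))
        plain.append(ascii_char)
        diacritics.append(str(diacritic))
    # Reverse the digits so the first character ends up least significant;
    # the leading '1' keeps the ternary string non-empty.
    ternary = '1' + ''.join(reversed(diacritics))
    return ''.join(plain) + '_' + hex(int(ternary, 3))[2:]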
# Inverse lookup: (ASCII letter, diacritic digit) -> Polish letter.
reverse_trans_table = {
    pair: pol_char for pol_char, pair in trans_table.items()
}
def ascii2pol(text):
    plain, diacritics = text.rsplit('_', 1)
    diacritics = int(diacritics, base=16)
    res = []
    for c in plain:
        # Read the ternary digits back, least significant first.
        diacritic = diacritics % 3
        diacritics = diacritics // 3
        pol_char = reverse_trans_table.get((c, diacritic), c)
        res.append(pol_char)
    return ''.join(res)
TESTS = '''
Świętosław Milczący
Dzierżykraj Łaźniński
Józef Soćko
'''
for l in TESTS.strip().splitlines():
    plain = pol2ascii(l)
    original = ascii2pol(plain)
    print(original, plain)
    assert original == l
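To make the suffix less opaque, here is the arithmetic traced by hand for one short word. This is a standalone sketch, separate from the implementation above; the word "Łódź" and the variable names are picked only for illustration, and the digits come from the table (Ł, ó and ź have digit 1, d has digit 0).

```python
# Diacritic digits for the letters of "Łódź", in reading order:
# Ł -> ('L', 1), ó -> ('o', 1), d -> ('d', 0), ź -> ('z', 1)
digits = [1, 1, 0, 1]

# Encode: reverse so the first letter lands in the least significant
# ternary digit, and prepend a sentinel '1' so the string is never empty.
ternary = '1' + ''.join(str(d) for d in reversed(digits))
value = int(ternary, 3)   # '11011' in base 3 = 81 + 27 + 0 + 3 + 1
suffix = hex(value)[2:]   # hexadecimal text without the '0x' prefix

# Decode: peel the ternary digits off again, least significant first.
n = int(suffix, 16)
decoded = []
for _ in digits:
    decoded.append(n % 3)
    n //= 3

print(ternary, value, suffix, decoded)
```

So "Łódź" would be stored as `Lodz_70`: the ternary string is `11011`, which is 112 in decimal and `70` in hexadecimal. After all four digits are read back, only the sentinel 1 remains.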