
How to transliterate the Polish alphabet with US-ASCII?


Is there a more or less standard way to transliterate the Polish alphabet using only the original ASCII (US-ASCII) characters?

This question can be broken into two related, more precise questions:

  1. How can the 32 letters of the Polish alphabet be transliterated with only the 26 letters of the basic Latin alphabet so that a Polish reader understands the result as easily as possible?
  2. Is there a reversible way to transliterate any Polish text with US-ASCII characters?

I can see that most Polish websites just remove the diacritics in their URLs. For example:

Świętosław Milczący    →  Swietoslaw Milczacy
Dzierżykraj Łaźniński  →  Dzierzykraj Lazninski
Józef Soćko            →  Jozef Socko

This is hardly reversible, but is it the most readable transliteration for Polish readers?
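
For reference, the diacritic stripping shown above can be sketched in a few lines of Python. This is only a minimal sketch, not what any particular website does: the function name `strip_polish_diacritics` is made up for illustration, and it assumes the input contains only Polish letters and plain ASCII.

```python
import unicodedata

def strip_polish_diacritics(text):
    """Drop diacritics from Polish text (hypothetical helper, not a standard API)."""
    # 'ł' and 'Ł' have no Unicode decomposition, so they are replaced by hand.
    text = text.replace('ł', 'l').replace('Ł', 'L')
    # NFD splits e.g. 'ś' into 's' plus a combining acute accent.
    decomposed = unicodedata.normalize('NFD', text)
    # Keep only the base characters, dropping the combining marks.
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_polish_diacritics('Świętosław Milczący'))  # Swietoslaw Milczacy
```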

In other cases, a more complicated ad hoc transliteration is used, like Wałęsa → Wawensa. Are there any standard rules for this latter kind of transformation?

P.S. Just to clarify, I'm interested in transliteration rules (like ł → w, ę → en), not the implementation. Something like this table.
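
Purely to make the idea of a rule table concrete, here is a minimal sketch of applying one. Only the two rules mentioned above (ł → w, ę → en) come from this question; the capitalised variants and the names `RULES` and `apply_rules` are assumptions for illustration, not an established standard.

```python
# Hypothetical rule table: only 'ł' -> 'w' and 'ę' -> 'en' are taken from the
# question; the upper-case entries are an assumed extension.
RULES = {
    'ł': 'w', 'Ł': 'W',
    'ę': 'en', 'Ę': 'En',
}

def apply_rules(text, rules=RULES):
    """Replace each character that has a rule; leave everything else untouched."""
    return ''.join(rules.get(c, c) for c in text)

print(apply_rules('Wałęsa'))  # Wawensa
```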


Solution

  • You could encode the presence of diacritics as a ternary number (one base-3 digit per character: 0 for no diacritic, 1 or 2 for a diacritic) and store it next to the plain ASCII transliteration to make it reversible.

    URLs often contain an additional ID anyway, as this one does: 48686148/how-to-transliterate-polish-alphabet-with-us-ascii

    Here is an example implementation:

    trans_table = {
        'A': ('A', 0),   'a': ('a', 0),
        'Ą': ('A', 1),   'ą': ('a', 1),
        'B': ('B', 0),   'b': ('b', 0),
        'C': ('C', 0),   'c': ('c', 0),
        'Ć': ('C', 1),   'ć': ('c', 1),
        'D': ('D', 0),   'd': ('d', 0),
        'E': ('E', 0),   'e': ('e', 0),
        'Ę': ('E', 1),   'ę': ('e', 1),
        'F': ('F', 0),   'f': ('f', 0),
        'G': ('G', 0),   'g': ('g', 0),
        'H': ('H', 0),   'h': ('h', 0),
        'I': ('I', 0),   'i': ('i', 0),
        'J': ('J', 0),   'j': ('j', 0),
        'K': ('K', 0),   'k': ('k', 0),
        'L': ('L', 0),   'l': ('l', 0),
        'Ł': ('L', 1),   'ł': ('l', 1),
        'M': ('M', 0),   'm': ('m', 0),
        'N': ('N', 0),   'n': ('n', 0),
        'Ń': ('N', 1),   'ń': ('n', 1),
        'O': ('O', 0),   'o': ('o', 0),
        'Ó': ('O', 1),   'ó': ('o', 1),
        'P': ('P', 0),   'p': ('p', 0),
        'R': ('R', 0),   'r': ('r', 0),
        'S': ('S', 0),   's': ('s', 0),
        'Ś': ('S', 1),   'ś': ('s', 1),
        'T': ('T', 0),   't': ('t', 0),
        'U': ('U', 0),   'u': ('u', 0),
        'W': ('W', 0),   'w': ('w', 0),
        'Y': ('Y', 0),   'y': ('y', 0),
        'Z': ('Z', 0),   'z': ('z', 0),
        'Ź': ('Z', 1),   'ź': ('z', 1),
        'Ż': ('Z', 2),   'ż': ('z', 2),
    }
    
    
    
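    def pol2ascii(text):
        """Return the ASCII form of text plus a '_<hex>' suffix encoding the diacritics."""
        plain = []
        diacritics = []
        for c in text:
            # Characters missing from the table pass through with digit 0.
            ascii_char, diacritic = trans_table.get(c, (c, 0))
            plain.append(ascii_char)
            diacritics.append(str(diacritic))

        # Read the digits as one base-3 number, first character least significant.
        # The leading '1' is a sentinel so the digit string is never empty.
        return ''.join(plain) + '_' + hex(int('1' + ''.join(reversed(diacritics)), 3))[2:]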
    
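    # Inverse lookup: (ASCII letter, diacritic digit) -> Polish letter.
    reverse_trans_table = {
        ascii_pair: pol_char for pol_char, ascii_pair in trans_table.items()
    }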
    
    def ascii2pol(text):
        """Invert pol2ascii: restore the diacritics from the '_<hex>' suffix."""
        plain, diacritics = text.rsplit('_', 1)
        diacritics = int(diacritics, base=16)
        res = []

        for c in plain:
            # Peel off one base-3 digit per character, least significant first.
            diacritics, diacritic = divmod(diacritics, 3)
            pol_char = reverse_trans_table.get((c, diacritic), c)
            res.append(pol_char)

        return ''.join(res)
    
    
    TESTS = '''
    Świętosław Milczący
    Dzierżykraj Łaźniński
    Józef Soćko
    '''
    
    for line in TESTS.strip().splitlines():
        plain = pol2ascii(line)
        original = ascii2pol(plain)
        print(original, plain)
        assert original == line
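
To make the suffix arithmetic easier to follow, here is a small self-contained sketch of just the digit packing, worked through for the word "Żółć" (digits 2, 1, 1, 1). The helper names `encode_digits` and `decode_digits` are hypothetical and exist only for this illustration; they mirror the approach described above rather than replace it.

```python
def encode_digits(digits):
    """Pack base-3 digits (one per character, first character first) into hex."""
    # Reverse so the first character lands in the least significant position,
    # then prepend the sentinel '1' so the digit string is never empty.
    ternary = '1' + ''.join(str(d) for d in reversed(digits))
    return format(int(ternary, 3), 'x')

def decode_digits(suffix, length):
    """Recover `length` base-3 digits from the hex suffix, first character first."""
    value = int(suffix, 16)
    digits = []
    for _ in range(length):
        value, digit = divmod(value, 3)
        digits.append(digit)
    return digits

# "Żółć": Ż -> 2, ó -> 1, ł -> 1, ć -> 1
# ternary string '11112' = 81 + 27 + 9 + 3 + 2 = 122 = 0x7a
print(encode_digits([2, 1, 1, 1]))   # 7a
print(decode_digits('7a', 4))        # [2, 1, 1, 1]
```

With this packing the suffix grows by roughly 1.6 bits per character of the text, so long texts get long suffixes; for short names in URLs it stays compact.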