python unicode python-module-unicodedata

Python: convert this UTF-8 string to Latin-1


I have this UTF-8 string:

s = "Naděždaüäö"

I'd like to convert it to a string that can be encoded as "latin-1" without raising an exception. The idea is to replace every character that does not exist in latin-1 with its closest representation, for example in ASCII.

Since "ěž" are not in latin-1, I'd like these to be converted to "ez", while "üäö" are in latin-1, so they should not be converted to "uao" but stay as "üäö".

My first try looked like this:

import unicodedata

def convert(s):
    return unicodedata.normalize(
        'NFKD', s
    ).encode(
        'latin-1', 'ignore'
    ).decode('latin-1')

And this got me at least this far:

s = "Naděžda"
print(convert(s))  # --> "Nadezda"

But then I realized that this will also convert the "äöü" as can be seen here:

s = "Naděždaäöü"
print(convert(s))  # --> "Nadezdaaou"

Alternatively I tried:

def convert2(s):
    return unicodedata.normalize(
        'NFKC', s
    ).encode(
        'latin-1', 'ignore'
    ).decode('latin-1')

Which leads to:

s = "Naděždaäöü"
print(convert2(s))  # --> "Naddaäöü"
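
For what it's worth, listing the code points that each normal form produces seems to explain both results: NFKD splits every accented letter into a base letter plus a combining mark (so the mark gets ignored, for "ä" as well as for "ě"), while NFKC leaves "ě" as one code point that latin-1 cannot encode (so the whole letter gets ignored). A small sketch, where the helper name `show` is only made up for illustration:

```python
import unicodedata

def show(form, text):
    # List the Unicode name of every code point after normalization.
    return [unicodedata.name(c) for c in unicodedata.normalize(form, text)]

print(show('NFKD', '\u011b'))  # "ě": base letter followed by a combining caron
print(show('NFKC', '\u011b'))  # "ě": stays a single precomposed code point
print(show('NFKD', '\u00e4'))  # "ä": base letter followed by a combining diaeresis
```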

Thanks for your help.


Solution

  • Doing it character by character works (though it's not super clean): keep every character that latin-1 can encode, and decompose only the ones it can't.

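    import unicodedata

    def convert(s):
        r = ''
        for c in s:
            try:
                # Keep the character unchanged if latin-1 can represent it.
                c.encode('latin-1')
            except UnicodeEncodeError:
                # Otherwise decompose it and drop the parts latin-1 lacks.
                c = unicodedata.normalize('NFKD', c).encode('latin-1', 'ignore').decode('latin-1')
            r += c
        return r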
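
One case the loop above does not cover is input that already arrives in decomposed form, for example "u" followed by a combining diaeresis: each code point is then checked on its own and the combining mark is dropped. A possible variant, sketched under the assumption that composing the input with NFC first is acceptable for your data (the name `to_latin1_safe` is made up for illustration):

```python
import unicodedata

def to_latin1_safe(s):
    # Compose first so "u" + combining diaeresis becomes the single character "ü".
    s = unicodedata.normalize('NFC', s)
    parts = []
    for c in s:
        try:
            c.encode('latin-1')
            parts.append(c)  # already representable in latin-1, keep as-is
        except UnicodeEncodeError:
            # Decompose and drop whatever latin-1 cannot represent.
            decomposed = unicodedata.normalize('NFKD', c)
            parts.append(decomposed.encode('latin-1', 'ignore').decode('latin-1'))
    return ''.join(parts)

# Decomposed input: e + caron, z + caron, u + diaeresis.
print(to_latin1_safe('Nade\u030cz\u030cdau\u0308'))
```

Note that characters with no decomposition into latin-1 letters (for example "ł") are silently dropped by both versions, so a lookup table for such cases may be needed on top.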