I have this UTF-8 string:
s = "Naděždaüäö"
I'd like to convert it to a string that can be encoded in "latin-1" without throwing an exception. I'd like to do so by replacing every character that is not in latin-1 with its closest representation, say in ASCII.
Since "ěž" are not in latin-1, I'd like these to be converted to "ez". "üäö" are in latin-1, so they should not be converted to "uao" but stay as "üäö".
My first try looked like this:
import unicodedata
def convert(s):
    return unicodedata.normalize(
        'NFKD', s
    ).encode(
        'latin-1', 'ignore'
    ).decode('latin-1')
And this got me at least this far:
s = "Naděžda"
print(convert(s)) # --> "Nadezda"
But then I realized that this will also convert the "äöü" as can be seen here:
s = "Naděždaäöü"
print(convert(s)) # --> "Nadezdaaou"
Alternatively I tried:
def convert2(s):
    return unicodedata.normalize(
        'NFKC', s
    ).encode(
        'latin-1', 'ignore'
    ).decode('latin-1')
Which leads to:
s = "Naděždaäöü"
print(convert2(s)) # --> "Naddaäöü"
Thanks for your help.
If you do it character by character, it works (though it's not super clean). Only the characters that fail to encode as latin-1 get normalized, so "äöü" are left alone:
import unicodedata

def convert(s):
    r = ''
    for c in s:
        try:
            c.encode('latin-1')
        except UnicodeEncodeError:
            # Not in latin-1: decompose, then drop whatever still can't be encoded
            c = unicodedata.normalize('NFKD', c).encode('latin-1', 'ignore').decode('latin-1')
        r += c
    return r
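If the explicit loop bothers you, the same idea can probably be pushed into a custom encode error handler via `codecs.register_error`. This is only a sketch: the handler name `'closest_latin1'` and the function names `_closest_latin1` and `convert3` are made up here, and it assumes the input uses precomposed characters (a decomposed "a" + combining diaeresis would still lose its accent).

```python
import codecs
import unicodedata


def _closest_latin1(exc):
    """Error handler: swap unencodable characters for their NFKD base letters."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # The slice of the input that latin-1 could not encode
    bad = exc.object[exc.start:exc.end]
    decomposed = unicodedata.normalize('NFKD', bad)
    # Keep only code points latin-1 can represent (this drops combining marks)
    replacement = ''.join(ch for ch in decomposed if ord(ch) < 256)
    return replacement, exc.end


# 'closest_latin1' is a hypothetical handler name chosen for this sketch
codecs.register_error('closest_latin1', _closest_latin1)


def convert3(s):
    return s.encode('latin-1', 'closest_latin1').decode('latin-1')


print(convert3("Naděždaäöü"))  # --> "Nadezdaäöü"
```

Like the loop version, characters with no useful decomposition (for example "€") are silently dropped.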