python django replace php4 non-unicode

Python: efficiently mass-replacing unknown characters


A PHP4 + MySQL4 based project posts data to a Django 1.1 project, and some letters get mixed up along the way.
What is the best (most efficient) way to replace them in the fashion shown below?
My problem is that I cannot get the code values for those letters. Is there an online tool for that?

I have a TextField with various letters, and I want to replace them like this:

àèæëáðøûþ => ąčęėįšųūž
ÀÈÆËÁÐØÛÞ => ĄČĘĖĮŠŲŪŽ

I had a similar case where I had to clean up the code, so I used this:

def clean(string):
    # Drop ASCII control characters below 32, except tab (9), line feed (10) and carriage return (13).
    return ''.join(c for c in string if ord(c) > 31 or ord(c) in (9, 10, 13))

Update: I managed to extract the Unicode values by looking at Django debug messages (replace_from: replace_to):

{'\xe0':'\u0105', '\xe8':'\u010d', '\xe6':'\u0119', '\xeb':'\u0117', '\xe1':'\u012f',
 '\xf0':'\u0161', '\xf8':'\u0173', '\xfb':'\u016b', '\xfe':'\u017e',
 '\xc0':'\u0104', '\xc8':'\u010c', '\xc6':'\u0118', '\xcb':'\u0116', '\xc1':'\u012e',
 '\xd0':'\u0160', '\xd8':'\u0172', '\xdb':'\u016a', '\xde':'\u017d'}

So the main problem remains: how to do the replacing.


Solution

  • Try the str.replace() method; it should work with Unicode strings.

    str.replace(old, new[, count])

    Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

    Make sure your old and new strings are of type Unicode (that applies to your input data as well).

    Find out what encoding your input (non-Unicode) string is supposed to be in. For example, it may be Latin-1. Use the built-in str.decode() method to create a Unicode version of your data, and feed that to str.replace().

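    >>> newdata = data.decode("latin1")               # byte string -> unicode (Python 2)
    >>> unioldchars = oldchars.decode("latin1")
    >>> for old, new in zip(unioldchars, newchars):   # replace() matches whole substrings, so go one character at a time
    ...     newdata = newdata.replace(old, new)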
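
For a fixed one-to-one character mapping like the one in the question, a translation table may be simpler than chaining many replace() calls, since the string is walked only once. The following is a minimal sketch, assuming Python 3 (where every str is Unicode) rather than the Python 2 of Django 1.1; the names MAPPING, TABLE, fix_letters and redecode are made up for illustration. The second function rests on a further assumption that is not stated in the question: the pairs listed there happen to match Windows-1257 (Baltic) bytes that were read as Latin-1, so re-decoding might repair the text without any table at all.

```python
# Assumption: Python 3, where every str is Unicode.
# Mapping from the question: mis-decoded letter -> intended letter.
MAPPING = {
    "\xe0": "\u0105", "\xe8": "\u010d", "\xe6": "\u0119",
    "\xeb": "\u0117", "\xe1": "\u012f", "\xf0": "\u0161",
    "\xf8": "\u0173", "\xfb": "\u016b", "\xfe": "\u017e",
    "\xc0": "\u0104", "\xc8": "\u010c", "\xc6": "\u0118",
    "\xcb": "\u0116", "\xc1": "\u012e", "\xd0": "\u0160",
    "\xd8": "\u0172", "\xdb": "\u016a", "\xde": "\u017d",
}

# str.maketrans converts the one-character keys into the
# ordinal-keyed table that str.translate expects.
TABLE = str.maketrans(MAPPING)


def fix_letters(text):
    """Replace every mapped character in a single pass over text."""
    return text.translate(TABLE)


def redecode(text, wrong="latin-1", right="cp1257"):
    """Undo a wrong decode by re-encoding and decoding again.

    Assumption: the original bytes were Windows-1257 read as Latin-1.
    Raises UnicodeError if text contains characters outside `wrong`
    or bytes that are undefined in `right`.
    """
    return text.encode(wrong).decode(right)
```

Characters that are not in the table pass through fix_letters() unchanged. If the re-decoding assumption holds for your data, fixing the encoding at the point where the PHP side hands data over is probably cleaner than patching letters afterwards. On Python 2, unicode.translate() takes a dict keyed by ordinals instead, so the table would need to be built with ord() on each key.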