
Get a list of all the encodings Python can encode to


I am writing a script that will try encoding bytes into many different encodings in Python 2.6. Is there some way to get a list of available encodings that I can iterate over?

The reason I'm trying to do this is that a user has some text that is not encoded correctly, so it contains funny characters. I know which unicode character is messing it up. I want to be able to give them an answer like "Your text editor is interpreting that string as X encoding, not Y encoding". I thought I would encode that character using one encoding, then decode it again using another encoding, and see if we get the same character sequence.

i.e. something like this:

import itertools

for encoding1, encoding2 in itertools.permutations(encodinglist(), 2):
    try:
        unicode_string = my_unicode_character.encode(encoding1).decode(encoding2)
    except (UnicodeError, LookupError):
        pass
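For illustration, here is a runnable version of that idea (Python 3 syntax) with a small hand-picked shortlist standing in for the hypothetical encodinglist(); the candidate names and the sample text are assumptions:

```python
import itertools

# Assumed shortlist of candidate encodings; encodinglist() above is hypothetical.
candidates = ['utf-8', 'cp1252', 'latin1', 'mac_roman']
text = u'caf\xe9'  # the text the user actually meant: 'café'

# Map each (written-as, misread-as) pair to the mojibake it would produce.
results = {}
for enc1, enc2 in itertools.permutations(candidates, 2):
    try:
        garbled = text.encode(enc1).decode(enc2)
    except (UnicodeError, LookupError):
        continue  # that pair can't represent this text; skip it
    if garbled != text:
        results[(enc1, enc2)] = garbled
```

For example, utf-8 bytes misread as cp1252 turn 'café' into 'cafÃ©', so matching the user's funny characters against `results` suggests which pair of encodings is in play.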

Solution

  • Unfortunately encodings.aliases.aliases.keys() is NOT an appropriate answer.

aliases (as one would/should expect) contains several cases where different keys are mapped to the same value, e.g. both 1252 and windows_1252 are mapped to cp1252. You could save time if, instead of aliases.keys(), you used set(aliases.values()).

    BUT THERE'S A WORSE PROBLEM: aliases doesn't contain codecs that don't have aliases (like cp856, cp874, cp875, cp737, and koi8_u).

    >>> from encodings.aliases import aliases
    >>> def find(q):
    ...     return [(k,v) for k, v in aliases.items() if q in k or q in v]
    ...
    >>> find('1252') # multiple aliases
    [('1252', 'cp1252'), ('windows_1252', 'cp1252')]
    >>> find('856') # no codepage 856 in aliases
    []
    >>> find('koi8') # no koi8_u in aliases
    [('cskoi8r', 'koi8_r')]
    >>> 'x'.decode('cp856') # but cp856 is a valid codec
    u'x'
    >>> 'x'.decode('koi8_u') # but koi8_u is a valid codec
    u'x'
    >>>
    
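Since aliases is incomplete, a fuller list can be built by scanning the stdlib encodings package itself and keeping only the module names that codecs.lookup() accepts. A sketch (Python 3 syntax; it assumes CPython's layout, where each codec lives as a module inside that package):

```python
import codecs
import encodings
import pkgutil

def all_codec_names():
    """Return sorted codec names found as modules in the encodings package."""
    names = set()
    for _importer, modname, _ispkg in pkgutil.iter_modules(encodings.__path__):
        try:
            codecs.lookup(modname)  # skips helper modules such as 'aliases'
        except LookupError:
            continue
        names.add(modname)
    return sorted(names)
```

With this, cp856 and koi8_u show up even though neither has an entry in aliases.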

    It's also worth noting that however you obtain a full list of codecs, it may be a good idea to ignore the codecs that aren't about encoding/decoding character sets, but do some other transformation e.g. zlib, quopri, and base64.
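One heuristic for weeding those out, under Python 3 semantics where str.encode() refuses bytes-to-bytes and str-to-str transforms (on Python 2 those transforms behave differently, so this test does not transfer directly):

```python
def is_character_codec(name):
    """Heuristic: a true character-set codec round-trips a simple string.
    On Python 3, transforms such as zlib_codec, base64_codec and rot_13
    are rejected by str.encode() with a LookupError/TypeError instead."""
    try:
        return 'x'.encode(name).decode(name) == 'x'
    except (LookupError, TypeError, UnicodeError, ValueError):
        return False
```

Filtering the full codec list through this predicate leaves only the encodings worth trying in the encode/decode loop above.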

    Which brings us to the question of WHY you want to "try encoding bytes into many different encodings". If we know that, we may be able to steer you in the right direction.

    For a start, that's ambiguous. One DEcodes bytes into unicode, and one ENcodes unicode into bytes. Which do you want to do?

What are you really trying to achieve? Are you trying to determine which codec to use to decode some incoming bytes, and planning to attempt this with all possible codecs? [Note: latin1 will decode anything.] Are you trying to determine the language of some unicode text by trying to encode it with all possible codecs? [Note: utf8 will encode anything.]
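Those two bracketed notes can be checked directly (Python 3 syntax):

```python
# latin1 decodes any byte string: each byte 0x00-0xFF maps to the
# codepoint with the same number, so decoding can never fail.
data = bytes(range(256))
text = data.decode('latin1')
assert text.encode('latin1') == data  # and it round-trips losslessly

# utf-8 encodes any unicode text: every valid codepoint has a
# utf-8 byte sequence, so encoding can never fail either.
sample = ''.join(chr(c) for c in range(0x250, 0x2500, 7))
assert sample.encode('utf-8').decode('utf-8') == sample
```

This is exactly why "try every codec" is a weak diagnostic on its own: latin1 will always succeed at decoding and utf-8 will always succeed at encoding, telling you nothing about which encoding the text is really in.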