python unicode encoding utf-8 elementtree

ElementTree and unicode


I have this character (è) in an XML file:

<data>
  <products>
      <color>fumè</color>
  </products>
</data>

I try to generate an instance of ElementTree with the following code:

string_data = open('file.xml')
x = ElementTree.fromstring(unicode(string_data.encode('utf-8')))

and I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 185: ordinal not in range(128)

(NOTE: The position is not exact, I sampled the xml from a larger one).

How can I solve this? Thanks.


Solution

  • You do not need to decode XML for ElementTree to work. XML carries its own encoding information (defaulting to UTF-8) and ElementTree does the work for you, outputting unicode:

    >>> data = '''\
    ... <data>
    ...   <products>
    ...       <color>fumè</color>
    ...   </products>
    ... </data>
    ... '''
    >>> x = ElementTree.fromstring(data)
    >>> x[0][0].text
    u'fum\xe8'
    

    If your data is in a file or file-like object, just pass the filename or file object directly to the ElementTree.parse() function (it returns an ElementTree; call getroot() on it to get the root element):

    x = ElementTree.parse('file.xml')
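
Both approaches can be sketched together roughly as follows. This is a minimal sketch assuming Python 3 (the session above is Python 2), where the undecoded input is a bytes object. The XML declaration and the temporary file standing in for file.xml are additions for illustration, not part of the original question:

```python
import os
import tempfile
import xml.etree.ElementTree as ElementTree

# Raw bytes as they would sit on disk. The XML declaration (added here for
# illustration) names the encoding, so the parser can decode the bytes itself.
xml_bytes = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<data>\n'
    '  <products>\n'
    '      <color>fum\u00e8</color>\n'
    '  </products>\n'
    '</data>\n'
).encode('utf-8')

# Approach 1: hand the undecoded bytes straight to fromstring().
root = ElementTree.fromstring(xml_bytes)
color_from_bytes = root[0][0].text

# Approach 2: let parse() read the file itself.
# tmp_path is a hypothetical temporary file standing in for 'file.xml'.
with tempfile.NamedTemporaryFile(suffix='.xml', delete=False) as tmp:
    tmp.write(xml_bytes)
    tmp_path = tmp.name
try:
    tree = ElementTree.parse(tmp_path)
    # parse() returns an ElementTree; getroot() gives the <data> element.
    color_from_file = tree.getroot().find('products/color').text
finally:
    os.remove(tmp_path)

# Both routes should yield the same decoded text.
print(color_from_bytes == color_from_file)
```

In neither case does the calling code call encode() or decode() on the XML before parsing; the parser picks the codec from the document itself.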