Tags: python, python-3.x, xml, parsing, lxml

lxml target interface splits data on non-ascii characters -- how can I get the whole string?


Here's a file test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<list>
  <entry>data</entry>
  <entry>Łódź</entry>
  <entry>data Łódź</entry>
</list>

and here's a simple Python script that parses it into a list with lxml:

from lxml import etree

class ParseTarget:
    def __init__(self):
        self.entries = []
    def start(self, tag, attrib):
        pass
    def end(self, tag):
        pass
    def data(self, data):
        str = data.strip()
        if str != '':
            self.entries.append(data)
    def close(self):
        # Reset parser
        entries = self.entries
        self.entries = []
        # And return results
        return entries

target = etree.XMLParser(target=ParseTarget(),
                         # Including/removing this makes no difference
                         encoding='UTF-8')

tree = etree.parse("./test.xml", target)

# Expected value of tree:
# ['data', 'Łódź', 'data Łódź']
# Actual value of tree
# ['data', 'Łódź', 'data ', 'Łódź']
# What gives!!!?

As the comment says, I would expect to end up with a list of three elements, but I get four. This is a minimal demonstration of a general problem: any string that starts with at least one ASCII character and then contains non-ASCII characters comes back as two strings instead of one, split at the point where the non-ASCII characters start.

I don't want this to happen (i.e. I want to just get a list of three strings). What should I do?

I'm using Python 3.11.2


Solution

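  • Collect the text chunks in data() and only emit the entry from the end handler, which also resets the buffer:

    Explanation of Steps

    The third <entry> (<entry>data Łódź</entry>) mixes ASCII and non-ASCII characters: "data Łódź". The parser may deliver the text of a single element in several data() calls, and here it splits "data Łódź" into two:

    data("data ")
    data("Łódź")

    This is why the chunks have to be accumulated and joined back into "data Łódź" once the element ends.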

    from lxml import etree

    class ParseTarget:
        def __init__(self):
            self.entries = []
            self.current_text = []

        def start(self, tag, attrib):
            # A new element begins: drop any text collected so far
            self.current_text = []

        def end(self, tag):
            # Join the chunks without a separator: the parser may have split
            # the text at any point, so the pieces already contain all spaces
            text = "".join(self.current_text).strip()
            if text:
                self.entries.append(text)
            self.current_text = []  # Reset for the next element

        def data(self, data):
            # Keep every chunk, including whitespace; filtering happens in end()
            self.current_text.append(data)

        def close(self):
            # Reset the target and return the results
            entries = self.entries
            self.entries = []
            return entries

    target = etree.XMLParser(target=ParseTarget(), encoding='UTF-8')

    tree = etree.parse("./test.xml", target)
    print(tree)
    # ['data', 'Łódź', 'data Łódź']
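To see the chunking for yourself without a test.xml on disk, here is a minimal sketch that parses the same document from memory and records every raw data() call next to the joined text per element. LoggingTarget and its attributes are hypothetical names made up for this illustration. As an assumption for portability, it falls back to the standard library's xml.etree.ElementTree (which exposes the same start/end/data/close target interface) if lxml is not installed. Exactly where the text gets split is up to the underlying parser, so the raw chunks may differ between the two.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib parser is an acceptable stand-in for this demo,
    # since it supports the same target interface.
    import xml.etree.ElementTree as etree

# Same content as test.xml, passed as bytes so the encoding declaration is honoured
XML = """<?xml version="1.0" encoding="UTF-8"?>
<list>
  <entry>data</entry>
  <entry>Łódź</entry>
  <entry>data Łódź</entry>
</list>
""".encode("utf-8")


class LoggingTarget:
    """Hypothetical helper: logs raw data() chunks and joins text per element."""

    def __init__(self):
        self.chunks = []    # every string passed to data(), in order
        self.entries = []   # one joined, stripped string per non-empty element
        self._buffer = []   # chunks of the element currently being read

    def start(self, tag, attrib):
        self._buffer = []

    def data(self, data):
        self.chunks.append(data)
        self._buffer.append(data)

    def end(self, tag):
        # Join without a separator, then drop surrounding whitespace
        text = "".join(self._buffer).strip()
        if text:
            self.entries.append(text)
        self._buffer = []

    def close(self):
        return self.entries


log = LoggingTarget()
# With a target parser, fromstring() returns whatever the target's close() returns
entries = etree.fromstring(XML, etree.XMLParser(target=log))

print(log.chunks)  # raw pieces, split wherever the parser chose
print(entries)
```

Whatever the raw chunks look like, entries should contain the three strings 'data', 'Łódź' and 'data Łódź', because the pieces are only joined once the element has ended.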