Here's a file test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<list>
<entry>data</entry>
<entry>Łódź</entry>
<entry>data Łódź</entry>
</list>
and here's a simple Python script that parses it into a list with lxml:
from lxml import etree

class ParseTarget:
    def __init__(self):
        self.entries = []

    def start(self, tag, attrib):
        pass

    def end(self, tag):
        pass

    def data(self, data):
        str = data.strip()
        if str != '':
            self.entries.append(data)

    def close(self):
        # Reset parser
        entries = self.entries
        self.entries = []
        # And return results
        return entries

target = etree.XMLParser(target=ParseTarget(),
                         # Including/removing this makes no difference
                         encoding='UTF-8')
tree = etree.parse("./test.xml", target)
# Expected value of tree:
# ['data', 'Łódź', 'data Łódź']
# Actual value of tree
# ['data', 'Łódź', 'data ', 'Łódź']
# What gives!!!?
As the comment says, I would expect to end up with a list of three elements, but I get four. This is a minimal demonstration of a general problem: any text that starts with at least one ASCII character and then contains non-ASCII characters comes back as two strings instead of one, split where the non-ASCII characters start.
I don't want this to happen (i.e. I want to just get a list of three strings). What should I do?
I'm using Python 3.11.2
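The parser does not guarantee one data() call per text node, so data() should not treat each call as a complete entry. Collect the chunks as they arrive and emit the entry in the end handler, resetting the buffer there.
Explanation of Steps
The third <entry> (<entry>data Łódź</entry>) contains both ASCII and non-ASCII characters. The parser may deliver "data Łódź" in multiple data() calls:
First: "data " (with a space at the end).
Second: "Łódź".
This is why the chunks have to be accumulated until the element ends, so that they add up to "data Łódź" again.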
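To see the chunking for yourself, here is a minimal sketch that records every data() call per <entry>. The names ChunkLogger, log_chunks and SAMPLE_XML are made up for this illustration, and the fallback import assumes that the standard library's xml.etree.ElementTree accepts the same kind of target object. How many chunks you get depends on the parser and its version, so the sketch only relies on the chunks concatenating back to the full text.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib parser supports the same target interface
    import xml.etree.ElementTree as etree

# Same content as test.xml, as bytes so the encoding declaration is honoured
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<list>\n"
    "<entry>data</entry>\n"
    "<entry>Łódź</entry>\n"
    "<entry>data Łódź</entry>\n"
    "</list>"
).encode("utf-8")


class ChunkLogger:
    """Hypothetical target: keeps the raw data() chunks of each <entry>."""

    def __init__(self):
        self.chunks_per_entry = []
        self._current = None  # None while we are outside an <entry>

    def start(self, tag, attrib):
        if tag == "entry":
            self._current = []

    def end(self, tag):
        if tag == "entry":
            self.chunks_per_entry.append(self._current)
            self._current = None

    def data(self, data):
        # Only record text that belongs to an <entry>
        if self._current is not None:
            self._current.append(data)

    def close(self):
        result, self.chunks_per_entry = self.chunks_per_entry, []
        return result


def log_chunks(xml_bytes):
    parser = etree.XMLParser(target=ChunkLogger())
    parser.feed(xml_bytes)
    return parser.close()  # returns whatever the target's close() returns


for chunks in log_chunks(SAMPLE_XML):
    # One list per <entry>; an entry may show up as one chunk or several
    print(chunks, "->", repr("".join(chunks)))
```

Whatever the split looks like on your machine, joining the chunks of one element with an empty separator gives back the original text.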
from lxml import etree

class ParseTarget:
    def __init__(self):
        self.entries = []
        self.current_text = []

    def start(self, tag, attrib):
        # A new element begins: discard any text collected so far
        self.current_text = []

    def end(self, tag):
        # Concatenate the chunks without a separator; they already
        # contain whatever whitespace was in the document
        text = "".join(self.current_text).strip()
        if text:
            self.entries.append(text)
        self.current_text = []  # Reset for the next element

    def data(self, data):
        # May be called several times per text node, so keep every chunk
        self.current_text.append(data)

    def close(self):
        # Reset the target and return the results
        entries = self.entries
        self.entries = []
        return entries

parser = etree.XMLParser(target=ParseTarget(), encoding='UTF-8')
entries = etree.parse("./test.xml", parser)
print(entries)
# ['data', 'Łódź', 'data Łódź']
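If you don't need the streaming behaviour of a parser target, a simpler option is to let the parser build the tree and read each element's .text, which is already assembled into a single string. This is only a sketch of that alternative, not something from the question: entry_texts and SAMPLE_XML are names invented here, and the fallback import assumes the standard library's ElementTree is an acceptable stand-in when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree API is close enough for this sketch
    import xml.etree.ElementTree as etree

# Same content as test.xml, as bytes so the encoding declaration is honoured
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<list>\n"
    "<entry>data</entry>\n"
    "<entry>Łódź</entry>\n"
    "<entry>data Łódź</entry>\n"
    "</list>"
).encode("utf-8")


def entry_texts(xml_bytes):
    """Return the stripped, non-empty text of every <entry> element."""
    root = etree.fromstring(xml_bytes)
    return [
        el.text.strip()
        for el in root.iter("entry")
        if el.text and el.text.strip()
    ]


print(entry_texts(SAMPLE_XML))
```

The trade-off is memory: this holds the whole document as a tree, while the target approach above only keeps the text of the element currently being parsed.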