I'm trying to batch-process a couple of XML files through a python script, with the XML files having line breaks in some of their attributes like so:
<?xml version='1.0' encoding='UTF-8'?>
<xml>
<test value="This is
a test
with line breaks
"/>
</xml>
However, I noticed that line breaks within attributes are removed when parsing this file. For example, the following script:
import lxml.etree as ET
with open("input.xml", "r", encoding="utf-8") as f:
source = ET.parse(f)
root = source.getroot()
dest = ET.ElementTree(root)
dest.write("output.xml", encoding="utf-8", xml_declaration=True)
would produce the following output file:
<?xml version='1.0' encoding='UTF-8'?>
<xml>
<test value="This is a test with line breaks "/>
</xml>
While this seems to be in line with W3Cs recommendations as per this related answer, is there a way to use xml.etree
or lxml.etree
for modifying the XML file without removing those line breaks?
One dirty hack what I think could be with bs4:
import re
from bs4 import BeautifulSoup
# Read XML as text
with open("input.xml", encoding="utf-8") as f:
xml_text = f.read()
# extract attributevalue per regex and store it e.g. in a dict key, value
match = re.search(r'value="(.*?)"', xml_text, re.DOTALL)
raw_value = match.group(1) if match else None
print("Raw string: ", repr(raw_value))
print()
# Parse your xml (replace attribute by placeholder e.g. key if necessary))
soup = BeautifulSoup(xml_text, "xml")
test_tag = soup.find("test")
# reset attributvalue from dict
if raw_value is not None:
test_tag["value"] = raw_value
# result
print(soup.prettify())
Output:
Raw string: 'This is\n a test\n\n with line breaks\n '
<?xml version="1.0" encoding="utf-8"?>
<xml>
<test value="This is
a test
with line breaks
"/>
</xml>