python lxml elementtree

Preserve line breaks in XML attributes when parsing with lxml


I'm trying to batch-process a couple of XML files with a Python script. Some of their attributes contain line breaks, like so:

<?xml version='1.0' encoding='UTF-8'?>
<xml>
    <test value="This is
    a test

    with line breaks
    "/>
</xml>

However, I noticed that line breaks within attributes are lost when the file is parsed: each one is replaced by a space. For example, the following script:

import lxml.etree as ET

with open("input.xml", "r", encoding="utf-8") as f:
    source = ET.parse(f)
    root = source.getroot()
    dest = ET.ElementTree(root)
    dest.write("output.xml", encoding="utf-8", xml_declaration=True)

would produce the following output file:

<?xml version='1.0' encoding='UTF-8'?>
<xml>
    <test value="This is  a test   with line breaks  "/>
</xml>

While this seems to be in line with the W3C's recommendation, as per this related answer, is there a way to use xml.etree or lxml.etree to modify the XML file without removing those line breaks?


Solution

  • One dirty hack I can think of uses bs4:

    import re
    from bs4 import BeautifulSoup
    
    # Read XML as text 
    with open("input.xml", encoding="utf-8") as f:
        xml_text = f.read()
    
    # Extract the raw attribute value with a regex (with several attributes, store them e.g. in a dict as key/value pairs)
    match = re.search(r'value="(.*?)"', xml_text, re.DOTALL)
    raw_value = match.group(1) if match else None
    print("Raw string: ", repr(raw_value))
    print()
    
    # Parse the XML (if necessary, first replace the attribute value with a placeholder, e.g. the dict key)
    soup = BeautifulSoup(xml_text, "xml")
    test_tag = soup.find("test")
    
    # Restore the raw attribute value
    if raw_value is not None:
        test_tag["value"] = raw_value
    
    # result
    print(soup.prettify())
    

    Output:

    Raw string:  'This is\n    a test\n\n    with line breaks\n    '
    
    <?xml version="1.0" encoding="utf-8"?>
    <xml>
     <test value="This is
        a test
    
        with line breaks
        "/>
    </xml>
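
A variant of the same idea that stays within xml.etree might be to protect the line breaks before parsing instead of restoring them afterwards. Attribute-value normalization only affects literal line breaks; a character reference such as `&#10;` survives parsing as a real newline. The sketch below rewrites the raw text accordingly. The names `ATTR_VALUE` and `protect_attribute_newlines` are hypothetical, and the regex is a deliberate simplification: it only handles double-quoted values and would also rewrite anything that merely looks like `="..."` in text content, comments or CDATA sections.

```python
import re
import xml.etree.ElementTree as ET

# Simplified pattern for a double-quoted attribute value (an assumption,
# not a full XML tokenizer): '="' followed by anything except '"' or '<'.
ATTR_VALUE = re.compile(r'="([^"<]*)"')


def protect_attribute_newlines(xml_text):
    """Replace literal line breaks inside attribute values with &#10;."""
    def repl(match):
        value = match.group(1).replace("\r\n", "\n").replace("\n", "&#10;")
        return '="' + value + '"'
    return ATTR_VALUE.sub(repl, xml_text)


xml_text = '<xml>\n    <test value="This is\n    a test"/>\n</xml>'

# Parsed as-is: the parser normalizes the line break to a space.
plain = ET.fromstring(xml_text)
plain_value = plain.find("test").get("value")

# Parsed after protection: the line break is kept in the attribute value.
protected = ET.fromstring(protect_attribute_newlines(xml_text))
protected_value = protected.find("test").get("value")

print(repr(plain_value))
print(repr(protected_value))

# On output, the serializer escapes the newline as a character reference again.
serialized = ET.tostring(protected, encoding="unicode")
print(serialized)
```

Note that the written file then contains `&#10;` rather than a visible line break inside the attribute. Any XML parser reads the two forms differently (the reference keeps the newline, the literal break becomes a space), so this keeps the data intact but does not reproduce the original layout of the file.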