html xml tcl dita tdom

Change <br> tag to <?linebreak?> using tcl tdom


I have an HTML input string that needs to be parsed and written out as DITA-compatible XML.

Input:

<p>Line with following newline<br>Line with two following newlines<br><br>Line with no following newline</p>

Desired Output:

<p>Line with following newline<?linebreak?>Line with two following newlines<?linebreak?><?linebreak?>Line with no following newline</p>

package require tdom

set xml {<p>Line with following newline<br>Line with two following newlines<br><br>Line with no following newline</p>}

puts "Input:"
puts "$xml"

set doc [dom parse -html -keepEmpties $xml]
set root [$doc documentElement]

foreach node [$root getElementsByTagName br] {
    $node delete
    #$node appendXML "<?linebreak?>"

}

puts "Output:"
puts [$doc asXML -indent none]

If I uncomment the line #$node appendXML "<?linebreak?>", the script fails. I'm new to tDOM but not to Tcl. Alternatively, maybe someone has a different idea for how to preserve line breaks in XML, specifically DITA.


Solution

  • Once you call delete on a tDOM node, it no longer exists, so you naturally get an error if you try to use it afterwards.

    One approach: for each br node, create a new processing instruction node and replace the br node with it (which first requires getting the br node's parent). Your loop would then look like:

    foreach node [$root getElementsByTagName br] {
        set lb [$doc createProcessingInstruction linebreak ""]
        [$node parentNode] replaceChild $lb $node
        # replaceChild moves the old node to the document fragment list;
        # just get rid of it completely since we're not going to reuse it
        $node delete
    }
    

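    and the modified program prints the following (the -html parser wraps the fragment in an html element, and the empty processing instruction is serialized as <?linebreak ?>):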

    Input:
    <p>Line with following newline<br>Line with two following newlines<br><br>Line with no following newline</p>
    Output:
    <html><p>Line with following newline<?linebreak ?>Line with two following newlines<?linebreak ?><?linebreak ?>Line with no following newline</p></html>
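
The output above still carries the html wrapper added by the -html parser, while the desired output starts at the p element. A minimal sketch of one way to get there, assuming it is acceptable to serialize only the first p element rather than the whole document (the variable names html, p and result are illustrative and not part of the original script):

```tcl
package require tdom

set html {<p>Line with following newline<br>Line with two following newlines<br><br>Line with no following newline</p>}

set doc [dom parse -html -keepEmpties $html]
set root [$doc documentElement]

# Swap every br element for an empty linebreak processing instruction
foreach node [$root getElementsByTagName br] {
    set lb [$doc createProcessingInstruction linebreak ""]
    [$node parentNode] replaceChild $lb $node
    # The replaced br node is no longer needed, so free it
    $node delete
}

# Serialize just the first p element instead of the whole document,
# which leaves out the html wrapper element
set p [lindex [$root selectNodes //p] 0]
set result [$p asXML -indent none]
puts $result

# Free the DOM tree once the string has been produced
$doc delete
```

If the input can contain several top-level elements, the same asXML call could be applied to each child of the root in turn and the results concatenated.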