xml jq yq xq

How to deal with embedded HTML in XML with jq, yq, xq (XML to YAML conversion)


I have an XML dictionary file, based on the XDXF dictionary format, that I would like to convert to YAML (and round-trip back to XML).

This format (with DTD) may contain <kref> (cross-reference) elements around a word inside the text of a <deftext> (definition) element, or, for example, <sub> tags to mark a word as subscript. I have not been able to work out how to handle XML-to-YAML conversion of such files with yq (either the Go or the Python version).

An abbreviated sample.xml (from the XDXF repo):

<lexicon>
    <ar>
        <k id="fb982hk">Society</k>
        <def>
            <deftext>Plural form of word <kref>index</kref>.
            </deftext>
        </def>
    </ar>
    <ar>
        <k>CO
            <sub>2</sub>
        </k>
        <def>
            <deftext>Carbon dioxide (CO<sub>2</sub>) - a heavy odorless gas formed during respiration.
            </deftext>
        </def>
    </ar>
</lexicon>

Converted to YAML via yq (Go), this renders:

 yq -p=xml -o=yaml < sample.xml 
lexicon:
  ar:
    - k:
        +content: Society
        +@id: fb982hk
      def:
        deftext:
          +content:
            - Plural form of word
            - .
          kref: index
    - k:
        +content: CO
        sub: "2"
      def:
        deftext:
          +content:
            - Carbon dioxide (CO
            - ) - a heavy odorless gas formed during respiration.
          sub: "2"

Converted to YAML via yq (Python), this renders:

 xq < sample.xml | yq -y 
lexicon:
  ar:
    - k:
        '@id': fb982hk
        '#text': Society
      def:
        deftext:
          kref: index
          '#text': Plural form of word .
    - k:
        sub: '2'
        '#text': CO
      def:
        deftext:
          sub: '2'
          '#text': Carbon dioxide (CO) - a heavy odorless gas formed during respiration.

In both cases the <kref> and <sub> elements no longer 'surround' the correct text, so converting back to XML will not be correct either. Is this just a limitation of the format? Or is there some way to accommodate (or maybe ignore as XML?) these tags?


Solution

  • The XML syntax isn't the problem.

    You're running into the (general) way both mikefarah/yq and kislyuk/yq chose to represent the XML tree in JSON/YAML. There is no canonical mapping, and both approaches are lossy with respect to "Complex Types with Mixed Content", i.e. element nodes interleaved with text nodes inside the same parent: the text pieces and the child elements are all kept, but their position relative to each other is not.
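To make the loss concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` on the `<deftext>` from the question. The `as_mapping` dictionary is a hand-built approximation of a key-based representation, written only for illustration; it is not the code either yq implementation actually runs.

```python
import xml.etree.ElementTree as ET

# The <deftext> from the question: text, a child element, then more text.
deftext = ET.fromstring("<deftext>Plural form of word <kref>index</kref>.</deftext>")

# A tree API keeps document order: leading text, then each child
# followed by the text that trails it (its "tail").
ordered = [deftext.text]
for child in deftext:
    ordered.append((child.tag, child.text))
    ordered.append(child.tail)
print(ordered)

# A key-based mapping (illustrative approximation) keeps every piece,
# but no longer records that <kref> sat between the two text fragments.
as_mapping = {"#text": [deftext.text] + [child.tail for child in deftext]}
for child in deftext:
    as_mapping[child.tag] = child.text
print(as_mapping)
```

Nothing in the second structure says whether `index` came before, between, or after the two text fragments, which is why a conversion back to XML cannot put the tag in the right place.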

    But modifying the XML syntax may be a solution.

    If you don't care about the markup information conveyed by the elements in question, you could flatten out these passages in a pre-processing step, e.g. using a simple XSL transformation like

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:template match="node()|@*">
            <xsl:copy>
                <xsl:apply-templates select="node()|@*"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="kref|sub">
            <xsl:value-of select="."/>
        </xsl:template>
    </xsl:stylesheet>
    

    This uses a template matching node()|@* which copies every node and attribute unchanged (the identity transform), and another one that overrides this behavior for the kref and sub elements by copying over just their textual content.

    Apply this XSLT to your XML document using an XSLT processor such as xsltproc, Saxon, or Xalan, and you should get the flattened version of your input (whitespace tidied here for readability):

    <lexicon>
      <ar>
        <k id="fb982hk">Society</k>
        <def>
          <deftext>
            Plural form of word index.
          </deftext>
        </def>
      </ar>
      <ar>
        <k>CO2</k>
        <def>
          <deftext>
            Carbon dioxide (CO2) - a heavy odorless gas formed during respiration.
          </deftext>
        </def>
      </ar>
    </lexicon>
    

    The result can then be fed into your original xq/yq pipeline.
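If no XSLT processor is at hand, the same flattening step can be sketched with Python's standard library. This is an alternative to the stylesheet above, not part of it; the names `FLATTEN` and `flatten_inline` are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Inline tags to dissolve, mirroring match="kref|sub" in the stylesheet.
FLATTEN = {"kref", "sub"}

def flatten_inline(parent):
    """Replace every kref/sub descendant of `parent` with its text content."""
    previous = None  # last child that was kept; merged text goes into its tail
    for child in list(parent):  # iterate over a copy, since children are removed
        flatten_inline(child)
        if child.tag in FLATTEN:
            # Text inside the element plus the text that followed it.
            merged = "".join(child.itertext()) + (child.tail or "")
            if previous is None:
                parent.text = (parent.text or "") + merged
            else:
                previous.tail = (previous.tail or "") + merged
            parent.remove(child)
        else:
            previous = child

root = ET.fromstring("<deftext>Plural form of word <kref>index</kref>.</deftext>")
flatten_inline(root)
print(ET.tostring(root, encoding="unicode"))
```

For a whole file, one would parse it with `ET.parse`, call `flatten_inline` on the root element, and write the tree back out before passing it to yq or xq. As with the XSLT, the markup is discarded, so a round trip back to XML will not restore the `<kref>` and `<sub>` tags.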