I have an XML dictionary file, based on the XDXF dictionary format, that I would like to convert to YAML and round-trip back to XML.
This format (with DTD) may contain <kref> (cross reference) elements around a word that is already surrounded by <deftext> tags (definitions). Or it may contain, for example, <sub> tags to indicate a word in subscript. I have not been able to see how to manage XML to YAML conversion of these files with yq (either the Go or the Python version).
<lexicon>
  <ar>
    <k id="fb982hk">Society</k>
    <def>
      <deftext>Plural form of word <kref>index</kref>.
      </deftext>
    </def>
  </ar>
  <ar>
    <k>CO
      <sub>2</sub>
    </k>
    <def>
      <deftext>Carbon dioxide (CO<sub>2</sub>) - a heavy odorless gas formed during respiration.
      </deftext>
    </def>
  </ar>
</lexicon>
yq -p=xml -o=yaml < sample.xml
lexicon:
  ar:
    - k:
        +content: Society
        +@id: fb982hk
      def:
        deftext:
          +content:
            - Plural form of word
            - .
          kref: index
    - k:
        +content: CO
        sub: "2"
      def:
        deftext:
          +content:
            - Carbon dioxide (CO
            - ) - a heavy odorless gas formed during respiration.
          sub: "2"
xq < sample.xml | yq -y
lexicon:
  ar:
  - k:
      '@id': fb982hk
      '#text': Society
    def:
      deftext:
        kref: index
        '#text': Plural form of word .
  - k:
      sub: '2'
      '#text': CO
    def:
      deftext:
        sub: '2'
        '#text': Carbon dioxide (CO) - a heavy odorless gas formed during respiration.
In both cases the <kref> and <sub> elements no longer 'surround' the correct text, so a conversion back to XML will not be correct either. Is this just a limitation of the format? Or is there some way to accommodate (or maybe ignore as XML?) these tags?
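You're struggling with the (general) way both mikefarah/yq and kislyuk/yq chose to represent the XML tree in JSON/YAML. There is no canonical solution to that, and both these approaches are lossy with respect to "Complex Types with Mixed Content", i.e. element nodes embedded among free-floating text nodes. Both outputs keep the text and the child elements, but neither records where each child element sat within the text, so the original markup cannot be restored on the way back.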
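To make the loss concrete, here is a minimal Python sketch using only the standard library. The helper names `ordered_children` and `keyed_children` are hypothetical, and `keyed_children` is only a simplified imitation of a key-based mapping, not the actual logic of either yq implementation. The two input strings are made-up variations on the question's sample. It shows two different mixed-content elements collapsing to the same key-based structure:

```python
import xml.etree.ElementTree as ET

# Two different documents: <sub> and <kref> swap places within the text.
A = "<deftext>CO<sub>2</sub>, see <kref>index</kref>.</deftext>"
B = "<deftext>CO<kref>index</kref>, see <sub>2</sub>.</deftext>"

def ordered_children(elem):
    """Content as an ordered list of text pieces and (tag, text) pairs."""
    parts = []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.append((child.tag, child.text or ""))
        if child.tail:  # text that follows the child inside the parent
            parts.append(child.tail)
    return parts

def keyed_children(elem):
    """Simplified key-based view: all text in one list, children keyed by tag."""
    texts = [t for t in [elem.text] + [c.tail for c in elem] if t]
    mapping = {"text": texts}
    for child in elem:
        mapping[child.tag] = child.text
    return mapping

a, b = ET.fromstring(A), ET.fromstring(B)
print(ordered_children(a) == ordered_children(b))  # order-preserving views differ
print(keyed_children(a) == keyed_children(b))      # key-based views are identical
```

Once the content is keyed by element name, nothing is left to tell the two inputs apart, which is why the round trip cannot put the tags back where they were.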
If you don't care about the markup information conveyed by the elements in question, you could flatten out these passages in a pre-processing step, e.g. using a simple XSL transformation like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copy every node and attribute unchanged -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <!-- Override for kref and sub: emit only their text content -->
  <xsl:template match="kref|sub">
    <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>
This uses a template matching node()|@* which copies every node and attribute unchanged (the identity transform), and another one that overrides this behavior for the kref and sub elements by copying over just their textual content.
Apply this XSLT to your XML document using an XSLT processor such as xsltproc, Saxon, or Xalan, and you should get the stripped version of your input:
<lexicon>
  <ar>
    <k id="fb982hk">Society</k>
    <def>
      <deftext>
        Plural form of word index.
      </deftext>
    </def>
  </ar>
  <ar>
    <k>CO2</k>
    <def>
      <deftext>
        Carbon dioxide (CO2) - a heavy odorless gas formed during respiration.
      </deftext>
    </def>
  </ar>
</lexicon>
This flattened document can then be fed into your original xq/yq pipeline.
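If no XSLT processor is at hand, the same flattening can be approximated in Python with the standard library. This is a sketch of an alternative, not a drop-in replacement for the stylesheet: the function name `flatten` and the `FLATTEN` set are made up for illustration, the sample is a compacted copy of the question's XML, and as far as I know ElementTree does not carry a DOCTYPE or comments through to the output, so check that against your real files.

```python
import xml.etree.ElementTree as ET

# Tags whose markup is dropped while their text is kept (same set as in the XSLT).
FLATTEN = {"kref", "sub"}

SAMPLE = """<lexicon>
<ar><k id="fb982hk">Society</k><def><deftext>Plural form of word <kref>index</kref>.</deftext></def></ar>
<ar><k>CO<sub>2</sub></k><def><deftext>Carbon dioxide (CO<sub>2</sub>) - a heavy odorless gas formed during respiration.</deftext></def></ar>
</lexicon>"""

def flatten(parent):
    """Replace each child whose tag is in FLATTEN by its text content, in place."""
    prev = None  # last child that was kept; flattened text is appended to its tail
    for child in list(parent):  # iterate over a copy, since children get removed
        if child.tag in FLATTEN:
            # All descendant text of the child, plus the text that followed it.
            text = "".join(child.itertext()) + (child.tail or "")
            if prev is None:
                parent.text = (parent.text or "") + text
            else:
                prev.tail = (prev.tail or "") + text
            parent.remove(child)
        else:
            flatten(child)  # recurse into elements that are kept
            prev = child

root = ET.fromstring(SAMPLE)
flatten(root)
flat_xml = ET.tostring(root, encoding="unicode")
print(flat_xml)
```

The printed document can be saved to a file and passed to yq in place of the XSLT output. As with the stylesheet, the kref and sub markup is discarded for good, so a later YAML to XML conversion yields the flattened form, not the original.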