I have a PySpark DataFrame with a column containing XML strings, and I'm using XPath queries with absolute paths to extract data from these XML strings. However, I've noticed that the XPath queries return lists that omit values if they are not present, rather than including None in their place. I would like to keep the length of the lists consistent, filling in None where data is missing.
Here is the sample data and code I'm working with:
data = [
(1, """<root>
<level1>
<level2>
<level3>
<data2>Lion</data2>
<level4>
<data>Apple</data>
</level4>
</level3>
</level2>
<level2>
<level3>
<level4>
<data>Banana</data>
</level4>
</level3>
</level2>
<level2>
<level3>
<data2>Tiger</data2>
<level4>
<data>Cranberry</data>
</level4>
</level3>
</level2>
</level1>
</root>"""),
(2, """<root>
<level1>
<level2>
<level3>
<data2>Lion</data2>
<level4>
<data>Apple</data>
</level4>
</level3>
</level2>
<level2>
<level3>
<data2>Tiger</data2>
<level4>
<data>Banana</data>
</level4>
</level3>
</level2>
<level2>
<level3>
<data2>Zebra</data2>
<level4></level4>
</level3>
</level2>
</level1>
</root>""")
]
df = spark.createDataFrame(data, ["id", "xml_string"])
What the XPath queries currently return, shown as (id, data, data2):
(1, ["Apple","Banana","Cranberry"], ["Lion","Tiger"])
(2, ["Apple","Banana"], ["Lion","Tiger","Zebra"])
What I want:
(1, ["Apple","Banana","Cranberry"], ["Lion", None, "Tiger"])
(2, ["Apple","Banana", None], ["Lion","Tiger","Zebra"])
How can I adjust my XPath queries? These are the two I'm using:
root/level1/level2/level3/level4/data
root/level1/level2/level3/data2
It's tricky because (unless I'm mistaken) your XPath implementation here is only version 1.0. In later XPath versions you could write something like:
for $item in
    root/level1/level2/level3/level4
return
    if ($item/data) then
        $item/data
    else
        "NULL"
... and return a sequence of items, each of which is either a data element node or a string (the "NULL" placeholder, specifically) standing in for a level4 element that has no data element.
Unfortunately, XPath 1.0's data model does not have the "sequence" data type offered by later versions. Instead it has the "node-set" data type, which as the name implies is necessarily composed only of nodes drawn from the XML. This means your XPath query can't return a mixture of nodes and strings. If you want to return multiple values, they must be fragments extracted from your actual XML document.
However, there's a possibility that may meet your requirements: when a level4 element has no data child element, you could try returning the first text node that follows the level4 element. That text node contains just white space: a newline character followed by 12 space characters. Here's a snippet of your XML, showing the text nodes more clearly by representing the space characters with visible symbols. The text node we would want to return, in the case of a level4 element not containing a data element, is the one between </level4> and </level3>:
············<level3>
················<data2>Zebra</data2>
················<level4></level4>
············</level3>
The appropriate XPath query would be:
root/level1/level2/level3/level4/data
|
root/level1/level2/level3/level4[not(data)]/following-sibling::text()[1]
The | (set union) operator will join the results of the two path expressions and should ensure you get one node in your result for every level4 element. The resulting list of nodes should be returned in the same order as they appear in the document, too. You should end up with a DataFrame where missing values are represented as strings containing just white space, which should hopefully not be too onerous to deal with in a subsequent step (assuming you can distinguish these "missing data" values from the data in your actual data elements).
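To make that subsequent step concrete, here is a minimal sketch in plain Python, offered as an illustration rather than a definitive implementation. It uses only the standard library's xml.etree.ElementTree, which does not support the | operator or the following-sibling axis, so the hypothetical helper select_like_union only imitates what the union query is expected to select for this document layout: it does not run the query itself. The sample XML, its indentation, and the helper names select_like_union and blanks_to_none are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on row 2 of the question; the 4-space
# indentation is an assumption made for this illustration.
XML = """<root>
    <level1>
        <level2>
            <level3>
                <data2>Lion</data2>
                <level4>
                    <data>Apple</data>
                </level4>
            </level3>
        </level2>
        <level2>
            <level3>
                <data2>Zebra</data2>
                <level4></level4>
            </level3>
        </level2>
    </level1>
</root>"""


def select_like_union(xml_string):
    """Imitate the union query: one string per level4, in document order."""
    root = ET.fromstring(xml_string)
    values = []
    for level4 in root.findall("./level1/level2/level3/level4"):
        data = level4.find("data")
        if data is not None:
            # level4 has a data child: take its text content.
            values.append(data.text)
        else:
            # No data child: .tail is the text right after </level4>,
            # i.e. the whitespace-only text node described above.
            values.append(level4.tail)
    return values


def blanks_to_none(values):
    """Replace whitespace-only (or missing) strings with None."""
    return [v if v is not None and v.strip() else None for v in values]


raw = select_like_union(XML)
cleaned = blanks_to_none(raw)
print(raw)
print(cleaned)
```

Note that blanks_to_none cannot tell a placeholder apart from a genuine data element whose text is only white space, which is the caveat mentioned above. In Spark, the same whitespace-to-None mapping would need to be applied to each element of the array column that the XPath query returns.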