xml pyspark xpath

XPath Query Returns Lists Omitting Missing Values Instead of Including None


I have a PySpark DataFrame with a column containing XML strings, and I'm using XPath queries with absolute paths to extract data from these XML strings. However, I've noticed that the XPath queries return lists that omit values if they are not present, rather than including None in their place. I would like to keep the length of the lists consistent, filling in None where data is missing.

Here is the sample data and code I'm working with:

data = [
    (1, """<root>
    <level1>
        <level2>
            <level3>
            <data2>Lion</data2>
                <level4>
                    <data>Apple</data>
                </level4>
            </level3>
        </level2>
        <level2>
            <level3>
                <level4>
                    <data>Banana</data>
                </level4>
            </level3>
        </level2>
        <level2>
            <level3>
            <data2>Tiger</data2>
                <level4>
                    <data>Cranberry</data>
                </level4>
            </level3>
        </level2>
    </level1>
</root>"""),
    (2, """<root>
    <level1>
        <level2>
            <level3>
            <data2>Lion</data2>
                <level4>
                    <data>Apple</data>
                </level4>
            </level3>
        </level2>
        <level2>
            <level3>
            <data2>Tiger</data2>
                <level4>
                    <data>Banana</data>
                </level4>
            </level3>
        </level2>
        <level2>
            <level3>
                <data2>Zebra</data2>
                <level4></level4>
            </level3>
        </level2>
    </level1>
</root>""")
]

df = spark.createDataFrame(data, ["id", "xml_string"])

What the XPath queries return:

For the data and data2 columns (id, data, data2):

(1, ["Apple","Banana","Cranberry"], ["Lion","Tiger"])
(2, ["Apple","Banana"], ["Lion","Tiger","Zebra"])

What I want:

For the data and data2 columns (id, data, data2):

(1, ["Apple","Banana","Cranberry"], ["Lion", None, "Tiger"])
(2, ["Apple","Banana", None], ["Lion","Tiger","Zebra"])

How can I adjust my XPath queries?

root/level1/level2/level3/level4/data
root/level1/level2/level3/data2

Solution

  • It's tricky because (unless I'm mistaken) your XPath implementation here is only version 1.0. In later XPath versions you could write something like:

    for $item in
       root/level1/level2/level3/level4
    return
       if ($item/data) then
          $item/data
       else
          "NULL"

    ... and return a sequence of items, each of which is either a data element or the string "NULL".

    Unfortunately, XPath 1.0's data model does not have the "sequence" data type offered by later versions. Instead it has the "nodeset" data type, which as the name implies is necessarily composed only of nodes drawn from the XML. This means your XPath query can't return a mixture of nodes and strings. If you want to return multiple values, they must be fragments extracted from your actual XML document.
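For illustration only, the per-element logic of that later-version expression (visit every parent element, emit its child's value or a placeholder) can be sketched in plain Python with the standard library's xml.etree.ElementTree. This steps outside XPath entirely and is not what Spark's built-in XPath support does. The helper name child_text_or_none is hypothetical, and the XML is a shortened version of row 2 from the question.

```python
import xml.etree.ElementTree as ET

# Shortened version of row 2 from the question: the third level4 has no <data> child.
XML = """<root>
    <level1>
        <level2><level3><data2>Lion</data2><level4><data>Apple</data></level4></level3></level2>
        <level2><level3><data2>Tiger</data2><level4><data>Banana</data></level4></level3></level2>
        <level2><level3><data2>Zebra</data2><level4></level4></level3></level2>
    </level1>
</root>"""


def child_text_or_none(xml_string, parent_path, child_tag):
    """Hypothetical helper: return one entry per element matched by parent_path,
    holding the text of its first child_tag child, or None if that child is absent."""
    root = ET.fromstring(xml_string)
    # ElementTree paths are relative to the root element, so "root/" is not repeated.
    return [parent.findtext(child_tag) for parent in root.findall(parent_path)]


data_values = child_text_or_none(XML, "level1/level2/level3/level4", "data")
data2_values = child_text_or_none(XML, "level1/level2/level3", "data2")
```

Because the loop runs over the parent elements (level4 for data, level3 for data2), every parent contributes exactly one entry, which is what keeps the list lengths consistent. Whether wrapping something like this in a Python UDF is acceptable for your workload is a separate question that I haven't tested.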

    Within XPath 1.0, however, there's a possibility that may meet your requirements: when a level4 element has no data child element, you could return the first text node that follows the level4 element instead. That text node contains only white space: a newline character followed by 12 space characters. Here's a snippet of your XML with the white space made visible (· for a space, ␤ for a newline). The text node we would want to return for a level4 element with no data child is the ␤ after </level4> together with the 12 spaces before </level3>.

    ············<level3>␤
    ················<data2>Zebra</data2>␤
    ················<level4></level4>␤
    ············</level3>␤

    The appropriate XPath query would be:

    root/level1/level2/level3/level4/data
    |
    root/level1/level2/level3/level4[not(data)]/following-sibling::text()[1]

    The | (set union) operator will join the results of the two path expressions and should ensure you get one node in your result for every level4. The resulting list of nodes should be returned in the same order as they appear in the document, too. You should end up with a dataframe where missing values are represented as strings containing just white space, which should hopefully not be too onerous to deal with in a subsequent step (assuming you can distinguish these "missing data" values from the data in your actual data elements).
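As a rough sketch of that subsequent step, here is a small plain-Python function that turns whitespace-only entries back into None while keeping the list length. The function name blank_to_none is hypothetical, and the sample list only assumes the shape described above (two real values followed by the newline-plus-indentation text node); I haven't run it against actual Spark output.

```python
def blank_to_none(values):
    """Hypothetical helper: replace empty or whitespace-only strings with None.

    The list length is preserved, so positions still line up with the level4 elements.
    """
    return [v if v is not None and v.strip() else None for v in values]


# Assumed shape of the extracted list for row 2: the last entry stands in for
# the missing <data> element (a newline followed by 12 spaces).
raw = ["Apple", "Banana", "\n" + " " * 12]
cleaned = blank_to_none(raw)
```

Note that a data element whose real content is blank would also become None here, so this only works if white space is never a legitimate value in your data.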