Tags: python, xml, lxml, basex, xpath-1.0

number() not supported in LXML XPath parsing?


I am running into some unexpected issues with XPath. I have a query that runs fine when using a database system like BaseX, but in Python with lxml it throws an error.

Here is an example:

import lxml.etree as ET


xml = """<tree>
    <node begin="0" cat="top" end="27" id="0" rel="top">
      <node begin="0" end="1" frame="punct(aanhaal_both)" id="1" lcat="--" lemma="&apos;" pos="punct" postag="LET()" pt="let" rel="--" root="&quot;" sense="&quot;" special="aanhaal_both" word="&quot;"/>
      <node begin="12" end="13" frame="punct(komma)" id="2" lcat="punct" lemma="," pos="punct" postag="LET()" pt="let" rel="--" root="," sense="," special="komma" word=","/>
      <node begin="1" cat="du" end="26" id="3" rel="--">
        <node begin="1" cat="du" end="20" id="4" rel="dp">
          <node begin="1" cat="cp" end="12" id="5" rel="sat">
            <node begin="1" end="2" frame="complementizer(al)" id="6" lcat="cp" lemma="al" pos="comp" postag="BW()" pt="bw" rel="cmp" root="al" sc="al" sense="al" word="Al"/>
            <node begin="2" cat="sv1" end="12" id="7" rel="body">
              <node begin="2" end="3" frame="verb(hebben,sg1,transitive)" id="8" infl="sg1" lcat="sv1" lemma="geven" pos="verb" postag="WW(pv,tgw,ev)" pt="ww" pvagr="ev" pvtijd="tgw" rel="hd" root="geef" sc="transitive" sense="geef" tense="present" word="geef" wvorm="pv"/>
              <node begin="3" case="both" def="def" end="4" frame="pronoun(nwh,je,sg,de,both,def,wkpro)" gen="de" getal="ev" id="9" lcat="np" lemma="je" naamval="nomin" num="sg" pdtype="pron" per="je" persoon="2v" pos="pron" postag="VNW(pers,pron,nomin,red,2v,ev)" pt="vnw" rel="su" root="je" sense="je" special="wkpro" status="red" vwtype="pers" wh="nwh" word="je"/>
              <node begin="4" cat="np" end="12" id="10" rel="obj1">
                <node begin="4" end="5" frame="noun(de,count,pl)" gen="de" getal="mv" graad="basis" id="11" lcat="np" lemma="programmeur" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="programmeur" sense="programmeur" word="programmeurs"/>
                <node begin="5" cat="pp" end="12" id="12" rel="mod">
                  <node begin="5" end="6" frame="preposition(van,[af,uit,vandaan,[af,aan]])" id="13" lcat="pp" lemma="van" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="van" sense="van" vztype="init" word="van"/>
                  <node begin="6" cat="np" end="12" id="14" rel="obj1">
                    <node begin="6" end="7" frame="noun(de,count,pl)" gen="de" getal="mv" graad="basis" id="15" lcat="np" lemma="vertaalcomputer" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="vertaal_computer" sense="vertaal_computer" word="vertaalcomputers"/>
                    <node begin="7" cat="pp" end="12" id="16" rel="mod">
                      <node begin="7" end="8" frame="er_vp_adverb" getal="getal" id="17" lcat="advp" lemma="er" naamval="stan" pdtype="adv-pron" persoon="3" pos="adv" postag="VNW(aanw,adv-pron,stan,red,3,getal)" pt="vnw" rel="obj1" root="er" sense="er" special="er" status="red" vwtype="aanw" word="er"/>
                      <node begin="8" cat="np" end="11" id="18" rel="mod">
                        <node begin="8" end="9" frame="modal_adverb" id="19" lcat="advp" lemma="nog" pos="adv" postag="BW()" pt="bw" rel="mod" root="nog" sc="modal" sense="nog" word="nog"/>
                        <node begin="9" end="10" frame="number(hoofd(pl_num))" id="20" infl="pl_num" lcat="detp" lemma="vijftig" naamval="stan" numtype="hoofd" pos="num" positie="prenom" postag="TW(hoofd,prenom,stan)" pt="tw" rel="det" root="vijftig" sense="vijftig" special="hoofd" word="vijftig"/>
                        <node begin="10" end="11" frame="tmp_noun(het,count,meas)" gen="het" genus="onz" getal="ev" graad="basis" id="21" lcat="np" lemma="jaar" naamval="stan" ntype="soort" num="meas" pos="noun" postag="N(soort,ev,basis,onz,stan)" pt="n" rel="hd" root="jaar" sense="jaar" special="tmp" word="jaar"/>
                      </node>
                      <node begin="11" end="12" frame="preposition(bij,[vandaan])" id="22" lcat="pp" lemma="bij" pos="prep" postag="VZ(fin)" pt="vz" rel="hd" root="bij" sense="bij" vztype="fin" word="bij"/>
                    </node>
                  </node>
                </node>
              </node>
            </node>
          </node>
        </node>
      </node>
      <node begin="26" end="27" frame="punct(punt)" id="42" lcat="--" lemma="." pos="punct" postag="LET()" pt="let" rel="--" root="." sense="." special="punt" word="."/>
    </node>
  </tree>
"""

xpath = """//node[@cat="cp" and node[@rel="cmp" and @pt="vg" and number(@begin) < ../node[@rel="body" and @cat="ssub"]/node[@rel="vc" and @cat="ppart"]/node[@rel="hd" and @pt="ww"]/number(@begin)] and node[@rel="body" and @cat="ssub" and node[@rel="vc" and @cat="ppart" and node[@rel="hd" and @pt="ww" and number(@begin) < ../../node[@rel="hd" and @pt="ww"]/number(@begin)]] and node[@rel="hd" and @pt="ww"]]]"""

root = ET.fromstring(xml)
results = root.xpath(xpath)
print(results)

Here is a beautified version of the XPath:

//node[
    @cat="cp" and 
    node[
        @rel="cmp" and 
        @pt="vg" and 
        number(@begin) < ../node[
            @rel="body" and 
            @cat="ssub"
        ]/node[
            @rel="vc" and 
            @cat="ppart"
        ]/node[
            @rel="hd" and 
            @pt="ww"
        ]/number(@begin)
    ] and 
    node[
        @rel="body" and 
        @cat="ssub" and 
        node[
            @rel="vc" and 
            @cat="ppart" and 
            node[
                @rel="hd" and 
                @pt="ww" and 
                number(@begin) < ../../node[
                    @rel="hd" and 
                    @pt="ww"
                ]/number(@begin)
            ]
        ] and 
        node[
            @rel="hd" and 
            @pt="ww"
        ]
    ]
]

When I remove the number comparisons and simplify the query to the following, I get no results (as expected), but at least no error is raised.

//node[
    @cat="cp" and 
    node[
        @rel="cmp" and 
        @pt="vg"
    ] and 
    node[
        @rel="body" and 
        @cat="ssub" and 
        node[
            @rel="vc" and 
            @cat="ppart" and 
            node[
                @rel="hd" and 
                @pt="ww"
            ]
        ] and 
        node[
            @rel="hd" and 
            @pt="ww"
        ]
    ]
]

So how can I use number() in my XPath in Python (preferably with lxml, but I am open to other libraries too)? And why does this work in BaseX but not in Python?


Solution

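  • In XPath 1.0, the only XPath version lxml implements, a function call cannot appear on the right-hand side of the "/" operator. BaseX accepts it because it implements XQuery 3.1, where a function call is a valid path step. This is the part lxml rejects: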

    number(@begin) < ../../node[
                        @rel="hd" and 
                        @pt="ww"
                    ]/number(@begin)
    

    In XPath 1.0 the operands of < are automatically converted to numbers, so you should be able to drop number() entirely and write:

    @begin < ../../node[
                        @rel="hd" and 
                        @pt="ww"
                    ]/@begin
    

    By the way, with XPath questions do please use a version-specific tag - XPath 1.0, 2.0, and 3.0/3.1 are very different.
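
To check the rewrite with lxml, here is a minimal sketch. The miniature tree is hypothetical (it is not the question's data, which contains no match) and was made up so that the query has exactly one hit; the `begin` values 2, 9 and 10 are chosen so that a string comparison ("9" vs. "10") would give a different answer than a numeric one. The variable names are illustrative only.

```python
import lxml.etree as ET

# Hypothetical miniature tree, built only to exercise the query.
xml = """<tree>
  <node id="0" cat="cp">
    <node id="1" rel="cmp" pt="vg" begin="2"/>
    <node id="2" rel="body" cat="ssub">
      <node id="3" rel="vc" cat="ppart">
        <node id="4" rel="hd" pt="ww" begin="9"/>
      </node>
      <node id="5" rel="hd" pt="ww" begin="10"/>
    </node>
  </node>
</tree>"""
root = ET.fromstring(xml)

# XPath 2.0+ style: number() used as a path step. XPath 1.0 does not allow this.
xpath_2_style = (
    '//node[@rel="cmp" and number(@begin) < '
    '../node[@rel="body"]/node[@rel="vc"]/node[@rel="hd"]/number(@begin)]'
)
try:
    root.xpath(xpath_2_style)
    rejected = False
except ET.XPathError:  # base class of lxml's XPath syntax and evaluation errors
    rejected = True

# XPath 1.0 style: compare the attributes directly; "<" converts both sides to numbers.
xpath_1_style = """//node[
    @cat="cp" and
    node[
        @rel="cmp" and @pt="vg" and
        @begin < ../node[@rel="body" and @cat="ssub"]/node[@rel="vc" and @cat="ppart"]/node[@rel="hd" and @pt="ww"]/@begin
    ] and
    node[
        @rel="body" and @cat="ssub" and
        node[
            @rel="vc" and @cat="ppart" and
            node[
                @rel="hd" and @pt="ww" and
                @begin < ../../node[@rel="hd" and @pt="ww"]/@begin
            ]
        ] and
        node[@rel="hd" and @pt="ww"]
    ]
]"""
matched_ids = [n.get("id") for n in root.xpath(xpath_1_style)]
print(rejected, matched_ids)
```

One semantic difference to keep in mind: when the path on the right selects several attributes, the XPath 1.0 comparison is true if any one of them satisfies it. If you want an explicit conversion instead, wrapping the whole path as number(../../node[@rel="hd" and @pt="ww"]/@begin) is valid XPath 1.0, but it uses only the first selected node in document order.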