I am running into some unexpected issues with XPath. I have a query that runs fine in a database system like BaseX, but in Python with lxml it throws an error.
Here is an example:
import lxml.etree as ET
xml = """<tree>
<node begin="0" cat="top" end="27" id="0" rel="top">
<node begin="0" end="1" frame="punct(aanhaal_both)" id="1" lcat="--" lemma="'" pos="punct" postag="LET()" pt="let" rel="--" root="&quot;" sense="&quot;" special="aanhaal_both" word="&quot;"/>
<node begin="12" end="13" frame="punct(komma)" id="2" lcat="punct" lemma="," pos="punct" postag="LET()" pt="let" rel="--" root="," sense="," special="komma" word=","/>
<node begin="1" cat="du" end="26" id="3" rel="--">
<node begin="1" cat="du" end="20" id="4" rel="dp">
<node begin="1" cat="cp" end="12" id="5" rel="sat">
<node begin="1" end="2" frame="complementizer(al)" id="6" lcat="cp" lemma="al" pos="comp" postag="BW()" pt="bw" rel="cmp" root="al" sc="al" sense="al" word="Al"/>
<node begin="2" cat="sv1" end="12" id="7" rel="body">
<node begin="2" end="3" frame="verb(hebben,sg1,transitive)" id="8" infl="sg1" lcat="sv1" lemma="geven" pos="verb" postag="WW(pv,tgw,ev)" pt="ww" pvagr="ev" pvtijd="tgw" rel="hd" root="geef" sc="transitive" sense="geef" tense="present" word="geef" wvorm="pv"/>
<node begin="3" case="both" def="def" end="4" frame="pronoun(nwh,je,sg,de,both,def,wkpro)" gen="de" getal="ev" id="9" lcat="np" lemma="je" naamval="nomin" num="sg" pdtype="pron" per="je" persoon="2v" pos="pron" postag="VNW(pers,pron,nomin,red,2v,ev)" pt="vnw" rel="su" root="je" sense="je" special="wkpro" status="red" vwtype="pers" wh="nwh" word="je"/>
<node begin="4" cat="np" end="12" id="10" rel="obj1">
<node begin="4" end="5" frame="noun(de,count,pl)" gen="de" getal="mv" graad="basis" id="11" lcat="np" lemma="programmeur" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="programmeur" sense="programmeur" word="programmeurs"/>
<node begin="5" cat="pp" end="12" id="12" rel="mod">
<node begin="5" end="6" frame="preposition(van,[af,uit,vandaan,[af,aan]])" id="13" lcat="pp" lemma="van" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="van" sense="van" vztype="init" word="van"/>
<node begin="6" cat="np" end="12" id="14" rel="obj1">
<node begin="6" end="7" frame="noun(de,count,pl)" gen="de" getal="mv" graad="basis" id="15" lcat="np" lemma="vertaalcomputer" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="vertaal_computer" sense="vertaal_computer" word="vertaalcomputers"/>
<node begin="7" cat="pp" end="12" id="16" rel="mod">
<node begin="7" end="8" frame="er_vp_adverb" getal="getal" id="17" lcat="advp" lemma="er" naamval="stan" pdtype="adv-pron" persoon="3" pos="adv" postag="VNW(aanw,adv-pron,stan,red,3,getal)" pt="vnw" rel="obj1" root="er" sense="er" special="er" status="red" vwtype="aanw" word="er"/>
<node begin="8" cat="np" end="11" id="18" rel="mod">
<node begin="8" end="9" frame="modal_adverb" id="19" lcat="advp" lemma="nog" pos="adv" postag="BW()" pt="bw" rel="mod" root="nog" sc="modal" sense="nog" word="nog"/>
<node begin="9" end="10" frame="number(hoofd(pl_num))" id="20" infl="pl_num" lcat="detp" lemma="vijftig" naamval="stan" numtype="hoofd" pos="num" positie="prenom" postag="TW(hoofd,prenom,stan)" pt="tw" rel="det" root="vijftig" sense="vijftig" special="hoofd" word="vijftig"/>
<node begin="10" end="11" frame="tmp_noun(het,count,meas)" gen="het" genus="onz" getal="ev" graad="basis" id="21" lcat="np" lemma="jaar" naamval="stan" ntype="soort" num="meas" pos="noun" postag="N(soort,ev,basis,onz,stan)" pt="n" rel="hd" root="jaar" sense="jaar" special="tmp" word="jaar"/>
</node>
<node begin="11" end="12" frame="preposition(bij,[vandaan])" id="22" lcat="pp" lemma="bij" pos="prep" postag="VZ(fin)" pt="vz" rel="hd" root="bij" sense="bij" vztype="fin" word="bij"/>
</node>
</node>
</node>
</node>
</node>
</node>
</node>
</node>
<node begin="26" end="27" frame="punct(punt)" id="42" lcat="--" lemma="." pos="punct" postag="LET()" pt="let" rel="--" root="." sense="." special="punt" word="."/>
</node>
</tree>
"""
xpath = """//node[@cat="cp" and node[@rel="cmp" and @pt="vg" and number(@begin) < ../node[@rel="body" and @cat="ssub"]/node[@rel="vc" and @cat="ppart"]/node[@rel="hd" and @pt="ww"]/number(@begin)] and node[@rel="body" and @cat="ssub" and node[@rel="vc" and @cat="ppart" and node[@rel="hd" and @pt="ww" and number(@begin) < ../../node[@rel="hd" and @pt="ww"]/number(@begin)]] and node[@rel="hd" and @pt="ww"]]]"""
root = ET.fromstring(xml)
results = root.xpath(xpath)
print(results)
Here is a beautified version of the XPath:
//node[
@cat="cp" and
node[
@rel="cmp" and
@pt="vg" and
number(@begin) < ../node[
@rel="body" and
@cat="ssub"
]/node[
@rel="vc" and
@cat="ppart"
]/node[
@rel="hd" and
@pt="ww"
]/number(@begin)
] and
node[
@rel="body" and
@cat="ssub" and
node[
@rel="vc" and
@cat="ppart" and
node[
@rel="hd" and
@pt="ww" and
number(@begin) < ../../node[
@rel="hd" and
@pt="ww"
]/number(@begin)
]
] and
node[
@rel="hd" and
@pt="ww"
]
]
]
When I remove the number comparisons and simplify the query to the following, I do not get any results (as expected), but at least I do not get any errors either.
//node[
@cat="cp" and
node[
@rel="cmp" and
@pt="vg"
] and
node[
@rel="body" and
@cat="ssub" and
node[
@rel="vc" and
@cat="ppart" and
node[
@rel="hd" and
@pt="ww"
]
] and
node[
@rel="hd" and
@pt="ww"
]
]
]
So how can I use number() in my XPath in Python (preferably with lxml, but I am open to other libraries too)? And why does this work in BaseX but not in Python?
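lxml is built on libxml2, which implements only XPath 1.0, whereas BaseX supports XPath 3.1. In XPath 1.0 a function call cannot appear on the right-hand side of the "/" operator, so this part of your query is invalid there: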
number(@begin) < ../../node[
@rel="hd" and
@pt="ww"
]/number(@begin)
In XPath 1.0 the operands of < are automatically converted to numbers, so you should be able to write:
@begin < ../../node[
@rel="hd" and
@pt="ww"
]/@begin
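As a minimal sketch of the difference, the following uses a small made-up document (not the tree from the question) and a shortened, hypothetical version of the predicate. It assumes lxml is installed and that it rejects the function-call step with an `XPathError` subclass, as described in the question.

```python
import lxml.etree as ET

# Tiny illustrative document; the attribute values are made up for this sketch.
xml = """<tree>
  <node cat="cp" id="0">
    <node rel="cmp" pt="vg" begin="2" id="1"/>
    <node rel="body" cat="ssub" id="2">
      <node rel="hd" pt="ww" begin="10" id="3"/>
    </node>
  </node>
</tree>"""
root = ET.fromstring(xml)

# XPath 2.0+ style: number() used as a path step after "/".
xpath_2_style = (
    '//node[@rel="cmp" and number(@begin) < '
    '../node[@rel="body"]/node[@rel="hd"]/number(@begin)]'
)
try:
    root.xpath(xpath_2_style)
    rejected = False
except ET.XPathError:
    # XPath 1.0 only allows location steps after "/", not function calls.
    rejected = True

# XPath 1.0 style: compare the attributes directly; "<" converts both
# sides to numbers, so "2" < "10" is evaluated numerically.
xpath_1_style = (
    '//node[@rel="cmp" and @begin < '
    '../node[@rel="body"]/node[@rel="hd"]/@begin]'
)
ids = [node.get("id") for node in root.xpath(xpath_1_style)]
print(rejected, ids)
```

The same rewrite should apply to the first comparison in the full query as well: drop number() on both sides and end the path in @begin.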
By the way, with XPath questions do please use a version-specific tag - XPath 1.0, 2.0, and 3.0/3.1 are very different.