My goal is to get the text:
27. The method according to claim 23 wherein...
How do I retrieve the text inside a tag that contains <? ... ?> markers? From googling, I believe they are called PHP short tags.
I am using lxml with XPath, and it does not seem to register them as tags or nodes. I also tried itertext(), but that doesn't work either.
<claim id="CLM-00027" num="00027">
<claim-text> <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys. <?insert-end id="REI-00005" ?></claim-text>
</claim>
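Here's a piece of code that does that. It uses XPath to reach the deepest 'valid' tag (claim-text), then getchildren() and tail to get from there to the actual text. The <? ... ?> markers are XML processing instructions rather than PHP tags; lxml exposes them as child nodes of claim-text, and the text that follows one is stored in its tail attribute.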
from lxml import etree  # a bare "import lxml" does not load the etree submodule

xml=""" <claim id="CLM-00027" num="00027">
<claim-text> <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys. <?insert-end id="REI-00005" ?></claim-text>
</claim>"""
root = etree.fromstring(xml)
e = root.xpath("/claim/claim-text")
# The first child of <claim-text> is the insert-start processing instruction;
# the claim text that follows it is stored in that node's tail.
res = e[0].getchildren()[0].tail
print(res)
Output:
27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.
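If relying on the position of the first child feels fragile, here is a minimal sketch of two alternatives, assuming the same sample document. The first selects the processing instruction by its target name (insert-start) with the XPath processing-instruction() node test and reads its tail; the second joins the text nodes directly under claim-text. The variable names are only illustrative, and the strip() calls are my own choice to drop the surrounding whitespace.

```python
from lxml import etree

xml = """<claim id="CLM-00027" num="00027">
<claim-text> <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys. <?insert-end id="REI-00005" ?></claim-text>
</claim>"""

root = etree.fromstring(xml)

# Option 1: pick the processing instruction by target name, then read its tail.
start_pi = root.xpath("/claim/claim-text/processing-instruction('insert-start')")[0]
text_from_tail = start_pi.tail.strip()

# Option 2: collect every text node directly under <claim-text> and join them.
# Processing instructions are not text nodes, so they are skipped.
text_nodes = root.xpath("/claim/claim-text/text()")
text_from_nodes = "".join(text_nodes).strip()

print(text_from_tail)
print(text_from_nodes)
```

Option 1 is the closer match when only the text after a specific marker is wanted; option 2 is simpler when all of the element's direct text is wanted regardless of where the markers sit.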