
Match elements containing both specific text and any number of any digits


I'm looking for an XPath 1.0 query that would look somewhat like this:

//*[contains(text(), 'EXAMPLE') and translate(text(), translate(text(), '0123456789', ''), '') != '']

But the problem with this query is that there is no longer the text "EXAMPLE" after it performs the second part of the query, so it ultimately fails.

What I need is to match all elements that contain both the text "EXAMPLE" and at least one digit (any digits, any number of them).

I'm using Octoparse, which only supports XPath 1.0. Is this possible at all? I've tried asking ChatGPT a thousand times, but it keeps giving me the same illogical queries like the one above, which cannot work.


Solution

  • You haven't provided your source XML, so it's not possible to be sure why your query doesn't work for you. My guess, though, is that the bug comes from how the translate and contains functions behave when their first argument is a node-set containing more than one node. The String Functions section of the XPath 1.0 spec says:

    A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned.

    So if your XML looked like this:

    <root>
       text node 1
       <child>blah</child>
       EXAMPLE 123
    </root>
    

    ... then the following expression would return false:

    contains(text(), 'EXAMPLE')
    

    ... because the expression text() would return the two text nodes which are children of <root>, only the first of them is converted to a string, and the EXAMPLE text is contained in the second.
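
    A rough way to see this outside XPath is the Python sketch below. It uses the standard-library xml.etree.ElementTree module, which is a choice made here for illustration and not something from your setup. ElementTree has no text-node objects, so the two text children of <root> are read from root.text and child.tail.

```python
import xml.etree.ElementTree as ET

XML = """<root>
   text node 1
   <child>blah</child>
   EXAMPLE 123
</root>"""

root = ET.fromstring(XML)

# ElementTree exposes an element's first child text node as `.text`, and
# the text that follows a child element as that child's `.tail`.
text_nodes = [root.text] + [child.tail for child in root]
text_nodes = [t for t in text_nodes if t is not None]

# XPath 1.0 converts a node-set to a string by taking only the node that
# is first in document order (or "" if the node-set is empty).
first_in_document_order = text_nodes[0] if text_nodes else ""

print(len(text_nodes))                        # number of child text nodes
print("EXAMPLE" in first_in_document_order)   # what contains(text(), ...) tests
print("EXAMPLE" in text_nodes[1])             # the node that is never looked at
```

    The first text node holds only "text node 1", which is why the contains() test comes out false even though EXAMPLE is present in the element.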

    Perhaps you should be checking the string-value of the elements themselves, rather than the value of their first text node? If you pass the element itself as the first parameter to the string functions, then that will be converted to a concatenation of all the text nodes contained within that element (including inside child elements). In XPath 1.0 it's not possible to concatenate just the child text nodes and exclude the text nodes within child elements.

    e.g. you could try:

    //*[contains(., 'EXAMPLE') and translate(., translate(., '0123456789', ''), '') != '']
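
    The nested translate() calls can look as if they destroy the text, but XPath functions return new strings and leave the context node untouched, so both halves of the predicate test the same string-value. The sketch below emulates translate() in Python to show what each call returns. The names xpath_translate, digits_only and matches are hypothetical helpers written for this illustration, not part of any library.

```python
def xpath_translate(s: str, src: str, dst: str) -> str:
    """Emulate XPath 1.0 translate(s, src, dst)."""
    mapping = {}
    for i, ch in enumerate(src):
        if ch not in mapping:  # the first occurrence in src wins
            # Characters with no counterpart in dst are removed (None).
            mapping[ch] = dst[i] if i < len(dst) else None
    out = []
    for ch in s:
        if ch not in mapping:
            out.append(ch)            # not listed in src: kept as-is
        elif mapping[ch] is not None:
            out.append(mapping[ch])   # listed in src: replaced
    return "".join(out)


def digits_only(s: str) -> str:
    # Inner call: strip the digits, leaving every non-digit character.
    non_digits = xpath_translate(s, "0123456789", "")
    # Outer call: strip those non-digit characters from the ORIGINAL s,
    # so only the digits remain.
    return xpath_translate(s, non_digits, "")


def matches(s: str) -> bool:
    # Same shape as the predicate: contains(...) and translate(...) != ''
    return "EXAMPLE" in s and digits_only(s) != ""


print(digits_only("EXAMPLE 123"))
print(matches("EXAMPLE 123"), matches("EXAMPLE"), matches("123"))
```

    Because s itself is never modified, the contains() half of the predicate is unaffected by the translate() half.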
    

    NB the result of this expression would include not just the innermost elements for which this is true, but also all of their ancestors, right up to the root element, because an ancestor's string-value contains the string-values of all its descendants. To exclude those ancestor elements, you could use this expression:

    //*
       [contains(., 'EXAMPLE') and translate(., translate(., '0123456789', ''), '') != '']
       [not(
          *[contains(., 'EXAMPLE') and translate(., translate(., '0123456789', ''), '') != '']
       )]
    

    That would exclude an element whose string-value matched your criteria if it also contained a child element whose string-value matched those criteria.
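
    The effect of that second predicate can be sketched in Python with the standard-library xml.etree.ElementTree module. The sample document, and the helper names string_value and matches, are made up for this illustration; "".join(el.itertext()) is used as a stand-in for the XPath string-value of an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the question.
XML = (
    "<root>"
    "<a>EXAMPLE 1<b>x</b></a>"
    "<c><d>EXAMPLE 42</d></c>"
    "<e>EXAMPLE</e>"
    "</root>"
)


def string_value(el):
    # Concatenation of all descendant text nodes, like string(.) in XPath.
    return "".join(el.itertext())


def matches(el):
    value = string_value(el)
    has_digit = any(ch in "0123456789" for ch in value)
    return "EXAMPLE" in value and has_digit


root = ET.fromstring(XML)

# First predicate only: ancestors match too, since their string-value
# includes the text of every descendant.
all_matches = [el.tag for el in root.iter() if matches(el)]

# Both predicates: drop any element that has a matching child element.
innermost = [
    el.tag
    for el in root.iter()
    if matches(el) and not any(matches(child) for child in el)
]

print(all_matches)
print(innermost)
```

    Checking only the child elements is enough: if any deeper descendant matches, the child on the path down to it matches as well.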