html web-scraping xpath

Can XPath locate nodes after arbitrary occurrences of text in non-valid markup?


I have a document written by a naughty web developer, which looks something like:

<div id="details">
    Here is some text without a p tag. Oh, let's write some more.
    <br>
    <br>
    And some more.
    <table id="non-unique">
        ...
    </table>
    Replaces the following numbers:
    <table id="non-unique">
        ... good stuff in here
    </table>
</div>

So, it's not well marked up. I need to get hold of the table with the good stuff in it. However, it doesn't have a unique id value, and it is not always in the same position: it may not be last in the div, for example.

The only constant is that it always follows the text "Replaces the following numbers:". That text may appear bare, as in the example above, or sometimes inside an h4 element!

Is it possible to use an XPath expression to pull this table out by searching for the "Replaces" string and then asking for the next table element?

Thanks!


Solution

  • That looks valid to me:

    //text()[contains(.,"Replaces the following numbers")]/following-sibling::table[1]
    

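    XPath itself has no rule that ids must be unique (HTML does, but parsers tolerate duplicates), and this expression never tests the id anyway.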
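
    One caveat: following-sibling only finds the table when the marker text is a direct child of the div. When the text sits inside an h4, the text node's siblings are inside the h4, so nothing matches. A possible way around this is the following axis, which looks at everything later in document order. Below is a minimal sketch in Python, assuming the third-party lxml library is available; the sample markup, the XPATH constant and the find_table helper are made up for illustration.

```python
from lxml import html  # third-party; assumed installed (pip install lxml)

# Variant of the answer's expression using the "following" axis, so it also
# works when the marker text is wrapped in another element such as <h4>.
XPATH = '//text()[contains(., "Replaces the following numbers")]/following::table[1]'


def find_table(markup):
    """Return the first <table> after the marker text, or None if absent."""
    tree = html.fromstring(markup)
    hits = tree.xpath(XPATH)
    return hits[0] if hits else None


# Hypothetical sample modelled on the question: marker text as a bare text node.
bare = """<div id="details">
  Some text.<br><br>And some more.
  <table id="non-unique"><tr><td>other</td></tr></table>
  Replaces the following numbers:
  <table id="non-unique"><tr><td>good stuff</td></tr></table>
</div>"""

# Same document, but with the marker text inside an <h4>.
wrapped = bare.replace(
    "Replaces the following numbers:",
    "<h4>Replaces the following numbers:</h4>",
)

for doc in (bare, wrapped):
    table = find_table(doc)
    print(table.text_content().strip())
```

    Note that following::table[1] will happily return a table outside the div if none follows the marker inside it, so it may be worth checking the result's ancestors when the page contains other tables.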