python web-scraping xpath minidom

Need help writing an xpath string to match multiple, but not all, table cells


Note: The question has been updated since some of the early answers were given. It's still the same question, just hopefully clearer.

I'm trying to get a site scraper working properly, and I'm having trouble coming up with a suitable XPath string for some table cells.

<tbody>
  <tr>
    <td class="Label" width="20%" valign="top">Uninteresting section</td>
    <td class="Data"> I don't care about this</td>
  </tr>
  <tr>
    <td></td>
    <td class="Data"> I don't care about this</td>
  </tr>
  <tr>
    <td class="Label" width="20%" valign="top">Interesting section</td>
    <td class="Data"> I want this-1</td>
  </tr>
  <tr>
    <td></td>
    <td class="Data"> I want this-2</td>
  </tr>
  <tr>
    <td></td>
    <td class="Data"> I want this-n</td>
  </tr>
  <tr>
    <td class="Label" width="20%" valign="top">Uninteresting section</td>
    <td class="Data"> I don't care about this</td>
  </tr>
  <tr>
    <td></td>
    <td class="Data"> I don't care about this</td>
  </tr>
</tbody>

I want the contents of all the Data fields in the interesting section. There can be an arbitrary number of these. I don't care about anything else in the code, but I need all these.

In the example above, that means: I want this-1, I want this-2, I want this-n

If it's relevant, I'm using xml.dom.minidom and py-dom-xpath with Python 2.7.


Solution

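  • You can get all n tds that follow the section label (including those of later sections) with the following, where "Section title" stands for the label text, e.g. "Interesting section":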

     //tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"]/text()
    

    Then you can get the m tds that belong to the following sections, which you don't want, with

    //tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class="Label"][1]/following::td[@class = "Data"]/text()
    

    and then, in Python, keep only the first n - m tds.

    You could try to do the same in XPath with the position and count functions:

      //tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"][position() <= (count(//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"]/text())  - count(//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class="Label"][1]/following::td[@class = "Data"]/text()) )]/text()
    

    And if you had XPath 2.0, you could do it more elegantly with the except operator:

     //tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"]/text() except  //tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class="Label"][1]/following::td[@class = "Data"]/text()
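
The "first n - m" step can be sketched in Python roughly as follows. This is only an illustration under some assumptions: it uses the simplified table from the question (so there is no outer tr with class "Entry"), it relies on xml.dom.minidom alone, and it emulates the two following:: queries by walking the td elements in document order instead of calling py-dom-xpath. The helper names cell_text, data_after and section_data are made up for this example.

```python
from xml.dom import minidom

# Simplified copy of the table from the question (an assumption about the real page).
SAMPLE = """<tbody>
  <tr><td class="Label" width="20%" valign="top">Uninteresting section</td>
      <td class="Data"> I don't care about this</td></tr>
  <tr><td></td><td class="Data"> I don't care about this</td></tr>
  <tr><td class="Label" width="20%" valign="top">Interesting section</td>
      <td class="Data"> I want this-1</td></tr>
  <tr><td></td><td class="Data"> I want this-2</td></tr>
  <tr><td></td><td class="Data"> I want this-n</td></tr>
  <tr><td class="Label" width="20%" valign="top">Uninteresting section</td>
      <td class="Data"> I don't care about this</td></tr>
  <tr><td></td><td class="Data"> I don't care about this</td></tr>
</tbody>"""


def cell_text(td):
    """Concatenate the direct text nodes of a td element."""
    return "".join(n.data for n in td.childNodes if n.nodeType == n.TEXT_NODE)


def data_after(cells, start):
    """Text of every Data cell after index `start`, in document order.

    Plays the role of .../following::td[@class = "Data"]/text().
    """
    return [cell_text(td) for td in cells[start + 1:]
            if td.getAttribute("class") == "Data"]


def section_data(doc, title):
    """Return the Data cells of the first section whose label contains `title`."""
    cells = doc.getElementsByTagName("td")  # document order
    labels = [i for i, td in enumerate(cells)
              if td.getAttribute("class") == "Label"]
    start = next((i for i in labels if title in cell_text(cells[i])), None)
    if start is None:
        return []
    all_after = data_after(cells, start)            # the n cells
    later = [i for i in labels if i > start]
    unwanted = data_after(cells, later[0]) if later else []  # the m cells
    return all_after[:len(all_after) - len(unwanted)]        # first n - m


doc = minidom.parseString(SAMPLE)
result = [s.strip() for s in section_data(doc, "Interesting section")]
print(result)
```

The substring test is case-sensitive, like contains() in XPath, so "Interesting section" does not match the "Uninteresting section" labels. With a real XPath engine, the two lists would come from the first two queries above, and only the final slice would be needed.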