Note: The question has been updated since some of the early answers were given. It's still the same question, just hopefully clearer.
I'm trying to get a site scraper working properly, and I'm having trouble coming up with a suitable XPath expression for some table cells.
<tbody>
<tr>
<td class="Label" width="20%" valign="top">Uninteresting section</td>
<td class="Data"> I don't care about this</td>
</tr>
<tr>
<td></td>
<td class="Data"> I don't care about this</td>
</tr>
<tr>
<td class="Label" width="20%" valign="top">Interesting section</td>
<td class="Data"> I want this-1</td>
</tr>
<tr>
<td></td>
<td class="Data"> I want this-2</td>
</tr>
<tr>
<td></td>
<td class="Data"> I want this-n</td>
</tr>
<tr>
<td class="Label" width="20%" valign="top">Uninteresting section</td>
<td class="Data"> I don't care about this</td>
</tr>
<tr>
<td></td>
<td class="Data"> I don't care about this</td>
</tr>
</tbody>
I want the contents of all the Data fields in the interesting section. There can be an arbitrary number of these. I don't care about anything else in the code, but I need all these.
In the example above, that means: "I want this-1", "I want this-2", "I want this-n".
If it's relevant, I'm using xml.dom.minidom and py-dom-xpath with Python 2.7.
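You can get all n Data tds that follow the section label (including those of later sections) with the expression below, where "Section title" stands for the label text, e.g. "Interesting section":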
//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"]/text()
Then you can get the m Data tds that belong to the later sections, which you don't want, with
//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class="Label"][1]/following::td[@class = "Data"]/text()
and then, in Python, keep only the first n - m tds of the first result.
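As a rough illustration of the n - m idea, here is a minimal sketch that uses only xml.dom.minidom and plain Python instead of py-dom-xpath, so it does not depend on any XPath engine. The helper names cell_text and section_data are hypothetical, and the markup is a trimmed copy of the example from the question; adapt both to the real page.

```python
from xml.dom import minidom

# Trimmed copy of the example table (attributes other than class omitted).
HTML = """<tbody>
<tr><td class="Label">Uninteresting section</td><td class="Data"> I don't care about this</td></tr>
<tr><td></td><td class="Data"> I don't care about this</td></tr>
<tr><td class="Label">Interesting section</td><td class="Data"> I want this-1</td></tr>
<tr><td></td><td class="Data"> I want this-2</td></tr>
<tr><td></td><td class="Data"> I want this-n</td></tr>
<tr><td class="Label">Uninteresting section</td><td class="Data"> I don't care about this</td></tr>
<tr><td></td><td class="Data"> I don't care about this</td></tr>
</tbody>"""


def cell_text(td):
    """Concatenate the direct text nodes of a td and strip whitespace."""
    return "".join(n.data for n in td.childNodes
                   if n.nodeType == n.TEXT_NODE).strip()


def section_data(doc, title):
    """Return the Data cell texts between the matching Label and the next Label."""
    tds = list(doc.getElementsByTagName("td"))  # document order
    # Find the label cell; substring match mirrors XPath contains() (case-sensitive).
    start = None
    for i, td in enumerate(tds):
        if td.getAttribute("class") == "Label" and title in cell_text(td):
            start = i
            break
    if start is None:
        return []
    following = tds[start + 1:]
    # n: every Data cell after the label, including later sections.
    all_after = [td for td in following if td.getAttribute("class") == "Data"]
    # m: every Data cell after the *next* label, i.e. the unwanted ones.
    after_next = []
    for j, td in enumerate(following):
        if td.getAttribute("class") == "Label":
            after_next = [t for t in following[j + 1:]
                          if t.getAttribute("class") == "Data"]
            break
    keep = len(all_after) - len(after_next)  # n - m
    return [cell_text(td) for td in all_after[:keep]]


doc = minidom.parseString(HTML)
print(section_data(doc, "Interesting section"))
```

If the section is the last one in the table, there is no next label, m is 0, and every following Data cell is kept.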
You could try to do the same in XPath with the position and count functions:
//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"][position() <= (count(//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"]/text()) - count(//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class="Label"][1]/following::td[@class = "Data"]/text()) )]/text()
And if you had XPath 2.0, you could do it more elegantly with the except operator:
//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"]/text() except //tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class="Label"][1]/following::td[@class = "Data"]/text()