web-scraping xpath html-agility-pack

Need an XPath query that finds all <tr> elements containing 7 <td> elements


Hello and hopefully thanks for the help.

Honestly, I am not very experienced with XPath, and I am hoping a guru out there will have a quick answer for me.

I am scraping a web page for data. The defining aspect of the data I want is that it is contained in a row <tr> that has 7 <td> elements. Each <td> element has one of the pieces of data I need to import. I am using the HTML Agility Pack on CodePlex to grab the data, but I can't seem to figure out how to define the query.

Contained in the web page is a section like this:

<table border="0" cellpadding="3" cellspacing="1" width="100%">
  <tr class="bgWhite" xmlns:msxsl="urn:schemas-microsoft-com:xslt">
    <td class="dataHdrText02" valign="top" width="50" align="center"><nobr>SYMBOL</nobr></td>
    <td class="dataHdrText02" valign="top" align="center">PERIOD</td>
    <td class="dataHdrText02" valign="top" align="center" width="*">EVENT TITLE</td>
    <td class="dataHdrText02" valign="top" align="center">EPS ESTIMATE</td>
    <td class="dataHdrText02" valign="top" align="center">EPS ACTUAL</td>
    <td class="dataHdrText02" valign="top" align="center">PREV. YEAR ACTUAL</td>
    <td class="dataHdrText02" valign="top" align="center"><nobr>DATE/TIME (ET)</nobr></td>
  </tr>
  <tr class="bgWhite">
    <td align="center" width="50"><nobr>CSCO&#160;</nobr></td>
    <td align="center">Q4&#160;2011</td>
    <td align="left" width="*">Q4 2011 CISCO Systems Inc Earnings Release</td>
    <td align="center">$ 0.38&#160;</td>
    <td align="center">n/a&#160;</td>
    <td align="center">$ 0.43&#160;</td>
    <td align="center"><nobr>10-Aug-11</nobr></td>
  </tr>
  <tr class="bgWhite">
    <td align="center" width="50"><nobr>CSCO &#160;</nobr></td>
    <td align="center">Q3&#160;2011</td>
    <td align="left" width="*">Q3 2011 Cisco Systems Earnings Release</td>
    <td align="center">$ 0.37&#160;</td>
    <td align="center">$ 0.42&#160;</td>
    <td align="center">$ 0.42&#160;</td>
    <td align="center"><nobr>11-May-11 AMC</nobr></td>
  </tr>
  <tr class="bgWhite" xmlns:msxsl="urn:schemas-microsoft-com:xslt">
     <td align="center" colspan="7"><img src="/format/cb/images/spacer.gif" width="1" height="4"></td>
  </tr>
</table>

My goal is to grab the earnings event data and place it into a database for analysis. My original thought was to grab all <tr> elements with 7 <td> elements then work with that data. Any advice or alternative suggestions would be welcome.


Solution

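  • This should do it for you. count(td) counts only the direct <td> children of each <tr>, so the spacer row (a single <td> with colspan="7") is skipped. Note that the header row also has seven cells, so it will match too.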

    //tr[count(td)=7]
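
To sanity-check the predicate outside HTML Agility Pack, here is a minimal Python sketch using only the standard library. It makes several assumptions: the SAMPLE markup is a trimmed, well-formed stand-in for the table in the question (not the real page), and rows_with_n_cells is a hypothetical helper name. Python's ElementTree supports only a subset of XPath and has no count(), so the sketch applies the same test by hand rather than evaluating the expression itself.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the table in the question (assumed sample data).
SAMPLE = """
<table>
  <tr><td>SYMBOL</td><td>PERIOD</td><td>EVENT TITLE</td><td>EPS ESTIMATE</td>
      <td>EPS ACTUAL</td><td>PREV. YEAR ACTUAL</td><td>DATE/TIME (ET)</td></tr>
  <tr><td><nobr>CSCO&#160;</nobr></td><td>Q4&#160;2011</td>
      <td>Q4 2011 CISCO Systems Inc Earnings Release</td><td>$ 0.38&#160;</td>
      <td>n/a&#160;</td><td>$ 0.43&#160;</td><td><nobr>10-Aug-11</nobr></td></tr>
  <tr><td colspan="7"><img src="spacer.gif"/></td></tr>
</table>
"""


def rows_with_n_cells(root, n=7):
    """Return the cell text of every <tr> that has exactly n direct <td> children."""
    rows = []
    for tr in root.iter("tr"):
        cells = tr.findall("td")  # direct children only, mirroring count(td)
        if len(cells) == n:
            # Join nested text (e.g. inside <nobr>) and turn &#160; into a plain space.
            rows.append(
                ["".join(td.itertext()).replace("\u00a0", " ").strip() for td in cells]
            )
    return rows


root = ET.fromstring(SAMPLE)
rows = rows_with_n_cells(root)
for row in rows:
    print(row)
```

The spacer row is dropped because it has a single <td>, while the header row is kept because it also has seven. If the header is unwanted, one option is to skip the first match, or to extend the XPath with a second predicate such as //tr[count(td)=7][not(td/@class='dataHdrText02')], which relies on the header cells keeping that class. In HTML Agility Pack, the expression can be passed as a string to DocumentNode.SelectNodes, which returns null when nothing matches.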