Tags: html, xml, xpath, webharvest

Extract data from HTML/XML


I'm using Webharvest to retrieve data from websites. It converts HTML pages to XML documents and then extracts the data I want based on the XPath I provide.

Now I'm working on a page like this one: pastebin, where I have marked the blocks I'd like to get. Each block should be returned as a single unit.

The XPath of the first element of a block is: //div[@id="layer22"]/b/span[@style="background-color: #FFFF99"] I tested it, and it returns all the "block start" elements.

The XPath of the last element of a block is: //div[@id="layer22"]/a[contains(.,"Join")] I tested it, and it returns all the "block end" elements.

The final XPath should return the set of blocks, so that:

(xPath)[1] = block 1

(xPath)[2] = block 2

....

Thank you in advance


Solution

  • Use (for the first wanted result):

      ($first)[1] | ($last)[1]
      |
      ($first)[1]/following::node()
           [count(.|($last)[1]/preceding::node()) = count(($last)[1]/preceding::node())]
    

    where you need to substitute $first with:

    //div[@id="layer22"]/b/span[@style="background-color: #FFFF99"]
    

    and substitute $last with:

    //div[@id="layer22"]/a[contains(.,"Join")] 
    

    To get the k-th result, replace ($first)[1] with ($first)[{k}] and ($last)[1] with ($last)[{k}] in the final expression, where {k} stands for the number k.
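The substitution described above is mechanical, so it can be scripted. The following is only a sketch: the helper name `block_xpath` is hypothetical and not part of Webharvest, and it does nothing more than assemble the expression string for the k-th block from the two XPaths given in the question.

```python
def block_xpath(first: str, last: str, k: int) -> str:
    """Build the XPath 1.0 expression that selects the k-th block (1-based).

    `first` and `last` are the XPaths of the block-start and block-end
    elements; the helper name and signature are illustrative assumptions.
    """
    if k < 1:
        raise ValueError("XPath positions are 1-based, so k must be >= 1")
    start = f"({first})[{k}]"
    end = f"({last})[{k}]"
    # All nodes that precede the k-th block end, in document order.
    before_end = f"{end}/preceding::node()"
    # Nodes after the k-th start that also precede the k-th end
    # (the count-based intersection test).
    between = (
        f"{start}/following::node()"
        f"[count(.|{before_end}) = count({before_end})]"
    )
    return f"{start} | {end} | {between}"


FIRST = '//div[@id="layer22"]/b/span[@style="background-color: #FFFF99"]'
LAST = '//div[@id="layer22"]/a[contains(.,"Join")]'

# Expression for the second block; pass the string to an XPath 1.0 engine.
print(block_xpath(FIRST, LAST, 2))
```

The resulting string is plain XPath 1.0, so it should be usable wherever the hand-written expression is.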

    This technique follows directly from the well-known Kayessian formula for set intersection in XPath 1.0:

    $ns1[count(.|$ns2) = count($ns2)]
    

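    which selects the intersection of the two node-sets $ns1 and $ns2: a node of $ns1 is kept only if adding it to $ns2 does not increase the size of $ns2, that is, only if it already belongs to $ns2. In the expression above, $ns1 is the set of nodes following the block start and $ns2 is the set of nodes preceding the block end.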

    Here is an XSLT verification with a simple example. Given this XML document:

    <nums>
      <num>01</num>
      <num>02</num>
      <num>03</num>
      <num>04</num>
      <num>05</num>
      <num>06</num>
      <num>07</num>
      <num>03</num>
      <num>07</num>
      <num>10</num>
    </nums>
    

    This transformation:

    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output omit-xml-declaration="yes" indent="yes"/>
    
     <xsl:variable name="v1" select=
      "(//num[. = 3])[1]/following-sibling::*"/>
     <xsl:variable name="v2" select=
      "(//num[. = 7])[1]/preceding-sibling::*"/>
    
     <xsl:template match="/">
      <xsl:copy-of select=
      "$v1[count(.|$v2) = count($v2)]"/>
     </xsl:template>
    </xsl:stylesheet>
    

    evaluates the XPath expression against the XML document above and copies the selected nodes to the output:

    <num>04</num>
    <num>05</num>
    <num>06</num>
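
The same check can be reproduced outside XSLT. The sketch below uses Python's standard `xml.etree.ElementTree`, whose XPath subset has no `count()` function or `|` union, so the sibling sets are built with list slicing and the count test is emulated in plain Python. The function name `intersect` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<nums>
  <num>01</num>
  <num>02</num>
  <num>03</num>
  <num>04</num>
  <num>05</num>
  <num>06</num>
  <num>07</num>
  <num>03</num>
  <num>07</num>
  <num>10</num>
</nums>"""


def intersect(ns1, ns2):
    """Emulate $ns1[count(.|$ns2) = count($ns2)] on lists of elements.

    A node of ns1 is kept only if uniting it with ns2 leaves the size of
    ns2 unchanged, i.e. only if the node is already a member of ns2.
    """
    ids2 = {id(n) for n in ns2}
    return [n for n in ns1 if len(ids2 | {id(n)}) == len(ids2)]


root = ET.fromstring(XML)
nums = list(root)  # the <num> children, in document order

# Positions of the first <num> equal to 3 and the first equal to 7,
# compared numerically as XPath does for [. = 3] and [. = 7].
first_3 = next(i for i, n in enumerate(nums) if float(n.text) == 3)
first_7 = next(i for i, n in enumerate(nums) if float(n.text) == 7)

v1 = nums[first_3 + 1:]  # stands in for following-sibling::*
v2 = nums[:first_7]      # stands in for preceding-sibling::*

selected = [n.text for n in intersect(v1, v2)]
print(selected)  # ['04', '05', '06']
```

Node identity (`id`) is used rather than text content, because XPath node-sets are sets of nodes, and two `<num>` elements with the same text (such as the two `03` entries) are still distinct nodes.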