php web-scraping simple-html-dom simpledom

PHP Simple HTML DOM Parser: how to get contents of the parent div containing <h1> tags?


I'm scraping a number of different news sites with PHP Simple HTML DOM, with the aim of extracting the main content/body text of each page.

The best approach I could come up with is to find the main headline (the H1) and get the text contained in the same div as that header tag.

How would I go about getting the contents of the whole (parent?) div in each of the examples below?

<div>  <----- need to get contents of this whole div (containing the h1 and likely the main body of text)
  <h1></h1>
  main body of text here
</div>

The div may be further up the tree.

<div> <----- need to get contents of this whole div
  <div>   
    <h1></h1>
  </div>

  <div>
    main body of text here
  </div>
</div>

The div may be even further up the tree.

<div> <----- need to get contents of this whole div
  <div>

    <div>   
      <h1></h1>
    </div>

    <div>
      main body of text here
    </div>

  </div>
</div>

Then I could compare the size of each candidate div and determine which one is the main div.


Solution

  • Assuming $e contains the H1 element that you selected, you can call $e->parent() to grab its parent element. To reach a div further up the tree, call parent() again on the result, e.g. $e->parent()->parent().

    See "How to traverse the DOM tree?" on the "Traverse the DOM tree" tab of the manual: http://simplehtmldom.sourceforge.net/manual.htm
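
    The idea of walking up from the H1 and comparing the size of each enclosing div could be sketched roughly as follows. This is only a minimal sketch, not a tested scraper: the helper name ancestor_divs, the sample markup, and the location of simple_html_dom.php are assumptions made for illustration. It relies on parent(), the tag property, find() and plaintext from Simple HTML DOM.

```php
<?php
/**
 * Collect every <div> ancestor of $node, nearest first.
 * Hypothetical helper: it only needs parent() and the tag property.
 */
function ancestor_divs($node)
{
    $divs = [];
    // parent() returns null once we step past the top of the tree.
    for ($p = $node->parent(); $p !== null; $p = $p->parent()) {
        if ($p->tag === 'div') {
            $divs[] = $p;
        }
    }
    return $divs;
}

// Demo: runs only if the library is found at this (assumed) path.
$lib = __DIR__ . '/simple_html_dom.php';
if (is_file($lib)) {
    require_once $lib;

    // Sample markup modelled on the second example in the question.
    $html = str_get_html(
        '<div><div><h1>Headline</h1></div><div>main body of text here</div></div>'
    );

    $h1 = $html->find('h1', 0); // first <h1>, or null if there is none
    if ($h1 !== null) {
        foreach (ancestor_divs($h1) as $depth => $div) {
            // plaintext is the div's content with the tags stripped;
            // its length is one possible measure of "size".
            echo 'ancestor ', $depth + 1, ': ',
                strlen(trim($div->plaintext)), " characters\n";
        }
    }
}
```

    Each returned element also exposes innertext and outertext if you want the markup rather than the stripped text. Which size measure best identifies the main div will depend on the sites being scraped.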