I'm scraping a number of different news sites (using PHP Simple HTML DOM) with the aim of getting the main content/body text of each page.
The best approach I could come up with is to find the main headline (H1) and take the text contained in the same div as that header tag.
How would I go about getting the contents of the whole (parent?) div in each of the examples below?
<div> <----- need to get contents of this whole div (containing the h1 and likely the main body of text)
<h1></h1>
main body of text here
</div>
The div may be further up the tree:
<div> <----- need to get contents of this whole div
<div>
<h1></h1>
</div>
<div>
main body of text here
</div>
</div>
The div may be even further up the tree:
<div> <----- need to get contents of this whole div
<div>
<div>
<h1></h1>
</div>
<div>
main body of text here
</div>
</div>
</div>
Then I could compare the size of each candidate div and determine which one is the main div.
Assuming $e contains the H1 element that you selected, you can call $e->parent() to grab its parent element. Calling parent() again on the result moves one level further up the tree.
Look under "How to traverse the DOM tree?" on the "Traverse the DOM tree" tab. http://simplehtmldom.sourceforge.net/manual.htm
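As a rough sketch of how that could look for the nested cases in the question, the snippet below walks up from the H1 with parent() and collects every div ancestor, so their text lengths can be compared. It assumes simple_html_dom.php sits next to the script; div_ancestors() and the sample markup are made up for illustration and are not part of the library.

```php
<?php
// Assumption: simple_html_dom.php (PHP Simple HTML DOM) is in the same directory.
require_once 'simple_html_dom.php';

/**
 * Hypothetical helper (not part of the library): collect every <div>
 * ancestor of $e, nearest first, by calling parent() repeatedly.
 */
function div_ancestors($e)
{
    $divs = array();
    // parent() returns null once we step past the root node.
    for ($node = $e->parent(); $node !== null; $node = $node->parent()) {
        if ($node->tag === 'div') {
            $divs[] = $node;
        }
    }
    return $divs;
}

// Sample markup mirroring the third example in the question.
$markup = '<div><div><div><h1>Headline</h1></div>'
        . '<div>main body of text here</div></div></div>';

$html = str_get_html($markup);
$h1   = $html->find('h1', 0); // first H1 on the page

// Print how much text each candidate div holds, nearest ancestor first.
foreach (div_ancestors($h1) as $depth => $div) {
    echo $depth, ': ', strlen($div->plaintext), " characters of text\n";
}
```

Which ancestor counts as the "main" div is still a judgment call. One possible heuristic is to take the nearest ancestor whose text is much longer than the headline itself, since the outermost div will usually include navigation and other page chrome.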