php database web-scraping dataset

How to extract content or scrape data sets from a website's source page


I would like to know how to scrape content from a website's source code using PHP. I have tried http://simplehtmldom.sourceforge.net/ and have also read "How do you parse and process HTML/XML in PHP?", but I am still having a hard time getting the information out of the source. As you can see, the main page contains a list of author links, each including the author's years and the number of books written.

<div id="fleft">
    <ul>
    <li><a href="http://www.books.com/john-smith/index.html">John Smith (2011-2012)</a> : 11 books
    <li><a href="http://www.books.com/bobby-bob/index.html">Bobby Bob (2011-2012)</a> : 89 books
    ....
    </ul>
    </div>

When I click on John Smith, it opens the list of books that John Smith wrote.

 <h1>John Smith (11 Books)</h1>
    <div id="fleft">
    
    <ul>
    <li><a href="http://www.books.com/john-smith/best-book.html">Best Book</a>
    <li><a href="http://www.books.com/john-smith/other-best-book.html">Other Best Book</a>
....
    </ul>
    </div>

When I click on one of the books, e.g. "Best Book", it shows the title of the book, the author, and the whole story of the book.

<div id="bookbox">
<h1>Book : Best Book</h1>

<h2>Aurther : John Smith</h2>
<pre>
story of the best book......
.......
....
the end
</pre>

I would like to be able to grab all the authors' names and their years, the list of books for each author, and the content of each book, as a dataset. I want to create a database holding each author's name, the years of their life, the books they wrote, the book titles, categories, book content, etc.


Solution

  • You should mention which approach you are using to get the HTML of the target page. I will assume that you already have the HTML of the target page in the $targetHTML variable.

    You can load it into a DOM like this:

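    /*********** Load into DOM *********/
    // Collect libxml parse errors instead of emitting PHP warnings on malformed HTML.
    libxml_use_internal_errors(true);
    $html = new DOMDocument;
    $html->loadHTML($targetHTML);
    libxml_clear_errors();
    $xPath = new DOMXPath($html);
    /*********** Load into DOM *********/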
    

    You can then use XPath to fetch the data you want from the HTML loaded into the DOM.
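As a rough sketch of what such XPath queries might look like for the author index page in the question: the snippet below hard-codes the sample markup in $targetHTML (in practice it would hold the fetched page), and the array keys name, from, to, url and books are names chosen here for illustration, not anything the site defines. It assumes every link text has the form "Name (YYYY-YYYY)" and every list item ends with ": N books", as in the sample.

```php
<?php
// Sample markup copied from the question; a real script would fetch the page instead.
$targetHTML = <<<'HTML'
<div id="fleft">
    <ul>
    <li><a href="http://www.books.com/john-smith/index.html">John Smith (2011-2012)</a> : 11 books
    <li><a href="http://www.books.com/bobby-bob/index.html">Bobby Bob (2011-2012)</a> : 89 books
    </ul>
</div>
HTML;

// Collect parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$html = new DOMDocument;
$html->loadHTML($targetHTML);
libxml_clear_errors();
$xPath = new DOMXPath($html);

$authors = [];
foreach ($xPath->query('//div[@id="fleft"]//li') as $li) {
    // Each list item is expected to hold one author link as a direct child.
    $link = $xPath->query('./a', $li)->item(0);
    if ($link === null) {
        continue;
    }
    // Assumed link text format: "John Smith (2011-2012)".
    if (!preg_match('/^(.+?)\s*\((\d{4})-(\d{4})\)$/', trim($link->textContent), $m)) {
        continue;
    }
    // Assumed trailing text format: ": 11 books".
    preg_match('/:\s*(\d+)\s+books/i', $li->textContent, $c);
    $authors[] = [
        'name'  => $m[1],
        'from'  => (int) $m[2],
        'to'    => (int) $m[3],
        'url'   => $link->getAttribute('href'),
        'books' => isset($c[1]) ? (int) $c[1] : null,
    ];
}

print_r($authors);
```

Each entry's url could then be fetched and parsed the same way to get that author's book list.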

    If you are already using this approach, post your code so we can find the problem.
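The individual book pages from the question could probably be handled with the same DOMDocument and DOMXPath approach. The sketch below is only an illustration under a few assumptions: $bookHTML holds the sample markup (with a closing </div> added here, since the question's snippet is cut off), the headings always use the "Label : value" form shown in the sample, and valueAfterLabel() is a hypothetical helper written for this example.

```php
<?php
// Sample book page from the question; the closing </div> is added for this example.
$bookHTML = <<<'HTML'
<div id="bookbox">
<h1>Book : Best Book</h1>

<h2>Aurther : John Smith</h2>
<pre>
story of the best book......
.......
....
the end
</pre>
</div>
HTML;

// Collect parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($bookHTML);
libxml_clear_errors();
$xPath = new DOMXPath($doc);

// Hypothetical helper: return the text after the first colon ("Book : Best Book" gives "Best Book").
function valueAfterLabel(string $text): string
{
    $parts = explode(':', $text, 2);
    return trim($parts[1] ?? $parts[0]);
}

$box = '//div[@id="bookbox"]';
$book = [
    // string(...) returns the text of the first matching node, or '' if none matches.
    'title'   => valueAfterLabel($xPath->evaluate("string($box/h1)")),
    'author'  => valueAfterLabel($xPath->evaluate("string($box/h2)")),
    'content' => trim($xPath->evaluate("string($box/pre)")),
];

print_r($book);
```

The resulting array maps fairly directly onto a database row, so the values could be passed to a prepared INSERT statement once the table layout is decided.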
