php web-scraping rss

Scraping a non-RSS page to generate a feed


I want to scrape a page that regularly updates (adding new articles with exactly the same structure as previous ones) in order to generate an RSS feed.

I can write the code to analyse the page easily, but how do I emulate a ping, i.e. when the page updates, how can my PHP script know? Does it have to be a cron job?

(Probably a duplicate question I know, but searched for a direct answer with no luck. Closest I got was Scrape and generate RSS feed, which has a scraping script but no info on how to get it to respond to changes on the page automatically)


Solution

  • Depending on the system it may or may not be easy to tell when the page was updated last.

    To check for changes, you can look at the Last-Modified header in the page's HTTP response. Not all systems update that header properly, so it may not be useful. An unmodified page may also return a 304 (Not Modified) status, particularly if you send an If-Modified-Since header with your request.

    I would definitely run something like this as a cron job. While it might be possible to do the check from the headers when a user requests the feed, whenever the page has changed your user will be waiting a long time (in relative terms) for your server to go out, get the page, do the processing, and send the response. I would be surprised if you didn't run into timeouts from time to time with a non-cron-based approach.
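As a rough sketch of the approach above (not a definitive implementation), a cron-run script could send If-Modified-Since with the value it saw last time and only rebuild the feed when the page really changed. Everything named here is an assumption for illustration: the function names, the JSON state file, and the use of the cURL extension. The body-hash comparison is an extra fallback added for servers that never answer 304 or do not maintain Last-Modified.

```php
<?php
// Sketch only: function names and the state-file layout are hypothetical.

/** Build the conditional request header from what the previous run stored. */
function buildConditionalHeaders(array $state): array
{
    return empty($state['last_modified'])
        ? []
        : ['If-Modified-Since: ' . $state['last_modified']];
}

/**
 * Decide whether the feed needs regenerating.
 * A 304 (or any non-200 response) is treated as "no change"; on a 200 the
 * body hash is compared, since some servers ignore If-Modified-Since.
 */
function pageChanged(int $status, string $body, array $state): bool
{
    if ($status !== 200) {
        return false;
    }
    return sha1($body) !== ($state['hash'] ?? null);
}

/** Fetch the page with cURL and capture the Last-Modified response header. */
function fetchPage(string $url, array $requestHeaders): array
{
    $lastModified = null;
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_HTTPHEADER     => $requestHeaders,
        CURLOPT_HEADERFUNCTION => function ($ch, string $line) use (&$lastModified): int {
            if (stripos($line, 'Last-Modified:') === 0) {
                $lastModified = trim(substr($line, strlen('Last-Modified:')));
            }
            return strlen($line); // cURL expects the number of bytes handled
        },
    ]);
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);

    return [
        'status'        => $status,
        'body'          => $body === false ? '' : $body,
        'last_modified' => $lastModified,
    ];
}

/**
 * Returns true when the page changed since the last run.
 * $fetch can be swapped out (e.g. for testing); it defaults to fetchPage().
 */
function checkForUpdate(string $url, string $stateFile, ?callable $fetch = null): bool
{
    $fetch = $fetch ?? 'fetchPage';
    $state = is_file($stateFile)
        ? (json_decode((string) file_get_contents($stateFile), true) ?: [])
        : [];

    $response = $fetch($url, buildConditionalHeaders($state));
    if (!pageChanged($response['status'], $response['body'], $state)) {
        return false;
    }

    // Remember what we saw so the next run can send If-Modified-Since.
    file_put_contents($stateFile, json_encode([
        'last_modified' => $response['last_modified'],
        'hash'          => sha1($response['body']),
    ]));
    return true;
}
```

A script that calls `checkForUpdate()` and, when it returns true, runs your existing scraping code and writes the RSS file to disk could then be scheduled with a crontab entry such as `*/15 * * * * php /path/to/check_feed.php` (the path and the 15-minute interval are placeholders). Visitors are then served the pre-built feed file and never wait on the scrape.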