
Extract RSS Feed url from


I have 100 websites that expose RSS feeds in different locations. Each of these locations lists several RSS feed links pointing at different feeds. It's nearly identical to the BBC RSS feeds page http://www.bbc.com/news/10628494

Site 1: domain1.com/rss
Site 2: domain2.com/enviroments/rss

Is there any way to extract the RSS links to each feed's XML?

Something similar to "Automatically Extracting feed links (atom, rss, etc) from webpages", but I would like to give only the site, so that I get all possible RSS feeds for that particular site.

I want a list of all the RSS feeds from the 100 websites so that I can monitor them on a dashboard. Oh, and the feeds are a mix of both Atom and RSS.

What I have done: I have looked into Apache Nutch and the parse-feed plugin. Scrapy was the next option, but I am still not sure this is what I am looking for.


Solution

  • In general, a website that offers RSS feed(s) indicates so in the header of at least the home page, and some do so on every single page.

    Here is an example of such an RSS feed link:

    <link href="http://snapwebsites.org/rss.xml"
          title="Snap! A C++ Open Source CMS RSS"
          type="application/rss+xml"
          rel="alternate">
    

    Note that the type varies slightly between websites. For example, some websites use text instead of application (which is wrong, but XML is text...). There is also application/atom+xml. A site may also offer both formats.
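A minimal sketch of that header check, using only the Python standard library. The names `FeedLinkParser` and `find_feed_links` are made up for this illustration, and the `text/...` entries are an assumption covering the "wrong" variants mentioned above. It takes the HTML of a page you have already downloaded and returns the feed links it declares.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds. The text/* variants are assumed here to
# cover sites that use "text" instead of "application".
FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "text/rss+xml",
    "text/atom+xml",
}


class FeedLinkParser(HTMLParser):
    """Collects <link rel="alternate"> tags whose type looks like a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute names arrive lowercased; valueless attributes are None.
        a = {name: (value or "") for name, value in attrs}
        rels = a.get("rel", "").lower().split()
        mime = a.get("type", "").strip().lower()
        if "alternate" in rels and mime in FEED_TYPES and a.get("href"):
            # Resolve relative hrefs against the page URL.
            url = urljoin(self.base_url, a["href"])
            self.feeds.append((url, mime, a.get("title", "")))


def find_feed_links(html, base_url):
    """Return a list of (url, mime_type, title) tuples found in the HTML."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Running `find_feed_links` over the home page of each of the 100 sites would be the cheap first pass; only sites that return an empty list need the slower checks.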

    If that's not available, then you'd have to check the home page or other pages for anchor links to a feed, and then look at the root element of the document each link points to:

    1. 'rss' -- RSS format (version is an attribute)
    2. 'feed' -- Atom format
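Reading the two names above as the root elements of the feed documents (an interpretation on my part), the check could be sketched as follows. The function name `feed_format` is hypothetical, and the input is the body of a document you have already fetched.

```python
import xml.etree.ElementTree as ET


def feed_format(document):
    """Return 'rss <version>', 'atom', or None if this is not a feed."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        # Not well-formed XML (most HTML pages end up here).
        return None
    # Drop a namespace prefix such as {http://www.w3.org/2005/Atom}.
    name = root.tag.rsplit("}", 1)[-1].lower()
    if name == "rss":
        # The RSS version is carried as an attribute of the root element.
        return "rss " + root.get("version", "unknown")
    if name == "feed":
        return "atom"
    return None
```

For example, `feed_format('<rss version="2.0"><channel/></rss>')` gives `'rss 2.0'`, while an HTML page gives `None`.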

    I have an example on the following page that includes the <link ...> tag in the header:

    http://snapwebsites.org/implementation/feature-requirements/feed-feature-core-atom-rss-20-etc

    I have to say, without that link, it will be quite a bit harder to find the RSS feeds. That being said, on many websites the feed files use an extension (.rss, .atom, .xml), and that could be used to simplify the search. Yet, more and more, feeds look like directory names: .../blah or .../foo could be a standard HTML page or a feed. The only way to tell is to read the file at the destination and check its format. The Content-Type of the HTTP reply should be application/rss+xml or application/atom+xml too, like the type=... attribute of the header link.
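The two cheap hints from this paragraph (file extension and Content-Type) could be sketched like this. Both helper names are invented for the example. Note that in practice some servers reply with a generic XML type, so a miss here is inconclusive and the file format still has to be checked.

```python
import posixpath
from urllib.parse import urlparse

FEED_EXTENSIONS = {".rss", ".atom", ".xml"}
FEED_CONTENT_TYPES = {"application/rss+xml", "application/atom+xml"}


def has_feed_extension(url):
    """True if the URL path ends in one of the usual feed extensions."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lower() in FEED_EXTENSIONS


def is_feed_content_type(header_value):
    """True if a Content-Type header value names an RSS or Atom type."""
    # Ignore parameters such as "; charset=utf-8".
    mime = header_value.split(";", 1)[0].strip().lower()
    return mime in FEED_CONTENT_TYPES
```

A URL such as domain1.com/rss fails the extension test, which is exactly the directory-style case where only the reply itself can settle the question.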


    As a side note, although very unlikely (I've not really seen it on a live website), it is possible to use the Link: ... HTTP header to indicate links, just the same as the <link ...> tag found in the HTML header. If you have access to the HTTP headers (here is how to do it in PHP), then it's worth looking at those headers to see whether one of them points to an RSS feed.
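For completeness, a rough sketch of reading feed links out of a Link: header value. This is a simplified parser written for illustration, not a full implementation of the header grammar, and `feeds_from_link_header` is a hypothetical name. It assumes the value has the usual shape `<url>; rel="alternate"; type="application/rss+xml"`, with several links separated by commas.

```python
import re

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

# One link: <url> followed by any number of ";name=value" parameters.
_LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[^;,="\s]+(?:\s*=\s*(?:"[^"]*"|[^;,\s]*))?)*)'
)
# One parameter: the value is either quoted (group 2) or bare (group 3).
_PARAM_RE = re.compile(r';\s*([^;,="\s]+)(?:\s*=\s*(?:"([^"]*)"|([^;,\s]*)))?')


def feeds_from_link_header(header_value):
    """Return the URLs of rel=alternate links that carry a feed type."""
    feeds = []
    for link in _LINK_RE.finditer(header_value):
        url, raw_params = link.group(1), link.group(2)
        params = {}
        for p in _PARAM_RE.finditer(raw_params):
            value = p.group(2) if p.group(2) is not None else (p.group(3) or "")
            # Keep the first occurrence of a repeated parameter.
            params.setdefault(p.group(1).lower(), value)
        rels = params.get("rel", "").lower().split()
        if "alternate" in rels and params.get("type", "").lower() in FEED_TYPES:
            feeds.append(url)
    return feeds
```

Since this header is rare on live sites, it makes sense to run it last, on the headers of a reply you already have, and not to issue extra requests for it.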