phprssfeedparseratom-feed

How to detect if a page is an RSS or ATOM feed


I'm currently building a new online Feed Reader in PHP. One of the features I'm working on is feed auto-discovery. If a user enters a website URL, the script will detect that its not a feed and look for the real feed URL by parsing the HTML for the proper <link> tag.

The problem is, the way I'm currently detecting if the URL is a feed or a website only works part of the time, and I know it can't be the best solution. Right now I'm taking the CURL response and running it through simplexml_load_string, if it can't parse it I treat it as a website. Here is the code.

$xml = @simplexml_load_string( $site_found['content'] );

if( !$xml ) // this is a website, not a feed
{
    // handle website
}
else
{
    // parse feed
}

Obviously, this isn't ideal. Also, when it runs into an HTML website that it can parse, it thinks its a feed.

Any suggestions on a good way of detecting the difference between a feed or non-feed in PHP?


Solution

  • I would sniff for the various unique identifiers those formats have:

    Atom: Source

    <?xml version="1.0" encoding="utf-8"?>
    <feed xmlns="http://www.w3.org/2005/Atom">
    

    RSS 0.90: Source

    <rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://my.netscape.com/rdf/simple/0.9/">
    

    Netscape RSS 0.91

    <rss version="0.91">
    

    etc. etc. (See the 2nd source link for a full overview).

    As far as I can see, separating Atom and RSS should be pretty easy by looking for <feed> and <rss> tags, respectively. Plus you won't find those in a valid HTML document.

    You could make an initial check to tell HTML and feeds apart by looking for <html> and <body> elements first. To avoid problems with invalid input, this may be a case where using regular expressions (over a parser) is finally justified for once :)

    If it doesn't match the HTML test, run the Atom / RSS tests on it. If it is not recognized as a feed, or the XML parser chokes on invalid input, fall back to HTML again.

    what that looks like in the wild - whether feed providers always conform to those rules - is a different question, but you should already be able to recognize a lot this way.