I'm currently building a new online Feed Reader in PHP. One of the features I'm working on is feed auto-discovery. If a user enters a website URL, the script will detect that its not a feed and look for the real feed URL by parsing the HTML for the proper <link>
tag.
The problem is, the way I'm currently detecting if the URL is a feed or a website only works part of the time, and I know it can't be the best solution. Right now I'm taking the CURL response and running it through simplexml_load_string
, if it can't parse it I treat it as a website. Here is the code.
$xml = @simplexml_load_string( $site_found['content'] );
if( !$xml ) // this is a website, not a feed
{
// handle website
}
else
{
// parse feed
}
Obviously, this isn't ideal. Also, when it runs into an HTML website that it can parse, it thinks its a feed.
Any suggestions on a good way of detecting the difference between a feed or non-feed in PHP?
I would sniff for the various unique identifiers those formats have:
Atom: Source
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
RSS 0.90: Source
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/">
Netscape RSS 0.91
<rss version="0.91">
etc. etc. (See the 2nd source link for a full overview).
As far as I can see, separating Atom and RSS should be pretty easy by looking for <feed>
and <rss>
tags, respectively. Plus you won't find those in a valid HTML document.
You could make an initial check to tell HTML and feeds apart by looking for <html>
and <body>
elements first. To avoid problems with invalid input, this may be a case where using regular expressions (over a parser) is finally justified for once :)
If it doesn't match the HTML test, run the Atom / RSS tests on it. If it is not recognized as a feed, or the XML parser chokes on invalid input, fall back to HTML again.
what that looks like in the wild - whether feed providers always conform to those rules - is a different question, but you should already be able to recognize a lot this way.