robots.txt

Make a PHP Web Crawler Respect the robots.txt File of Any Website


I have developed a web crawler, and now I want it to respect the robots.txt files of the websites that I am crawling.

I see that a robots.txt file has this structure:

User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html

I can read it line by line and then use explode() with the space character as a delimiter to find the data.

Is there any other way to load the entire data?

Does this kind of file have a query language, like XPath?

Or do I have to interpret the entire file myself?

Any help is welcome, even links or duplicates if found.


Solution

  • The structure is very simple, so the best thing you can do is probably parse the file on your own. I would read it line by line and, as you said, look for keywords like User-agent, Disallow, etc. Note that the field delimiter is actually a colon (Field: value), not a space, and lines beginning with # are comments that should be skipped.
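A line-by-line parser along those lines could look like the sketch below. It is a minimal illustration, not a full implementation of the robots.txt standard: the function names are hypothetical, matching is done by simple first-match path prefix (no wildcard or longest-match support), and only User-agent, Disallow, and Allow fields are handled.

```php
<?php
// Parse robots.txt content into an array of rules keyed by user agent.
// Minimal sketch: handles only User-agent / Disallow / Allow fields.
function parseRobotsTxt(string $content): array
{
    $rules = [];
    $currentAgents = [];
    $lastWasAgent = false;

    foreach (preg_split('/\R/', $content) as $line) {
        // Strip comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            $lastWasAgent = false;
            continue;
        }

        // Split on the first colon: "Field: value".
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // Consecutive User-agent lines form a single group.
            if (!$lastWasAgent) {
                $currentAgents = [];
            }
            $currentAgents[] = strtolower($value);
            foreach ($currentAgents as $agent) {
                $rules[$agent] = $rules[$agent] ?? [];
            }
            $lastWasAgent = true;
        } elseif ($field === 'disallow' || $field === 'allow') {
            foreach ($currentAgents as $agent) {
                $rules[$agent][] = [$field, $value];
            }
            $lastWasAgent = false;
        }
    }
    return $rules;
}

// Check whether $path may be fetched by $userAgent.
// Assumption for this sketch: first matching prefix rule wins,
// and no rule at all means the path is allowed.
function isAllowed(array $rules, string $userAgent, string $path): bool
{
    $group = $rules[strtolower($userAgent)] ?? $rules['*'] ?? [];
    foreach ($group as [$field, $rulePath]) {
        if ($rulePath !== '' && strpos($path, $rulePath) === 0) {
            return $field === 'allow';
        }
    }
    return true;
}
```

With the example file from the question, `isAllowed($rules, 'MyCrawler', '/~joe/junk.html')` would return false and `isAllowed($rules, 'MyCrawler', '/index.html')` would return true, since the latter matches no Disallow rule.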