I have developed a web crawler, and now I want to respect the robots.txt files of the websites that I am crawling.
I see that this is the structure of a robots.txt file:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
I can read it line by line and then use explode() with the space character as a delimiter to find the data.
Is there any other way to load the entire file?
Do these kinds of files have a query language, like XPath does?
Or do I have to interpret the entire file myself?
Any help is welcome, even links or duplicates if found.
The structure is very simple, so the best thing you can do is probably parse the file on your own. I would read it line by line and, as you said, look for keywords like User-agent, Disallow, etc.
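Here is a minimal sketch of that approach in PHP (since you mentioned explode()). The function names and the simple prefix-matching check are my own choices, and it deliberately ignores Allow rules, wildcards, and group-precedence rules that a complete parser would handle. Note that the delimiter on each line is a colon, not a space.

```php
<?php
// Sketch: collect the Disallow rules that apply to a given user agent
// (or to "*") and check a path against them before crawling it.

function parseRobotsTxt(string $content, string $userAgent): array
{
    $disallowed  = [];
    $appliesToUs = false;

    foreach (preg_split('/\r\n|\r|\n/', $content) as $line) {
        // Strip comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }

        // Each record is "field: value"; split on the first colon only.
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            $appliesToUs = ($value === '*' || stripos($userAgent, $value) !== false);
        } elseif ($field === 'disallow' && $appliesToUs && $value !== '') {
            $disallowed[] = $value;
        }
    }

    return $disallowed;
}

function isPathAllowed(string $path, array $disallowed): bool
{
    foreach ($disallowed as $rule) {
        if (strpos($path, $rule) === 0) {   // simple prefix match
            return false;
        }
    }
    return true;
}

// Usage: fetch the file and test a URL path before crawling it.
$content    = @file_get_contents('http://example.com/robots.txt');
$disallowed = parseRobotsTxt($content !== false ? $content : '', 'MyCrawler');
var_dump(isPathAllowed('/~joe/junk.html', $disallowed));  // bool(false)
```

There is no query language for robots.txt the way XPath exists for XML; the format is just key-value lines, so a small line-by-line parser like the sketch above is usually all you need.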