web-scraping, scrapy, scrapinghub

Want to understand Robots.txt


I would like to scrape a website. However, I want to make sense of its robots.txt before I do. The lines that I don't understand are:

User-agent: *
Disallow: /*/*/*/*/*/*/*/*/
Disallow: /*?&*&*
Disallow: /*?*&*
Disallow: /*|*

Does the User-agent line mean access is OK anywhere? The Disallow lines are the ones I am mainly concerned about. Does the first one mean don't access anything eight layers deep, or don't access anything at all?


Solution

  • I believe you can interpret the rules in a robots.txt file much like simple wildcard patterns (similar to regex): the star can usually be read as anything/everything, i.e. it matches any sequence of characters.

    The line User-agent: * does not mean you are allowed to scrape everything; it simply means that the rules which follow apply to all user agents. Here are examples of user-agent strings

    # Chrome Browser
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36
    # Python requests default
    python-requests/2.19.1
    

    which must comply with the same rules, that is:

      Disallow: /*/*/*/*/*/*/*/*/ blocks any URL whose path goes eight or more levels deep, so it means "don't access anything eight layers deep", not "don't access at all".
      Disallow: /*?&*&* and Disallow: /*?*&* block URLs with an & in the query string after the ?, i.e. URLs that carry more than one query parameter.
      Disallow: /*|* blocks any URL that contains a | character.
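    To make that concrete, here is a minimal sketch in plain Python (no particular robots.txt library) that treats each Disallow rule as a path-prefix pattern in which * matches any sequence of characters; the example paths are made up for illustration:

    import re

    # The Disallow rules quoted in the question
    rules = [
        "/*/*/*/*/*/*/*/*/",
        "/*?&*&*",
        "/*?*&*",
        "/*|*",
    ]

    def rule_to_regex(rule):
        # Escape the literal parts, then turn the robots.txt wildcard "*"
        # into ".*"; rules are anchored at the start of the URL path.
        return re.compile("^" + ".*".join(re.escape(part) for part in rule.split("*")))

    def disallowed(path):
        return any(rule_to_regex(rule).search(path) for rule in rules)

    print(disallowed("/a/b/c/"))                         # False: only three levels deep
    print(disallowed("/a/b/c/d/e/f/g/h/index.html"))     # True: eight or more levels deep
    print(disallowed("/search?q=cats&page=2&sort=new"))  # True: "?" followed by "&"
    print(disallowed("/foo|bar"))                        # True: contains "|"

    This is only meant to show how the wildcards compose; a real parser also handles Allow lines and rule precedence.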

    Finally, here are insightful examples and more on the topic.
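    Since the question is tagged scrapy: Scrapy can apply robots.txt rules for you automatically, and recent versions ship with a wildcard-aware robots.txt parser, so patterns like the ones above are understood. A minimal settings sketch (the bot name and contact URL below are placeholders, not values from the site in question):

    # settings.py
    BOT_NAME = "mybot"                                   # placeholder project/bot name
    USER_AGENT = "mybot (+https://example.com/contact)"  # identify your crawler
    ROBOTSTXT_OBEY = True  # fetch robots.txt and drop requests that the rules disallow

    With ROBOTSTXT_OBEY enabled, requests matching the Disallow patterns above are filtered out before they are ever sent.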