python, web-scraping, scrapy, web-crawler

How to crawl a site given only the domain URL with Scrapy


I am trying to use Scrapy to crawl a website, but there is no sitemap or page index for it. How can I crawl all pages of the website with Scrapy?

I just need to download all the pages of the site without extracting any items. Is it enough to set a Rule in the spider that follows all links? I also don't know whether Scrapy will avoid revisiting duplicate URLs this way.


Solution

  • I just found the answer myself. With the CrawlSpider class, we just need to set allow=() in the SgmlLinkExtractor; a sketch follows the quoted documentation below. As the documentation says:

    allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
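
    Here is a minimal sketch of such a spider. Note that in current Scrapy versions SgmlLinkExtractor has been replaced by scrapy.linkextractors.LinkExtractor, which this example uses; the domain example.com, the spider name, and the parse_page callback are placeholders. Scrapy's built-in duplicate filter (RFPDupeFilter) is enabled by default, so URLs that have already been scheduled are not fetched again.

        import scrapy
        from scrapy.spiders import CrawlSpider, Rule
        from scrapy.linkextractors import LinkExtractor


        class WholeSiteSpider(CrawlSpider):
            """Follow every link on one domain and save each page, extracting no items."""
            name = "whole_site"
            allowed_domains = ["example.com"]        # placeholder domain
            start_urls = ["https://example.com/"]

            # allow=() (the default) matches every link; allowed_domains keeps the
            # crawl on-site. follow=True is needed because setting a callback
            # otherwise disables following.
            rules = (
                Rule(LinkExtractor(allow=()), callback="parse_page", follow=True),
            )

            def parse_page(self, response):
                # No item extraction: just write the raw HTML to disk.
                filename = response.url.replace("://", "_").replace("/", "_") + ".html"
                with open(filename, "wb") as f:
                    f.write(response.body)

    Run it with `scrapy crawl whole_site` from the project directory. The duplicate filtering happens at the scheduler level, so you do not need to track visited URLs yourself.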