Scrapy, only follow internal URLs but extract all links found


I want to get all external links from a given website using Scrapy. With the following code, the spider crawls external links as well:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
  name = 'crawltest'
  allowed_domains = ['someurl.com']
  start_urls = ['http://www.someurl.com/']

  rules = (Rule (LinkExtractor(), callback="parse_obj", follow=True),
  )

  def parse_obj(self,response):
    item = someItem()
    item['url'] = response.url
    return item

What am I missing? Doesn't "allowed_domains" prevent external links from being crawled? If I set "allow_domains" for LinkExtractor, it does not extract the external links. Just to clarify: I want to crawl internal links but extract external links. Any help appreciated!


Solution

  • You can also use a link extractor inside the callback to pull all the links from each page as it is parsed.

    The link extractor filters the links for you. In this example, it denies links on the allowed domain, so only external links are returned.

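    # Current import paths; the scrapy.contrib paths are deprecated since Scrapy 1.0.
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor  # the lxml-based extractor
    from myproject.items import someItem

    class someSpider(CrawlSpider):
        name = 'crawltest'
        allowed_domains = ['someurl.com']
        start_urls = ['http://www.someurl.com/']

        # Follow every link found; allowed_domains is expected to keep the
        # crawl itself on the internal site.
        rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)

        def parse_obj(self, response):
            # Extract the page's links, skipping those on the allowed domains,
            # so that only external links remain.
            extractor = LinkExtractor(deny_domains=self.allowed_domains)
            for link in extractor.extract_links(response):
                item = someItem()
                item['url'] = link.url
                yield item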
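
    To see what "internal" versus "external" means here without running a crawl, the following is a minimal standard-library sketch of a domain check of this kind: a URL counts as internal when its host equals an allowed domain or is a subdomain of it. The helper names `is_internal` and `external_links` are hypothetical and not part of Scrapy, and Scrapy's own matching may differ in details.

```python
from urllib.parse import urlparse


def is_internal(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    # hostname is None for URLs without a network location (e.g. mailto:).
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def external_links(urls, allowed_domains):
    """Keep only the URLs whose host falls outside the allowed domains."""
    return [u for u in urls if not is_internal(u, allowed_domains)]


if __name__ == "__main__":
    # Hypothetical sample of links found on a page.
    found = [
        "http://www.someurl.com/about",
        "http://blog.someurl.com/post/1",
        "http://example.org/page",
        "http://notsomeurl.com/",
    ]
    for url in external_links(found, ["someurl.com"]):
        print(url)
```

    Comparing on a dot boundary (".someurl.com") rather than on a plain substring keeps a host such as notsomeurl.com from being mistaken for an internal one.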