python, web-scraping, scrapy, splash-js-render

Scrapy and Incapsula


I'm trying to use Scrapy with Splash to retrieve data from the website "whoscored.com". Here are my settings:

BOT_NAME = 'scrapy_matchs'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_matchs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 20
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

USER_AGENTS = [
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/57.0.2987.110 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.79 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) '
     'Gecko/20100101 '
     'Firefox/55.0'),  # firefox
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.91 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/62.0.3202.89 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/63.0.3239.108 '
     'Safari/537.36'),  # chrome
]

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'scrapy_matchs.pipelines.ScrapyMatchsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 30
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

SPLASH_URL = 'http://localhost:8050/'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
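
(The spider code isn't shown in the question; a minimal scrapy_splash spider consistent with these settings would look roughly like the following — the spider name, start URL and wait time are placeholders.)

import scrapy
from scrapy_splash import SplashRequest


class MatchsSpider(scrapy.Spider):
    name = 'matchs'

    def start_requests(self):
        # Let Splash render the JavaScript before the response reaches parse().
        yield SplashRequest(
            'https://www.whoscored.com/',  # placeholder start URL
            callback=self.parse,
            args={'wait': 5},
        )

    def parse(self, response):
        # Extraction logic would go here; a blocked request returns the
        # Incapsula challenge page instead of the real content.
        pass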

Before this I was using Splash on its own, and I could request at least 2 or 3 pages before getting blocked by Incapsula. But with Scrapy, I get blocked instantly after the very first request. This is the response I receive:

<html style="height:100%">
 <head>
  <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="initial-scale=1.0" name="viewport"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3" type="text/javascript">
  </script>
 </head>
 <body style="margin:0px;height:100%">
  <iframe frameborder="0" height="100%" id="main-iframe" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?CWUDNSAI=22&amp;xinfo=14-58014137-0%200NNN%20RT%281572446923864%2084%29%20q%280%20-1%20-1%202%29%20r%280%20-1%29%20B17%284%2c200%2c0%29%20U18&amp;incident_id=727001300034907080-167681622137047086&amp;edet=17&amp;cinfo=04000000&amp;rpinfo=0" width="100%">
   Request unsuccessful. Incapsula incident ID: 727001300034907080-167681622137047086
  </iframe>
 </body>
</html>

Why do I get blocked so easily? Should I change my settings?

Thank you in advance.


Solution

  • Is it possible that they had already logged your previous scraping activity, and that Scrapy itself isn't responsible at all?

    USER_AGENT = 'scrapy_matchs (+http://www.yourdomain.com)'
    

    This part also made me think of my own web servers' log files, which have URLs like github.com/masscan in them. If a domain was associated with scraping, or if it contained the phrase scrapy, I wouldn't feel bad about banning it. Definitely follow the robots.txt rules; a bot that doesn't check it makes you look bad ;) and I wouldn't use so many user agents. I also like the idea of capturing the default headers a real browser sends to the site and using those instead of your own (see the header sketch after the list below). If I had a site getting hit with a lot of crawling traffic, I could imagine filtering users based on whether their request headers looked odd or off.

    I suggest you...

    1. nmap scan the site to find out what web server they use.
    2. Install and set it up on your local computer with the most basic settings (turn on all logging parameters; most servers ship with some turned off).
    3. Check that server's log files and compare what your scraping traffic looks like versus your browser connecting to the site.
    4. Then figure out how to make the former look exactly like the latter.
    5. If none of that alleviates the issue, don't use Scrapy; use Selenium with a real user agent, driving the site automatically and running your crawling code on the pages the browser automation fetches (a minimal sketch follows this list).
    6. I would also suggest using a different IP, via a proxy or other means, because it seems your IP could be on a ban list somewhere (see the proxy sketch below).
    7. The AWS free tier would be an easy way of checking the site's security: if they let you connect to the site through an SSH proxy port set up on your computer and tunnelled through the AWS server, then they haven't banned the AWS server you are using, which I assume means they have lax security, because it seems basically every AWS server on Earth scans my Pi daily.
    8. Doing this work at a library, next to a Starbucks, next to a ... with free Wi-Fi and different IP addresses would also be good.
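
    A rough sketch of the header idea above: open the site in a real browser, copy the request headers it actually sends (from the network tab of the developer tools), and mirror them in settings.py together with a single matching User-Agent. The values below are illustrative, not whoscored.com's actual headers:

    # settings.py -- headers copied from a real browser session (values here are examples)
    USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36')

    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',  # what a real Chrome sends, unlike 'none'
        'Upgrade-Insecure-Requests': '1',
        'Connection': 'keep-alive',
    }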
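
    If you end up going the Selenium route from point 5, a minimal sketch (assuming chromedriver is installed; the URL and wait time are placeholders) could look like this:

    # Point 5 sketch: a real browser session, with your crawling code run on page_source.
    import time
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # A real Chrome already sends a genuine user agent; override it only if you need to.
    # options.add_argument('user-agent=Mozilla/5.0 (...)')

    driver = webdriver.Chrome(options=options)
    try:
        driver.get('https://www.whoscored.com/')  # placeholder URL
        time.sleep(5)                             # give the JS challenge time to resolve
        html = driver.page_source                 # feed this into your existing parsing code
        print(len(html))
    finally:
        driver.quit()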
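
    And for the proxy idea in point 6, plain (non-Splash) Scrapy requests can be routed through a proxy per request via the built-in HttpProxyMiddleware; the proxy URL below is a placeholder for whatever proxy you actually use. When the page is rendered by Splash, the proxy instead has to be configured on the Splash side (for example through its proxy argument or proxy profiles).

    import scrapy

    class ProxiedSpider(scrapy.Spider):
        name = 'proxied'

        def start_requests(self):
            yield scrapy.Request(
                'https://www.whoscored.com/',                       # placeholder URL
                callback=self.parse,
                meta={'proxy': 'http://user:pass@proxyhost:8080'},  # placeholder proxy
            )

        def parse(self, response):
            self.logger.info('Got %s from %s', response.status, response.url)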