I use Scrapy and scrapyd and send some custom settings via the API (with the Postman software); a photo of the request is omitted here. For example, I send the value of start_urls through the API and it works correctly.
Now the problem is that the settings I send through the API are not applied to my crawl. For example, I send a CONCURRENT_REQUESTS value, but it is not applied. If I could access self inside the update_settings function the problem would be solved, but update_settings is a classmethod, so referencing self there raises an error.
My code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from kavoush.lxmlhtml import LxmlLinkExtractor as LinkExtractor
from kavoush.items import PageLevelItem

my_settings = {}

class PageSpider(CrawlSpider):
    name = 'github'

    def __init__(self, *args, **kwargs):
        self.start_urls = kwargs.get('host_name')
        self.allowed_domains = [self.start_urls]
        # Store the value received through the API so update_settings can use it.
        my_settings['CONCURRENT_REQUESTS'] = int(kwargs.get('num_con_req'))
        self.logger.info(f'CONCURRENT_REQUESTS? {my_settings}')
        self.rules = (
            Rule(LinkExtractor(allow=(self.start_urls,), deny=(r'\.webp',), unique=True),
                 callback='parse',
                 follow=True),
        )
        super().__init__(*args, **kwargs)

    #custom_settings = {
    #    'CONCURRENT_REQUESTS': 4,
    #}

    @classmethod
    def update_settings(cls, settings):
        # Classmethod: runs before __init__, so my_settings is still empty here.
        cls.custom_settings.update(my_settings)
        settings.setdict(cls.custom_settings or {}, priority='spider')

    def parse(self, response):
        loader = ItemLoader(item=PageLevelItem(), response=response)
        loader.add_xpath('page_source_html_lang', "//html/@lang")
        yield loader.load_item()

    def errback_domain(self, failure):
        self.logger.error(repr(failure))
Expectation:
How can I change settings through the API and Postman? I used CONCURRENT_REQUESTS as an example above; in some cases up to 10 settings may need to be changed through the API.
Update:
If we remove the my_settings = {} line and update_settings and write the code as below, an error (KeyError: 'CONCURRENT_REQUESTS') occurs when running scrapyd-deploy, because the class body is evaluated at import time, when CONCURRENT_REQUESTS has not yet been given a value.
Part of the code for this scenario:
class PageSpider(CrawlSpider):
    name = 'github'

    def __init__(self, *args, **kwargs):
        self.start_urls = kwargs.get('host_name')
        self.allowed_domains = [self.start_urls]
        my_settings['CONCURRENT_REQUESTS'] = int(kwargs.get('num_con_req'))
        self.logger.info(f'CONCURRENT_REQUESTS? {my_settings}')
        self.rules = (
            Rule(LinkExtractor(allow=(self.start_urls,), deny=(r'\.webp',), unique=True),
                 callback='parse',
                 follow=True),
        )
        super().__init__(*args, **kwargs)

    # Evaluated at import time, before __init__ ever runs.
    custom_settings = {
        'CONCURRENT_REQUESTS': my_settings['CONCURRENT_REQUESTS'],
    }
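The KeyError itself is not specific to Scrapy; it is plain Python class-body semantics. A minimal sketch (class name assumed) of why deployment fails:

my_settings = {}

class Demo:
    # The class body runs as soon as the module is imported
    # (which is what scrapyd-deploy does), long before __init__.
    # my_settings is still empty here, so this raises KeyError.
    value = my_settings['CONCURRENT_REQUESTS']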
Thanks to everyone.
I can say with 100% confidence that in Scrapy the user has no way to update spider settings at runtime (from spider.__init__, as attempted in the code from the question). By the time the spider.__init__ method is called, the Scrapy application has already initialised the process using the settings received earlier: the base settings, the project settings, and the spider's custom_settings (hardcoded in the spider's source code).
Related issues on the Scrapy GitHub issue tracker discuss updating settings from __init__ or from_crawler.

According to the scrapyd docs, to transfer Scrapy settings you need to set setting=DOWNLOAD_DELAY=2 in the query to scrapyd's schedule endpoint. As far as I know, this is the only supported way to transfer settings through scrapyd.
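For example, here is a minimal sketch of such a request using Python's requests library (the scrapyd address, project name, and URL are assumptions; host_name and num_con_req are the spider arguments from the question; the repeated setting parameters are the documented way to pass per-job Scrapy settings):

import requests

# Assumed scrapyd address and project name.
response = requests.post(
    'http://localhost:6800/schedule.json',
    data=[
        ('project', 'kavoush'),
        ('spider', 'github'),
        ('host_name', 'https://example.com'),
        # Repeat "setting" once per Scrapy setting you want to override.
        ('setting', 'CONCURRENT_REQUESTS=4'),
        ('setting', 'DOWNLOAD_DELAY=2'),
    ],
)
print(response.json())

In Postman this corresponds to adding several form fields with the same key, setting, one per Scrapy setting. Settings passed this way override the project settings for that job, so no update_settings override is needed in the spider.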