I am trying to run Scrapy from a script. I am using CrawlerProcess and I only have one spider to run.
I've been stuck on this error for a while now, and I've tried a lot of things to change the settings, but every time I run the spider I get:
twisted.internet.error.ReactorNotRestartable
I've been searching for a way to solve this error, and I believe you should only get it when you call process.start() more than once. But I don't.
Here's my code:
import scrapy
from scrapy.utils.log import configure_logging
from scrapyprefect.items import ScrapyprefectItem
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = ['http://www.nigeria-law.org/A.A.%20Macaulay%20v.%20NAL%20Merchant%20Bank%20Ltd..htm']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def parse(self, response):
        item = ScrapyprefectItem()
        ...
        yield item


process = CrawlerProcess(settings=get_project_settings())
process.crawl('spider')
process.start()
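(Side note: process.crawl also accepts the spider class directly instead of the registered name string, which can be handy when running outside a full project; a minimal variation of the last three lines, assuming the same SpiderSpider class defined above:)

process = CrawlerProcess(settings=get_project_settings())
process.crawl(SpiderSpider)  # pass the class instead of the 'spider' name
process.start()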
Error:
Traceback (most recent call last):
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/scrapyprefect/spiders/spider.py", line 59, in <module>
    process.start()
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/scrapy/crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False) # blocking call
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/twisted/internet/base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/twisted/internet/base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/twisted/internet/base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
I notice that this only happens when I'm trying to save my items to MongoDB. Here is pipeline.py:
import logging
import pymongo


class ScrapyprefectPipeline(object):

    collection_name = 'SupremeCourt'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # pull in information from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        # initializing spider
        # opening db connection
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # clean up when spider is closed
        self.client.close()

    def process_item(self, item, spider):
        # how to handle each post
        self.db[self.collection_name].insert(dict(item))
        logging.debug("Post added to MongoDB")
        return item
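(Side note: pymongo's Collection.insert() is deprecated in PyMongo 3.x and removed in 4.x; insert_one() is the current call for writing a single document. A minimal sketch of the same process_item using it, everything else unchanged:)

    def process_item(self, item, spider):
        # insert_one() is the non-deprecated single-document write
        self.db[self.collection_name].insert_one(dict(item))
        logging.debug("Post added to MongoDB")
        return item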
If I change the pipeline.py to the default, which is...
import logging
import pymongo


class ScrapyprefectPipeline(object):

    def process_item(self, item, spider):
        return item
...the script runs fine. I'm thinking this has something to do with how I set up the PyCharm run configuration, so for reference I'm also including my PyCharm settings.
I hope someone can help me. Let me know if you need more details.
Okay, I solved it. I think that when the scraper enters open_spider in the pipeline, spider.py gets imported again, which runs the module-level code and calls process.start() a second time.
To solve the problem, I added this guard to the spider so that process.start() is only executed when you run spider.py directly:
if __name__ == '__main__':
    process = CrawlerProcess(settings=get_project_settings())
    process.crawl('spider')
    process.start()
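For anyone wondering why the guard helps: any code at module level runs every time the module is imported, not only when the file is executed directly. A tiny standalone sketch (a hypothetical demo.py, unrelated to Scrapy) shows the difference:

# demo.py - hypothetical example: top-level code runs on every import
print("top-level code, __name__ is", __name__)

if __name__ == '__main__':
    # runs only for `python demo.py`, not for `import demo`
    print("running as a script")

With the guard in spider.py, re-importing the module no longer reaches process.start(), so the Twisted reactor is started only once and the ReactorNotRestartable error goes away.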