Tags: python, scrapy, pycharm, twisted

Scrapy: twisted.internet.error.ReactorNotRestartable from running CrawlerProcess()


I am trying to run my Scrapy spider from a script. I am using CrawlerProcess and I only have one spider to run.

I've been stuck on this error for a while now, and I've tried changing the settings in a lot of ways, but every time I run the spider I get

twisted.internet.error.ReactorNotRestartable

I've been searching for a fix, and from what I've found you should only get this error when you call process.start() more than once. But I only call it once.
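For reference, here is a minimal sketch (not my actual code) of the situation that normally produces this error: starting the reactor a second time after it has already run once.

# Sketch only: Twisted's reactor cannot be restarted within one Python
# process, so the second start() call raises ReactorNotRestartable.
import scrapy
from scrapy.crawler import CrawlerProcess

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'url': response.url}

process = CrawlerProcess()
process.crawl(DemoSpider)
process.start()   # runs the reactor until the crawl finishes

process.crawl(DemoSpider)
process.start()   # raises twisted.internet.error.ReactorNotRestartable

So in theory my script, which calls process.start() exactly once, should be fine.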

Here's my code:

import scrapy
from scrapy.utils.log import configure_logging

from scrapyprefect.items import ScrapyprefectItem
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = ['http://www.nigeria-law.org/A.A.%20Macaulay%20v.%20NAL%20Merchant%20Bank%20Ltd..htm']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def parse(self, response):
        item = ScrapyprefectItem()
        ...

        yield item


process = CrawlerProcess(settings=get_project_settings())
process.crawl('spider')
process.start()

Error:

Traceback (most recent call last):
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/scrapyprefect/spiders/spider.py", line 59, in <module>
    process.start()
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/scrapy/crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/twisted/internet/base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/twisted/internet/base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/twisted/internet/base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

I notice that this only happens when I try to save my items to MongoDB. pipeline.py:

import logging
import pymongo


class ScrapyprefectPipeline(object):
    collection_name = 'SupremeCourt'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # pull in information from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        # initializing spider
        # opening db connection
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # clean up when spider is closed
        self.client.close()

    def process_item(self, item, spider):
        # how to handle each post
        # insert_one replaces the deprecated Collection.insert()
        self.db[self.collection_name].insert_one(dict(item))
        logging.debug("Post added to MongoDB")
        return item
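For completeness, this pipeline assumes the usual wiring in the project's settings.py: the two Mongo settings it reads in from_crawler, plus an ITEM_PIPELINES entry enabling it. A minimal sketch, assuming the class lives in the project's default pipelines.py; the URI and database name are placeholders:

# settings.py (sketch; values are placeholders)
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapyprefect'

ITEM_PIPELINES = {
    'scrapyprefect.pipelines.ScrapyprefectPipeline': 300,
}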

If I change pipeline.py back to the default, which is...

import logging
import pymongo

class ScrapyprefectPipeline(object):
    def process_item(self, item, spider):
        return item

...the script runs fine. I'm thinking this has something to do with how I set up the PyCharm run configuration. So for reference, I'm also including my PyCharm settings: [screenshot of PyCharm run configuration]

I hope someone can help me. Let me know if you need more details.


Solution

  • Okay, I solved it. I think that when the scraper enters the pipeline's open_spider, spider.py is run again, which calls process.start() a second time.

    To solve the problem, I added this guard in the spider, so process.start() is only executed when the spider file is run directly:

    if __name__ == '__main__':
        process = CrawlerProcess(settings=get_project_settings())
        process.crawl('spider')
        process.start()
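
    A variation that avoids the re-import entirely is to keep the launcher code out of the spiders package altogether: Scrapy's spider loader imports every module under SPIDER_MODULES, so any module-level code in spider.py runs again when the spider name is looked up. A minimal sketch, assuming a hypothetical run.py at the project root and with the launcher lines removed from spider.py:

    # run.py (hypothetical launcher, outside the spiders/ package)
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from scrapyprefect.spiders.spider import SpiderSpider

    if __name__ == '__main__':
        process = CrawlerProcess(settings=get_project_settings())
        process.crawl(SpiderSpider)  # pass the class directly; no name lookup
        process.start()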