Scrapinghub recently dropped periodic jobs from their free plan, which is what I used to run my Scrapy crawlers.
Therefore, I decided to use Scrapyd instead. So I went ahead and got a virtual server running Ubuntu 16.04. (This is my first time setting up and running a server, so please bear with me.)
Following the instructions on scrapyd.readthedocs.io I installed Scrapyd using pip:
$ pip install scrapyd
(That was after I figured out that the recommended way for Ubuntu, using apt-get, is no longer supported; see GitHub.)
Then I logged into my server over SSH and started Scrapyd by simply running
$ scrapyd
Everything looks fine as far as I can tell:
2017-10-30 17:31:19+0000 [-] Log opened.
2017-10-30 17:31:19+0000 [-] twistd 16.0.0 (/usr/bin/python 2.7.12) starting up.
2017-10-30 17:31:19+0000 [-] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-10-30 17:31:19+0000 [-] Site starting on 6800
2017-10-30 17:31:19+0000 [-] Starting factory <twisted.web.server.Site instance at 0x7f644752bfc8>
2017-10-30 17:31:19+0000 [Launcher] Scrapyd 1.2.0 started: max_proc=4, runner=u'scrapyd.runner'
I would expect to see a web interface (described here) when I go to my IP at http://82.165.102.18:6800.
Instead, I just get the error message "This site can’t be reached 82.165.102.18 refused to connect."
When I try to run Scrapyd locally, everything works just fine, and I get the web interface at http://localhost:6800/.
I have tried disabling the firewall (UFW), but that didn't help.
At this point, I am lost. If you have any ideas, please let me know!
Thanks a lot!
If you can reach your Scrapyd instance locally but not over the network, Scrapyd is probably listening only on localhost. Make sure the [scrapyd] section of your scrapyd.conf contains this line:
bind_address = 0.0.0.0
It instructs Scrapyd to listen on all interfaces. bind_address defaults to 127.0.0.1, so by default it only listens on localhost.
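To confirm which addresses Scrapyd actually accepts connections on, a small TCP check can help. The following is only a sketch, assuming Python 3 is available; the helper name is_port_open is made up for illustration and is not part of Scrapyd.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper for illustration; not part of Scrapyd.
    """
    try:
        # create_connection raises OSError if the connection is
        # refused, times out, or the host cannot be resolved.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run on the server itself, is_port_open("127.0.0.1", 6800) should return True while Scrapyd is running. If is_port_open("82.165.102.18", 6800) returns False from the same machine, Scrapyd is most likely still bound to localhost only. If it returns True there but the page is still unreachable from your own computer, a firewall between the two is the more likely cause.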