python, web-scraping, scrapy, scrapyd

Scrapyd - URL parsing problem when passed as a spider argument


I added the following code in my Spider class to be able to pass the URL as an argument:

def __init__(self, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    # Strip the backslashes left behind by shell escaping before using the URL
    self.start_urls = [kwargs.get('target_url').replace('\\', '')]

(The replace call removes the backslashes introduced by escaping the URL in the shell.)
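
For reference, a minimal version of the full spider looks roughly like this (the class name, the name attribute and the parse callback here are placeholders for illustration):

import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # target_url arrives via `scrapy crawl -a target_url=...` or via
        # scrapyd's schedule.json; strip backslashes left by shell escaping.
        self.start_urls = [kwargs.get('target_url').replace('\\', '')]

    def parse(self, response):
        # Placeholder callback: just log which URL was actually requested.
        self.logger.info('Crawled %s', response.url)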

The spider recognizes the URL, starts parsing, and closes cleanly when I run the following locally:

scrapy crawl my_spider -a target_url="https://www.example.com/list.htm\?tri\=initial\&enterprise\=0\&idtypebien\=2,1\&pxMax\=1000000\&idtt\=2,5\&naturebien\=1,2,4\&ci\=910377"

However, when I do the same thing through scrapyd, and I run:

curl https://my_spider.herokuapp.com/schedule.json -d project=default -d spider=my_spider -d target_url="https://www.example.com/list.htm\?tri\=initial\&enterprise\=0\&idtypebien\=2,1\&pxMax\=1000000\&idtt\=2,5\&naturebien\=1,2,4\&ci\=910377"

I get an error because the URL isn't parsed the same way as when using scrapy crawl.

LOG:

2019-08-08 22:52:34 [scrapy.core.engine] INFO: Spider opened
2019-08-08 22:52:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-08 22:52:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-08-08 22:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/list.htm?tri=initial> (referer: http://www.example.com)
2019-08-08 22:52:34 [scrapy.core.engine] INFO: Closing spider (finished)
2019-08-08 22:52:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 267,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 35684,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.680357,

After some experimentation, I discovered that when the URL is passed as a spider argument through scrapyd, it gets truncated at the first & character.

Any insight into how to fix this behavior?


Solution

  • I managed to solve my problem. The issue was with the way the POST request was being sent through cURL, not with Scrapyd.

    After inspecting this request:

    curl -v  http://example.herokuapp.com/schedule.json -d project=default -d spider=my_spider -d target_url="https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377" --trace-ascii /dev/stdout
    

    I got:

    Warning: --trace-ascii overrides an earlier trace/verbose option
    == Info:   Trying 52.45.74.184...
    == Info: TCP_NODELAY set
    == Info: Connected to example.herokuapp.com (52.45.74.184) port 80 (#0)
    => Send header, 177 bytes (0xb1)
    0000: POST /schedule.json HTTP/1.1
    001e: Host: example.herokuapp.com
    0043: User-Agent: curl/7.54.0
    005c: Accept: */*
    0069: Content-Length: 164
    007e: Content-Type: application/x-www-form-urlencoded
    00af:
    => Send data, 164 bytes (0xa4)
    0000: project=default&spider=example&target_url=https://www.example.co
    0040: m/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000
    0080: &idtt=2,5&naturebien=1,2,4&ci=910377
    == Info: upload completely sent off: 164 out of 164 bytes
    

    Apparently, since the POST body is sent as application/x-www-form-urlencoded, the request is effectively equivalent to:

    http://example.herokuapp.com/schedule.json?project=default&spider=example&target_url=https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377
    

    Every & is treated as the start of a new parameter, so the only part of the URL that ends up in target_url is https://www.example.com/list.htm?tri=initial; the rest is interpreted as additional parameters of the POST request.
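
    The splitting is easy to reproduce with Python's standard library by parsing the same form-urlencoded body that curl sent (a quick illustration, not part of the fix):

    from urllib.parse import parse_qs

    # The exact form-urlencoded body from the curl trace above.
    body = ("project=default&spider=example&target_url=https://www.example.com"
            "/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000"
            "&idtt=2,5&naturebien=1,2,4&ci=910377")

    params = parse_qs(body)
    print(params['target_url'])
    # ['https://www.example.com/list.htm?tri=initial']
    # Everything after the first unencoded '&' ('enterprise', 'pxMax', ...)
    # shows up as separate keys in `params` instead of staying in the URL.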

    After using Postman and trying the following POST request:

    POST /schedule.json HTTP/1.1
    Host: example.herokuapp.com
    Content-Type: multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW
    cache-control: no-cache
    Postman-Token: 004990ad-8f83-4208-8d36-529376b79643

    ------WebKitFormBoundary7MA4YWxkTrZu0gW
    Content-Disposition: form-data; name="project"

    default
    ------WebKitFormBoundary7MA4YWxkTrZu0gW
    Content-Disposition: form-data; name="spider"

    my_spider
    ------WebKitFormBoundary7MA4YWxkTrZu0gW
    Content-Disposition: form-data; name="target_url"

    https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377
    ------WebKitFormBoundary7MA4YWxkTrZu0gW--
    

    It worked, and the job started successfully on Scrapyd!

    Through cURL, using -F (which sends the fields as multipart/form-data) instead of -d worked perfectly:

    curl https://example.herokuapp.com/schedule.json -F project=default -F spider=my_spider -F target_url="https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377"
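
    If you are scheduling jobs from Python rather than the shell, another option is to let an HTTP library do the form encoding for you. Here is a short sketch with requests (assuming the same endpoint and project/spider names): the & characters inside the value get percent-encoded to %26, so scrapyd receives a single, intact target_url.

    import requests

    # requests form-encodes the values, so the '&' inside target_url
    # becomes '%26' and the URL is no longer split into extra parameters.
    resp = requests.post(
        'https://example.herokuapp.com/schedule.json',
        data={
            'project': 'default',
            'spider': 'my_spider',
            'target_url': ('https://www.example.com/list.htm?tri=initial'
                           '&enterprise=0&idtypebien=2,1&pxMax=1000000'
                           '&idtt=2,5&naturebien=1,2,4&ci=910377'),
        },
    )
    print(resp.json())  # e.g. {'status': 'ok', 'jobid': '...'}

    On the curl side, the --data-urlencode option should also work, since it percent-encodes the value while still sending application/x-www-form-urlencoded.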