regex python-3.x scrapy scrapinghub

From local Scrapy to Scrapy Cloud (Scrapinghub) - Unexpected results


The scraper I deployed on Scrapy Cloud produces unexpected results compared to the local version. Locally, it extracts every field of a product item (from an online retailer) without trouble, but on Scrapy Cloud the "ingredients" field and the "list of prices" field are always empty. The attached picture shows the two fields that always come back empty, whereas they are filled in correctly on my machine. I'm using Python 3, and the stack was configured as scrapy:1.3-py3. At first I thought it was an issue with regex and Unicode, but it seems not. I tried everything (ur, ur RE.ENCODE ...) and nothing worked.

For the ingredients, my code is the following:

    import re  # at the top of the spider module

    # Inside the parse callback: collect every text node of the ingredients tab.
    data_box = response.xpath('//*[@id="ingredients"]').css('div.information__tab__content *::text').extract()
    data_inter = ''.join(data_box).strip()

    # Try an "Ingredients:" label first, then fall back to "Composition:".
    match1 = re.search(r'([Ii]ngr[ée]dients\s*:?)\s*(.*)', data_inter)
    match2 = re.search(r'([Cc]omposition\s*:?)\s*(.*)', data_inter)

    if match1:
        result_matching_ingredients = match1.group(2).replace('"', '').replace('.', '').replace(';', ',').strip()
    elif match2:
        result_matching_ingredients = match2.group(2).replace('"', '').replace('.', '').replace(';', ',').strip()
    else:
        result_matching_ingredients = ''

    ingredients = result_matching_ingredients

It seems that the match never occurs on Scrapy Cloud.
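To narrow this down, the matching logic can be exercised outside Scrapy on a plain string. The following is only a sketch: the helper name `extract_ingredients` and the sample text are made up for illustration, and the patterns are the ones above with `\:{0,1}` written as `:?`.

```python
import re

# Label patterns taken from the snippet above; ':?' is equivalent to '\:{0,1}'.
LABEL_PATTERNS = (
    r'[Ii]ngr[ée]dients\s*:?\s*(.*)',
    r'[Cc]omposition\s*:?\s*(.*)',
)


def extract_ingredients(text):
    """Return the cleaned text that follows the first matching label, or ''."""
    for pattern in LABEL_PATTERNS:
        match = re.search(pattern, text)
        if match:
            cleaned = match.group(1).replace('"', '').replace('.', '').replace(';', ',')
            return cleaned.strip()
    return ''


# Hypothetical sample, not taken from the real site.
sample = 'Ingrédients : farine de blé; sucre; sel.'
print(repr(extract_ingredients(sample)))
```

If this behaves as expected under the local interpreter but the same input yields an empty string in the cloud job, the difference is more likely in the runtime than in the pattern itself.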

For the prices, my code is the following:

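    list_prices = []

    # list_packaging holds the selectors for each product variant (not shown here).
    for package in list_packaging:
        # default='' avoids an AttributeError when the title node is missing.
        tonnage = package.css('div.product__varianttitle::text').extract_first(default='').strip()
        # Match a per-kilo price of the form "(<digits>,<digits> € / kg)".
        prix_inter = ''.join(package.css('span.product__smallprice__text').re(r'\(\s*\d+,\d*\s*€\s*/\s*kg\)'))
        # Strip the decoration and switch to a decimal point.
        prix = prix_inter.replace("(", "").replace(")", "").replace("/", "").replace("€", "").replace("kg", "").replace(",", ".").strip()

        list_prices.append(prix)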

It's the same story: the list is still empty.
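The price pattern can be checked in isolation in the same way. This is a rough sketch under stated assumptions: `price_per_kg` is a hypothetical helper, the sample markup is invented, and unlike the loop above it keeps only the first match instead of joining all of them.

```python
import re

# Same shape as the pattern above, with capture groups for the two number parts.
PRICE_RE = re.compile(r'\(\s*(\d+),(\d*)\s*€\s*/\s*kg\)')


def price_per_kg(text):
    """Return the first per-kilo price found in text as 'units.decimals', or ''."""
    match = PRICE_RE.search(text)
    if not match:
        return ''
    return '%s.%s' % match.groups()


# Hypothetical markup, not copied from the real page.
sample = '<span class="product__smallprice__text">(12,50 € / kg)</span>'
print(repr(price_per_kg(sample)))
```

Because the pattern contains a literal `€`, it is a convenient probe for encoding or interpreter differences between the two environments.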

To repeat: everything works fine on my local version. These two fields are the only ones causing problems. I'm extracting a bunch of other data (with regex too) on Scrapy Cloud and I'm very satisfied with it.

Any ideas?

[Image: screenshot of the two fields that come back empty]


Solution

  • Check that Scrapinghub's job logs show the expected version of Python, even if the stack is correctly set in the project's yml file.
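One way to make that check explicit is to have the job log its own interpreter version at start-up. A minimal sketch using only the standard library; the helper name `describe_runtime` is made up for illustration.

```python
import logging
import platform
import sys


def describe_runtime():
    """Return a one-line summary of the interpreter running this code."""
    return 'Python %s (major version %d)' % (
        platform.python_version(),
        sys.version_info[0],
    )


logging.basicConfig(level=logging.INFO)
logging.getLogger(__name__).info(describe_runtime())
```

Inside a spider, the same string could be passed to `self.logger.info(...)` so that it appears in the job log. If the log reports a 2.x interpreter despite the py3 stack, the non-ASCII characters in the patterns (`é`, `€`) are a plausible reason for the failed matches, though that is an assumption rather than something confirmed in the question.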