I have been tracking Tesco's beer webshop prices for a while, but this Monday they implemented something I am struggling to get past. I use requests to fetch the HTML content and then scrapy to extract the required data. Passing only a User-Agent in the headers was enough to get a response until now, but suddenly I get no usable response.
The only way I could get my script working again was to open the link in a browser manually and copy the entire Cookie from the request headers into my Python script's headers next to the User-Agent.
Yet I don't know how to generate the required part of this Cookie automatically. The script is supposed to run daily, but now I have to update the Cookie header by hand every time.
Is there any way to automate this?
Thanks very much in advance! Mate
import requests
from scrapy import Selector
headers = {
'Cookie': 'here comes the entire requests cookie string from browser manually',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'
}
page_url = 'https://bevasarlas.tesco.hu/groceries/hu-HU/shop/alkohol/sor-cider/all'
html = requests.get(url = page_url, headers = headers, timeout = 10).text  # .text, since Selector needs str
sel = Selector(text = html)
scraped_data = sel.xpath('//li[contains(@class, "product-list--list-item")]')
card = scraped_data[0]
name = card.xpath('.//span[@class="styled__Text-sc-1xbujuz-1 ldbwMG beans-link__text"]/text()').extract()
If you look at the HTML returned by requests.get when no cookie is sent, it contains a challenge like this:
<script>
var i = 1686331817;
var j = i + Number("6194" + "69765");
</script>
xhr.open("POST", "/_sec/verify?provider=interstitial", false);
xhr.setRequestHeader("Content-Type", "application/json");
xhr.send(JSON.stringify({
"bm-verify": "AAQAAAAH/////7F2shgQtGib96HBSdl5...",
"pow": j
}));
What happens is that j is calculated and sent along with bm-verify to the /_sec/ URL.
This then gives you an authenticated cookie, and the web browser reloads the page.
You can implement these steps manually.
One other thing of note is that the Referer header appears to be checked, so you must set it.
import re
import requests
from scrapy import Selector
site = 'https://bevasarlas.tesco.hu'
sec = site + '/_sec/verify?provider=interstitial'
page = site + '/groceries/hu-HU/shop/alkohol/sor-cider/all'
headers = {
'User-Agent': (
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'
),
'Referer': page,
}
r = requests.get(page, headers=headers)
html = r.content
# extract `i`, `j` and `bm-verify`
i = re.search(rb'var i = (\d+)', html)[1]
j = re.search(rb'var j = i [+] Number[(]"(\d+)" [+] "(\d+)"[)]', html)
j = j[1] + j[2]
payload = {
'bm-verify': re.search(rb'"bm-verify"\s*:\s*"([^"]+)', html)[1].decode(),
'pow': int(i) + int(j)
}
rr = requests.post(sec, cookies=r.cookies, json=payload, headers=headers)
rrr = requests.get(page, cookies=rr.cookies, headers=headers)
html = rrr.text  # .text (str), since Selector(text=...) does not accept bytes
sel = Selector(text = html)
scraped_data = sel.xpath('//li[contains(@class, "product-list--list-item")]')
card = scraped_data[0]
name = card.xpath('.//span[@class="styled__Text-sc-1xbujuz-1 ldbwMG beans-link__text"]/text()').extract()
print(name)
Output:
['Rastinger világos sör 4% 500 ml']
You can probably clean up the code by using a Session object and letting that handle the cookies/headers for you.
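For example, a minimal sketch of that Session-based version might look like the following. The helper and function names (`solve_pow`, `fetch_page`) are my own, not part of any library; the regexes are the same ones used above, and the session carries cookies between the three requests automatically, so the explicit `cookies=` arguments disappear:

```python
import re


def solve_pow(html: bytes) -> int:
    """Recompute the challenge's `pow` value from the interstitial script.

    Mirrors the JavaScript: var j = i + Number("a" + "b");
    where Number("a" + "b") is string concatenation, not addition.
    """
    i = int(re.search(rb'var i = (\d+)', html)[1])
    m = re.search(rb'var j = i [+] Number[(]"(\d+)" [+] "(\d+)"[)]', html)
    j = int(m[1] + m[2])  # concatenate the two digit strings, then parse
    return i + j


def fetch_page(page_url: str, sec_url: str, headers: dict) -> str:
    """Hypothetical helper: pass the interstitial once, then fetch the page."""
    import requests  # imported here so solve_pow above stays stdlib-only

    with requests.Session() as s:
        s.headers.update(headers)          # User-Agent and Referer on every request
        r = s.get(page_url)                # first hit returns the challenge page
        bm = re.search(rb'"bm-verify"\s*:\s*"([^"]+)"', r.content)[1].decode()
        # the session keeps the cookies from r and rr for us
        s.post(sec_url, json={'bm-verify': bm, 'pow': solve_pow(r.content)})
        return s.get(page_url).text        # now returns the real product listing
```

This is a sketch under the assumption that the challenge format stays as shown above; if Tesco changes the script, the regexes will need updating.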