web-scraping, web-crawler, python-requests-html

Opening a link after manually building it works, but from code it does not


I have a website that uses two API calls to build the actual link for downloading a gzip file. The problem is that the headers, and I think the cookies too, change frequently. I've tried to work out which fields stay the same and how the others change, but no luck so far. The website is this

I'm using the urllib library as follows:

import urllib.request

request = urllib.request.Request(link, headers=headers)
response = urllib.request.urlopen(request)

Solution

  • I used the following website to find out the proper headers to send with the requests you'll make.

    Usually you don't need to send cookies in the request, but for some reason your website sometimes didn't work without them (or there might be another explanation).

    I used this code to get the cookie values:

    import http.cookiejar
    import urllib.request

    cookie_jar = http.cookiejar.CookieJar()
    cookie_processor = urllib.request.HTTPCookieProcessor(cookie_jar)
    opener = urllib.request.build_opener(cookie_processor)
    response = opener.open(base_url)
    # Cookie header pairs are separated by "; ", not commas
    cookie_value = "; ".join(f"{cookie.name}={cookie.value}" for cookie in cookie_jar)
    
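    One detail worth noting: the Cookie request header separates its name=value pairs with "; " rather than commas (per RFC 6265). A small sketch of just the joining step (the helper name and the sample cookie names here are hypothetical, not from the original answer):

```python
def build_cookie_header(pairs):
    # Join (name, value) pairs with "; ", the separator
    # the Cookie request header expects (RFC 6265)
    return "; ".join(f"{name}={value}" for name, value in pairs)

build_cookie_header([("session", "abc123"), ("lang", "en")])
# → "session=abc123; lang=en"
```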

    Then I inserted it into the headers I got from the above website:

    headers1 = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
        "Cookie": cookie_value,
        "Host": "prices.super-pharm.co.il",
        "Referer": f"{base_url}/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
    }
    headers2 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
        "Cookie": cookie_value,
        "Referer": f"{base_url}/",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    }
    

    Then I used the following code in order to get the final API URL for the gzip file:

    import json

    link1 = f"{base_url}{href1}"
    request = urllib.request.Request(link1, headers=headers1)
    response = urllib.request.urlopen(request)
    res = response.read()
    # json.loads parses the JSON response and already unescapes any "\/"
    # sequences, so no manual .replace() is needed afterwards
    href2 = json.loads(res.decode("utf-8"))["href"]
    link2 = f"http://prices.super-pharm.co.il{href2}"
    request = urllib.request.Request(link2, headers=headers2)
    response = urllib.request.urlopen(request)
    

    And finally you can read the response to get your file.