web-scraping dynamic xmlhttprequest nonce

Scraping a website with dynamic wdtNonce parameter


I am pretty much self-taught in web scraping and I don't really have a deep understanding of the inner workings of a webpage.

However, I've been able to scrape every website I've gotten my hands on.

Until I tried this one.

My goal is to be able to choose the date and download the corresponding prices.

By examining the network traffic, I have been able to replicate the HTTP Request that yields the desired response in JSON format.

The aforementioned request's payload looks like this:

    {
    "draw": "5",
    "columns[0][data]": "0",
    "columns[0][name]": "wdt_ID",
    "columns[0][searchable]": "true",
    "columns[0][orderable]": "false",
    "columns[0][search][value]": "",
    "columns[0][search][regex]": "false",
    "columns[1][data]": "1",
    "columns[1][name]": "date",
    "columns[1][searchable]": "true",
    "columns[1][orderable]": "false",
    "columns[1][search][value]": "26+Feb+2024|26+Feb+2024",
    "columns[1][search][regex]": "false",
    "columns[2][data]": "2",
    "columns[2][name]": "mtu",
    "columns[2][searchable]": "true",
    "columns[2][orderable]": "false",
    "columns[2][search][value]": "|",
    "columns[2][search][regex]": "false",
    "columns[3][data]": "3",
    "columns[3][name]": "almcpmwh",
    "columns[3][searchable]": "true",
    "columns[3][orderable]": "false",
    "columns[3][search][value]": "",
    "columns[3][search][regex]": "false",
    "columns[4][data]": "4",
    "columns[4][name]": "alvolumemwh",
    "columns[4][searchable]": "true",
    "columns[4][orderable]": "false",
    "columns[4][search][value]": "",
    "columns[4][search][regex]": "false",
    "columns[5][data]": "5",
    "columns[5][name]": "alnetpositionmwh",
    "columns[5][searchable]": "true",
    "columns[5][orderable]": "false",
    "columns[5][search][value]": "",
    "columns[5][search][regex]": "false",
    "columns[6][data]": "6",
    "columns[6][name]": "ksmcpmwh",
    "columns[6][searchable]": "true",
    "columns[6][orderable]": "false",
    "columns[6][search][value]": "",
    "columns[6][search][regex]": "false",
    "columns[7][data]": "7",
    "columns[7][name]": "ksvolumemwh",
    "columns[7][searchable]": "true",
    "columns[7][orderable]": "false",
    "columns[7][search][value]": "",
    "columns[7][search][regex]": "false",
    "columns[8][data]": "8",
    "columns[8][name]": "ksnetpositionmwh",
    "columns[8][searchable]": "true",
    "columns[8][orderable]": "false",
    "columns[8][search][value]": "",
    "columns[8][search][regex]": "false",
    "columns[9][data]": "9",
    "columns[9][name]": "datetime",
    "columns[9][searchable]": "true",
    "columns[9][orderable]": "false",
    "columns[9][search][value]": "|",
    "columns[9][search][regex]": "false",
    "start": "0",
    "length": "25",
    "search[value]": "",
    "search[regex]": "false",
    "sumColumns[]": [
        "alvolumemwh",
        "ksvolumemwh",
        "alnetpositionmwh",
        "ksnetpositionmwh"
    ],
    "avgColumns[]": [
        "almcpmwh",
        "ksmcpmwh"
    ],
    "minColumns[]": [
        "almcpmwh",
        "ksmcpmwh"
    ],
    "maxColumns[]": [
        "almcpmwh",
        "ksmcpmwh"
    ],
    "wdtNonce": "c201b4ccc3"
    }

So far so good. Everything works fine and I am able to choose the date and download the data I want.

However, the value of this parameter

    "wdtNonce": "c201b4ccc3"

seems to be dynamic and after a while the default value that I am using stops being valid and the request returns no data.

Is there a way to make this persistent?

Is there a way to automatically renew the value of the parameter to a valid one?

Is there a way to circumvent this?

How does my browser "know" beforehand which value it should use for this parameter?

Is this a built-in feature intended to block scraping?

I am not posting my code because the code itself works without any problems. Thank you in advance!


Solution

  • Usually, these dynamic strings or tokens come from one of two sources:

    1. They are generated on the page by JavaScript.
    2. They come from the response to a previous request, e.g. as a cookie, a response header, or part of the body.

    It can also be a combination of both, or some other mechanism.

    On this particular website, the token is embedded in the main HTML page itself:

    GET https://alpex.al/market-results/

    The token is in the value attribute of the input tag whose id is 'wdtNonceFrontendEdit_53'.

    You can first fetch the main page, parse it, and apply the following XPath to extract the wdtNonce value, then use that value in the API payload.

    //input[contains(@id, 'wdtNonceFrontendEdit')]/@value
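As a rough illustration, the same match can be done with only the Python standard library. This is a minimal sketch, not the only way to do it: `NonceParser`, `extract_nonce` and `fetch_nonce` are hypothetical names, the browser-like User-Agent header and UTF-8 decoding are assumptions, and with lxml installed the XPath above could be applied directly instead.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.request import Request, urlopen


class NonceParser(HTMLParser):
    """Remembers the value of the first <input> whose id contains a marker."""

    def __init__(self, id_marker: str = "wdtNonceFrontendEdit"):
        super().__init__()
        self.id_marker = id_marker
        self.nonce: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Equivalent in spirit to //input[contains(@id, '...')]/@value
        if tag != "input" or self.nonce is not None:
            return
        attributes = dict(attrs)
        if self.id_marker in (attributes.get("id") or ""):
            self.nonce = attributes.get("value")


def extract_nonce(html: str) -> Optional[str]:
    """Return the nonce found in the HTML text, or None if there is none."""
    parser = NonceParser()
    parser.feed(html)
    return parser.nonce


def fetch_nonce(page_url: str = "https://alpex.al/market-results/") -> Optional[str]:
    """Download the main page and pull the current nonce out of it."""
    # Assumption: the site may expect a browser-like User-Agent.
    request = Request(page_url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        # Assumption: the page is served as UTF-8.
        html = response.read().decode("utf-8", errors="replace")
    return extract_nonce(html)
```

Calling `fetch_nonce()` before building the payload and assigning the result to `payload["wdtNonce"]` should replace the hard-coded value.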

    You should re-fetch the main page whenever the token stops working, extract the new value, and use it in the API payload to crawl the data. See https://stackoverflow.com/a/78007031/11809002 for more on how to crawl data responsibly and why websites use tokens.
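The "refresh when it stops working" idea could be sketched roughly as follows. Everything here is hypothetical scaffolding: `post_with_fresh_nonce` is an illustrative name, `get_nonce` stands for whatever function extracts the token from the main page, `send` stands for your existing, working request code, and `looks_stale` is a check you supply yourself, since the only known symptom is that the request "returns no data".

```python
from typing import Any, Callable, Dict


def post_with_fresh_nonce(
    payload: Dict[str, Any],
    get_nonce: Callable[[], str],
    send: Callable[[Dict[str, Any]], Any],
    looks_stale: Callable[[Any], bool],
    max_refreshes: int = 1,
) -> Any:
    """Send the payload; if the reply looks stale, refresh wdtNonce and retry."""
    # Fill in a nonce if the caller did not provide one.
    if "wdtNonce" not in payload:
        payload = {**payload, "wdtNonce": get_nonce()}
    result = send(payload)
    for _ in range(max_refreshes):
        if not looks_stale(result):
            break
        # The old token appears to have expired: fetch a new one and resend.
        payload = {**payload, "wdtNonce": get_nonce()}
        result = send(payload)
    return result
```

Keeping `max_refreshes` small avoids hammering the main page when a date simply has no rows, which an empty-response check alone cannot tell apart from an expired token.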