I am largely self-taught in web scraping and don't have a deep understanding of the inner workings of a web page.
Even so, I've managed to scrape every website I've tried so far.
Until I hit this one.
My goal is to be able to choose the date and download the corresponding prices.
By examining the network traffic, I was able to replicate the HTTP request that yields the desired response in JSON format.
That request's payload looks like this:
{
"draw": "5",
"columns[0][data]": "0",
"columns[0][name]": "wdt_ID",
"columns[0][searchable]": "true",
"columns[0][orderable]": "false",
"columns[0][search][value]": "",
"columns[0][search][regex]": "false",
"columns[1][data]": "1",
"columns[1][name]": "date",
"columns[1][searchable]": "true",
"columns[1][orderable]": "false",
"columns[1][search][value]": "26+Feb+2024|26+Feb+2024",
"columns[1][search][regex]": "false",
"columns[2][data]": "2",
"columns[2][name]": "mtu",
"columns[2][searchable]": "true",
"columns[2][orderable]": "false",
"columns[2][search][value]": "|",
"columns[2][search][regex]": "false",
"columns[3][data]": "3",
"columns[3][name]": "almcpmwh",
"columns[3][searchable]": "true",
"columns[3][orderable]": "false",
"columns[3][search][value]": "",
"columns[3][search][regex]": "false",
"columns[4][data]": "4",
"columns[4][name]": "alvolumemwh",
"columns[4][searchable]": "true",
"columns[4][orderable]": "false",
"columns[4][search][value]": "",
"columns[4][search][regex]": "false",
"columns[5][data]": "5",
"columns[5][name]": "alnetpositionmwh",
"columns[5][searchable]": "true",
"columns[5][orderable]": "false",
"columns[5][search][value]": "",
"columns[5][search][regex]": "false",
"columns[6][data]": "6",
"columns[6][name]": "ksmcpmwh",
"columns[6][searchable]": "true",
"columns[6][orderable]": "false",
"columns[6][search][value]": "",
"columns[6][search][regex]": "false",
"columns[7][data]": "7",
"columns[7][name]": "ksvolumemwh",
"columns[7][searchable]": "true",
"columns[7][orderable]": "false",
"columns[7][search][value]": "",
"columns[7][search][regex]": "false",
"columns[8][data]": "8",
"columns[8][name]": "ksnetpositionmwh",
"columns[8][searchable]": "true",
"columns[8][orderable]": "false",
"columns[8][search][value]": "",
"columns[8][search][regex]": "false",
"columns[9][data]": "9",
"columns[9][name]": "datetime",
"columns[9][searchable]": "true",
"columns[9][orderable]": "false",
"columns[9][search][value]": "|",
"columns[9][search][regex]": "false",
"start": "0",
"length": "25",
"search[value]": "",
"search[regex]": "false",
"sumColumns[]": [
"alvolumemwh",
"ksvolumemwh",
"alnetpositionmwh",
"ksnetpositionmwh"
],
"avgColumns[]": [
"almcpmwh",
"ksmcpmwh"
],
"minColumns[]": [
"almcpmwh",
"ksmcpmwh"
],
"maxColumns[]": [
"almcpmwh",
"ksmcpmwh"
],
"wdtNonce": "c201b4ccc3"
}
So far so good. Everything works fine and I am able to choose the date and download the data I want.
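As an aside, the `+` signs in the captured `columns[1][search][value]` ("26+Feb+2024|26+Feb+2024") are almost certainly just the form-encoding of spaces, so the underlying filter value is "26 Feb 2024|26 Feb 2024". A small helper to build it for any date might look like this (`date_filter` is a hypothetical name, not something from the site):

```python
from datetime import date

def date_filter(d: date) -> str:
    """Build the 'DD Mon YYYY|DD Mon YYYY' range string the table filter expects.

    The '+' characters seen in the captured payload are form-encoded spaces;
    if you send the payload as a dict via requests.post(..., data=payload),
    the encoding is applied for you, so plain spaces are used here.
    """
    s = d.strftime("%d %b %Y")  # e.g. "26 Feb 2024"
    return f"{s}|{s}"
```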
However, the value of this parameter
"wdtNonce": "c201b4ccc3"
appears to be dynamic: after a while the hard-coded value I am using stops being valid and the request returns no data.
Is there a way to make this persistent?
Is there a way to automatically renew the value of the parameter to a valid one?
Is there a way to circumvent this?
How does my browser "know" beforehand which value it should use for this parameter?
Is this a built-in feature intended to block scraping?
I am not posting my code because the code itself works without any problems. Thank you in advance!
Usually, these dynamic strings or tokens come from one of two sources: a previous response (for example, a cookie or an earlier API call), the HTML page itself, or a combination of the two.
On this particular website, the token comes from the main HTML page itself:
GET https://alpex.al/market-results/
The token is in the value attribute of the input tag whose id is 'wdtNonceFrontendEdit_53'.
You can first fetch the main page, parse it, and apply the following XPath to extract the wdtNonce value, then use it in the API payload:
//input[contains(@id, 'wdtNonceFrontendEdit')]/@value
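With lxml, that XPath can be applied like this (a sketch: `extract_nonce` is a hypothetical helper name, and lxml is an assumption, since any XPath-capable HTML parser would do):

```python
from lxml import html

def extract_nonce(page_html: str) -> str:
    """Pull the wdtNonce value out of the main page's HTML."""
    tree = html.fromstring(page_html)
    values = tree.xpath("//input[contains(@id, 'wdtNonceFrontendEdit')]/@value")
    if not values:
        raise ValueError("wdtNonce input not found; the page layout may have changed")
    return values[0]
```

Feed it the response body of a plain GET to https://alpex.al/market-results/ and drop the returned string into the `wdtNonce` field of the payload.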
You should re-fetch the main page whenever the token stops working, extract a fresh value, and use it in the API payload to crawl the data. Refer to https://stackoverflow.com/a/78007031/11809002 for more on why websites use such tokens and how to crawl data responsibly.
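That renewal step can be sketched as a small wrapper. All names here are hypothetical stand-ins for your own code: `fetch_table(nonce)` would POST the payload with the given nonce and return the parsed rows, and `fetch_nonce()` would re-download the main page and extract a fresh token as shown above.

```python
def fetch_with_refresh(fetch_table, fetch_nonce, nonce):
    """Run the request; if it comes back empty, renew the nonce once and retry.

    Returns (rows, nonce_used) so the caller can keep reusing the nonce
    until it expires again, instead of re-fetching the page on every call.
    """
    rows = fetch_table(nonce)
    if not rows:  # an expired nonce makes the endpoint return no data
        nonce = fetch_nonce()
        rows = fetch_table(nonce)
    return rows, nonce
```

This avoids hammering the main page on every request: the page is only re-fetched when the cached token actually stops working.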