I have been trying to download a web page that I ultimately intend to scrape. The page uses JavaScript and includes a check for whether JavaScript is enabled, and every download I get back reports that it is not enabled.
I am doing this under WSL2 (Ubuntu) on a Windows 10 machine. I have tried Selenium, headless Chrome, and axios, and I cannot figure out how to get the page's JavaScript to execute.
Since I want to run this from my crontab, I am not using a GUI.
The website is
https://app.aquahawkami.tech/nfc?imei=359986122021410
Before I start scraping the output, I first need to get a good download, and that is where I am stuck.
Here is the axios attempt (JavaScript):
// index.js
const axios = require('axios');
const fs = require('fs');
axios.get('https://app.aquahawkami.tech/nfc?imei=359986122021410', {responseType: 'document'}).then(response => {
fs.writeFile('./wm.html', response.data, (err) => {
if (err) throw err;
console.log('The file has been saved!');
});
});
Here is the Selenium attempt (Python):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://app.aquahawkami.tech/nfc?imei=359986122021410")
page_source = driver.page_source
print(page_source)
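# Write the rendered source with an explicit encoding; the file closes automatically.
with open("aquahawk_source.html", "w", encoding="utf-8") as file_to_write:
    file_to_write.write(page_source)
# quit() ends the whole browser session and the chromedriver process, not just the window.
driver.quit()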
Finally, headless Chrome:
`google-chrome --headless --disable-gpu --dump-dom 'https://app.aquahawkami.tech/nfc?imei=359986122021410'`
Here's an example of how you can get the data from the API every 6 hours:
async function getMeterData(imei){
/*
this is a template string, it allows constructing/joining strings cleanly
in this case it will insert the function argument `imei` into the string
eg:`https://api.aquahawkami.tech/endpoint?imei=${imei}` --> https://api.aquahawkami.tech/endpoint?imei=359986122021410
more info :
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals
*/
const url = `https://api.aquahawkami.tech/endpoint?imei=${imei}`;
/*
this makes a fetch request (async) to the url.
r.json() is called after the request (promise) fulfills, and .json() itself returns a promise
`await` waits until said promise fulfills (now `data` contains the json object)
more info:
https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/async_function
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise
*/
const data = await fetch(url).then(r => r.json());
/*
console.log() just prints a message to the console, something like python's print()
console.log(data) will print the data (object) to the console.
more info:
https://developer.mozilla.org/en-US/docs/Web/API/console/log_static
*/
console.log(data);
/*
this is object destructuring, it is equivalent to this:
const slp_time = data.attributes.slp_time;
const reading = data.attributes.reading;
const lastUpdateTime = data.lastUpdateTime;
more info:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Destructuring_assignment
*/
const {attributes: {slp_time, reading}, lastUpdateTime} = data;
/*
here I convert `slp_time` from a string to an int (parseInt)
then multiply it by the length of the `reading` array
then multiply by 1000 to convert from seconds to milliseconds (Date and setTimeout use milliseconds)
this is simply trying to "dynamically" calculate the 6 hour interval
if you wish you can replace this line entirely with a hardcoded value
eg: const interval_ms = 21600000;
or: const interval_ms = 6 * 60 * 60 * 1000;
*/
const interval_ms = parseInt(slp_time, 10) * reading.length * 1000;
/*
`new Date(lastUpdateTime)` will create a new Date object from the string `lastUpdateTime` (data.lastUpdateTime)
`.getTime()` will return that date as timestamp (in milliseconds)
by adding `interval_ms` to that last update timestamp we should get the next update timestamp
more info:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date
*/
const nextUpdateTime = new Date(lastUpdateTime).getTime() + interval_ms;
/*
`new Date().getTime()` will return the CURRENT date and time as a timestamp
`nextUpdateTime - new Date().getTime()` calculates the difference between current time and nextUpdateTime
now we know how long we have to wait from NOW until the next update
*/
const wait = nextUpdateTime - new Date().getTime();
/*
`setTimeout()` sets a timer and calls the supplied function when the timer runs out,
it takes a function (to call) and a timeout in milliseconds
in this case it will call this arrow function: () => getMeterData(imei) after `wait` runs out.
so basically `getMeterData` creates a timer to call itself after 6 hours and does that indefinitely
more info:
https://developer.mozilla.org/en-US/docs/Web/API/setTimeout
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/Arrow_functions
*/
setTimeout(() => getMeterData(imei), wait);
}
// here we make the first call to the `getMeterData` function with '359986122021410' as the imei argument
getMeterData('359986122021410');
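One caveat with the timer above: if `lastUpdateTime` is stale or a field is missing, `wait` comes out negative or NaN, and `setTimeout` then fires almost immediately, so the function would keep re-requesting in a tight loop. A minimal sketch of a guard, assuming the same response shape as above; the helper name `computeWaitMs` and the 60-second floor are hypothetical choices, not something the API prescribes:

```javascript
// Hypothetical helper: how long to wait until the next expected update,
// never returning less than `minWaitMs`.
function computeWaitMs(data, nowMs = Date.now(), minWaitMs = 60 * 1000) {
  // Optional chaining keeps missing fields from throwing; they become NaN below.
  const slpTime = data?.attributes?.slp_time;
  const reading = data?.attributes?.reading;
  const intervalMs = parseInt(slpTime, 10) * reading?.length * 1000;
  const nextUpdateMs = new Date(data?.lastUpdateTime).getTime() + intervalMs;
  const waitMs = nextUpdateMs - nowMs;
  // NaN (missing or invalid fields) and short or negative waits fall back to the floor.
  return Number.isFinite(waitMs) && waitMs > minWaitMs ? waitMs : minWaitMs;
}
```

Inside `getMeterData` you could then write `setTimeout(() => getMeterData(imei), computeWaitMs(data));` in place of the separate `interval_ms`, `nextUpdateTime`, and `wait` lines.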
You might prefer to use a scheduler or cron job instead of setTimeout.
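If you go the cron route, the script only needs to fetch once and exit. A minimal sketch, assuming Node 18+ (for the built-in `fetch`) and the same `/endpoint` response shape as above; the file names `meter.js` and `meter.json` and the `summarize` helper are hypothetical:

```javascript
// meter.js (hypothetical file name): fetch the meter data once, then exit.
const fs = require('fs');

// Keep only the fields used above; the response shape is assumed, not documented.
function summarize(data) {
  return {
    lastUpdateTime: data.lastUpdateTime,
    slp_time: data.attributes.slp_time,
    reading: data.attributes.reading,
  };
}

async function main(imei, outFile) {
  const url = `https://api.aquahawkami.tech/endpoint?imei=${encodeURIComponent(imei)}`;
  const response = await fetch(url);
  if (!response.ok) throw new Error(`Request failed with HTTP ${response.status}`);
  const summary = summarize(await response.json());
  fs.writeFileSync(outFile, JSON.stringify(summary, null, 2));
}

// Only run when an IMEI is passed on the command line, e.g.
//   node meter.js 359986122021410
const imei = process.argv[2];
if (imei) {
  main(imei, process.argv[3] || './meter.json').catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
}
```

A crontab line such as `0 */6 * * * /usr/bin/node /path/to/meter.js 359986122021410 /path/to/meter.json` would then run it every six hours; both paths are placeholders for wherever node and the script live on your system.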
There is also this endpoint that has slightly different data: https://api.aquahawkami.tech/meter?meter=83837540
The difference between the two addresses is that the second one, /meter, includes `reads` and `reading` arrays that have slightly different formats (string vs. int & timestamp). It also seems to include all of the data from the first address, /endpoint. The actual values of the readings are the same across arrays/addresses, so you can use whichever one is more convenient.
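If you want to check that for your own meter, one option is to coerce both arrays to numbers before comparing them. A minimal sketch; `toNumbers` and `sameReadings` are hypothetical helpers, and it assumes each entry is either a number or a numeric string (any timestamp fields would have to be stripped out first):

```javascript
// Coerce every entry to a number; numeric strings like "12" become 12.
function toNumbers(values) {
  return values.map((v) => Number(v));
}

// True when both arrays hold the same numeric values in the same order.
// Entries that are not numeric become NaN, which never compares equal.
function sameReadings(a, b) {
  const x = toNumbers(a);
  const y = toNumbers(b);
  return x.length === y.length && x.every((v, i) => v === y[i]);
}
```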