javascript selenium-webdriver web-scraping google-chrome-headless

Not able to download a web page that uses JavaScript


I have been trying to download a web page that I ultimately intend to scrape. The page uses JavaScript and includes a check for whether JavaScript is enabled, and every download I get back says it is not enabled.

I am doing this under WSL2 (Ubuntu) on a Windows 10 machine. I have tried Selenium, headless Chrome, and axios, and I cannot figure out how to get any of them to execute the JavaScript.

As I want to put this into my crontab, I am not using any GUI.

The website is

https://app.aquahawkami.tech/nfc?imei=359986122021410

Before I start to scrape the output, I figure I have to first get a good download, and that is where I am stuck.

Here is the JavaScript (axios) attempt:

// index.js

const axios = require('axios');
const fs = require('fs');
axios.get('https://app.aquahawkami.tech/nfc?imei=359986122021410', {responseType: 'document'}).then(response => {
  fs.writeFile('./wm.html', response.data, (err) => {
    if (err) throw err;
    console.log('The file has been saved!');
  });
});

Here is the Selenium attempt (Python):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

driver.get("https://app.aquahawkami.tech/nfc?imei=359986122021410")

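page_source = driver.page_source
print(page_source)
# Write the page source to disk; the with-block closes the file for us
with open("aquahawk_source.html", "w", encoding="utf-8") as file_to_write:
    file_to_write.write(page_source)
# quit() ends the whole session and stops chromedriver; close() only closes the window
driver.quit()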

Finally, headless Chrome:

google-chrome --headless --disable-gpu --dump-dom 'https://app.aquahawkami.tech/nfc?imei=359986122021410'


Solution

  • You don't need to render the page at all: the data is available from a JSON API. Here's an example of how you can get the data from the API every 6 hours:

    async function getMeterData(imei){
      /* 
      this is a template string, it allows constructing/joining strings cleanly
      in this case it will insert the function argument `imei` into the string
      eg:`https://api.aquahawkami.tech/endpoint?imei=${imei}` --> https://api.aquahawkami.tech/endpoint?imei=359986122021410
      
      more info : 
        https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals
      */
      const url = `https://api.aquahawkami.tech/endpoint?imei=${imei}`;
       
      /* 
      this makes a fetch request (async) to the url.
      r.json() is called after the request (promise) fulfills, and .json() itself returns a promise
      `await` waits until said promise fulfills (now `data` contains the json object)
      
      more info: 
        https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch
        https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/async_function
        https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise
      */
      const data = await fetch(url).then(r => r.json());
    
      /*
      console.log() just prints a message to the console, something like python's print()
      console.log(data) will print the data (object) to the console.
    
      more info:
        https://developer.mozilla.org/en-US/docs/Web/API/console/log_static
      */
      console.log(data);
    
      /*
      this is object destructuring, it is equivalent to this:
      const slp_time = data.attributes.slp_time;
      const reading = data.attributes.reading;
      const lastUpdateTime = data.lastUpdateTime;
      
      more info:
        https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Destructuring_assignment
      */
      const {attributes: {slp_time, reading}, lastUpdateTime} = data;
    
      /*
      here I convert `slp_time` from a string to an int (parseInt)
      then multiply it by the length of the `reading` array 
      then multiply by 1000 to convert from seconds to milliseconds (Date and setTimeout use milliseconds)
    
      this is simply trying to "dynamically" calculate the 6 hour interval
      if you wish you can replace this line entirely with a hardcoded value
      eg: const interval_ms = 21600000;
      or: const interval_ms = 6 * 60 * 60 * 1000;
      */
      const interval_ms = parseInt(slp_time, 10) * reading.length * 1000;
      
      /*
      `new Date(lastUpdateTime)` will create a new Date object from the string `lastUpdateTime` (data.lastUpdateTime)
      `.getTime()` will return that date as timestamp (in milliseconds)
      by adding `interval_ms` to that last update timestamp we should get the next update timestamp
    
      more info:
        https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date
      */
      const nextUpdateTime = new Date(lastUpdateTime).getTime() + interval_ms;
    
      /*
      `new Date().getTime()` will return the CURRENT date and time as a timestamp
      `nextUpdateTime - new Date().getTime()` calculates the difference between current time and nextUpdateTime
      now we know how long we have to wait from NOW until the next update
      */
      const wait = nextUpdateTime - new Date().getTime();
    
      /*
      `setTimeout()` sets a timer and calls the supplied function when the timer runs out,
      it takes a function (to call) and a timeout in milliseconds
      in this case it will call this arrow function: () => getMeterData(imei) after `wait` runs out.
      so basically `getMeterData` creates a timer to call itself after 6 hours and does that indefinitely
    
      more info:
        https://developer.mozilla.org/en-US/docs/Web/API/setTimeout
        https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/Arrow_functions
      */
      setTimeout(() => getMeterData(imei), wait);
    }
    
    // here we make the first call to the `getMeterData` function with '359986122021410' as the imei argument
    getMeterData('359986122021410');
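
One thing to watch in the timer arithmetic above: if the server has not published a new reading yet, `wait` comes out negative, and setTimeout treats a negative delay as "run as soon as possible", so the function would call the API again almost immediately and keep doing so. A minimal sketch of a guard follows. The helper name `computeWaitMs` and the 5 minute floor in `MIN_WAIT_MS` are my own assumptions, not anything the API specifies.

```javascript
// Smallest delay we are willing to schedule, so a late server update
// cannot turn the timer into a tight request loop (assumed value: 5 minutes).
const MIN_WAIT_MS = 5 * 60 * 1000;

/**
 * Work out how long to wait before the next request.
 * lastUpdateTime: date string from the API response
 * intervalMs: expected gap between updates, in milliseconds
 * nowMs: current time in milliseconds (injectable for testing)
 */
function computeWaitMs(lastUpdateTime, intervalMs, nowMs = Date.now()) {
  const lastMs = new Date(lastUpdateTime).getTime();
  // An unparseable date or interval yields NaN; fall back to the minimum delay.
  if (!Number.isFinite(lastMs) || !Number.isFinite(intervalMs)) {
    return MIN_WAIT_MS;
  }
  const wait = lastMs + intervalMs - nowMs;
  // Never return less than the minimum, even when the next update is overdue.
  return Math.max(wait, MIN_WAIT_MS);
}
```

Inside getMeterData, the last line would then become something like `setTimeout(() => getMeterData(imei), computeWaitMs(lastUpdateTime, interval_ms));`.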
    

    You might prefer to use a scheduler/cronjob instead of setTimeout.

    There is also this endpoint that has slightly different data: https://api.aquahawkami.tech/meter?meter=83837540

    The difference between the two addresses is that the second one, /meter, includes reads and reading arrays with slightly different formats (string vs. int and timestamp), and it also seems to include all of the data from the first address, /endpoint. The actual reading values are the same across both arrays and addresses, so you can use whichever one is more convenient.
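
If you go the scheduler route mentioned above instead of setTimeout, the script only needs to fetch once, save the result, and exit. Here is a rough sketch of that, assuming Node.js 18 or newer (for the built-in fetch). The file name `fetch-meter.js`, the helper names, and the output path are placeholders I made up for illustration.

```javascript
// fetch-meter.js: fetch one snapshot of the meter data and save it as JSON.
// Assumes Node.js 18+ (global fetch). File names and paths are placeholders.
const fs = require('fs');

// Build the /endpoint API address used in the answer for a given IMEI.
function buildUrl(imei) {
  return `https://api.aquahawkami.tech/endpoint?imei=${encodeURIComponent(imei)}`;
}

// Fetch the JSON once. fetchImpl is injectable so the function can be
// exercised without touching the network.
async function fetchMeterData(imei, fetchImpl = fetch) {
  const response = await fetchImpl(buildUrl(imei));
  if (!response.ok) {
    throw new Error(`Request failed with HTTP status ${response.status}`);
  }
  return response.json();
}

// Fetch the data and write it to outPath, pretty-printed.
async function saveMeterData(imei, outPath, fetchImpl = fetch) {
  const data = await fetchMeterData(imei, fetchImpl);
  await fs.promises.writeFile(outPath, JSON.stringify(data, null, 2));
  return data;
}

// Runs only when an IMEI is given on the command line, e.g.
//   node fetch-meter.js 359986122021410 ./meter.json
if (process.argv[2]) {
  saveMeterData(process.argv[2], process.argv[3] || './meter.json')
    .then(() => console.log('Saved meter data'))
    .catch((err) => {
      console.error(err);
      process.exitCode = 1; // lets cron see that the run failed
    });
}
```

A crontab entry along these lines would run it every 6 hours, with the paths adjusted to your setup: `0 */6 * * * /usr/bin/node /path/to/fetch-meter.js 359986122021410 /path/to/meter.json`. Since the script exits after each run, there is no long-lived timer to keep alive.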