javascript, node.js, json, web-scraping, puppeteer

Navigation Timeout Exceeded scraping table in Puppeteer


I am trying to scrape the very first name in a table on a website that presents a basketball team and that team's players' names and statistics. When I do so, the Navigation Timeout is exceeded, meaning the value was not scraped in the given time, and on the client side "Error loading data" appears. What am I doing wrong?

FYI - There are various debugging statements used that are not essential to the functioning of the code.

Here is my JavaScript code:

const puppeteer = require('puppeteer');
const express = require('express');
const app = express();
app.use(express.static("public"));

app.get('/scrape', async (req, res) => {
  let browser;
  try {
    console.log('Attempting to scrape data...');
    browser = await puppeteer.launch();
    const [page] = await browser.pages();

    // Increase the timeout to 60 seconds
    await page.goto('https://highschoolsports.nj.com/school/livingston-newark-academy/girlsbasketball/season/2022-2023/stats', { timeout: 60000 });

    // Wait for navigation to complete
    await page.waitForNavigation({ timeout: 60000 });

    const firstPlayerName = await page.$eval('tbody tr:first-child .text-left a', player => player.textContent.trim());

    console.log('Scraping successful:', firstPlayerName);

    res.json({ firstPlayerName });
  } catch (err) {
    console.error('Error during scraping:', err);
    res.status(500).json({ error: 'Internal Server Error' });
  } finally {
    await browser?.close();
  }
});

app.listen(3000, () => {
  console.log('Server is running on http://localhost:3000');
});

Here is my HTML code:

<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" href="styles.css">
</head>
<body>
  <table>
    <p class="robo-header">Robo-Scout </p>
    <p class="robo-subheader"><br> Official Algorithmic Bakstball Scout</p>
    <tr>
      <td>
        <p id="myObjValue"> Loading... </p>
        <script>
          fetch('/scrape') // Send a GET request to the server
            .then(response => {
              if (!response.ok) {
                throw new Error('Network response was not ok');
              }
              return response.json();
            })
            .then(data => {
              console.log(data); // Check what data is received
              const myObjValueElement = document.getElementById('myObjValue');
              myObjValueElement.textContent = data.firstPlayerName || 'Player name not found';
            })
            .catch(error => {
              console.error(error);
              const myObjValueElement = document.getElementById('myObjValue');
              myObjValueElement.textContent = 'Error loading data'; // Display an error message
            });
        </script>
      </td>
    </tr>
  </table>
</body>
</html>

Here is the code from the cell of the table I'm trying to scrape:

                                    <td class="text-left">

    <a href="/player/maddie-bulbulia/girlsbasketball/season/2022-2023">Maddie Bulbulia</a> <small class="text-muted">Sophomore • G</small>
</td>

I have tried debugging the code to trace why the value isn't being pulled, by logging when the value is missing and logging the error itself. I have also tried increasing the navigation timeout to 60 seconds rather than 30 in case my network was slow, but nothing changed.


Solution

  • This code looks problematic:

    await page.goto(url, { timeout: 60000 });
    
    // Wait for navigation to complete
    await page.waitForNavigation({ timeout: 60000 });
    

    page.goto() already waits for navigation, so waiting for yet another navigation with page.waitForNavigation() causes a timeout. This is such a common mistake, I have a section about it in my blog post on typical Puppeteer mistakes. The solution is to remove the unnecessary page.waitForNavigation() line.

    Secondly, use page.goto(url, {waitUntil: "domcontentloaded"}) rather than the default "load" event. Some anti-scraping approaches (or poorly-coded pages) seem to defer the load event, causing navigation timeouts. "domcontentloaded" is the fastest approach and almost always preferred.
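
    For reference, here's a minimal sketch of your /scrape handler with just those two changes applied (the waitForNavigation() call removed and waitUntil: "domcontentloaded" added); the rest of your Express setup is assumed to stay the same:

    app.get('/scrape', async (req, res) => {
      let browser;
      try {
        browser = await puppeteer.launch();
        const [page] = await browser.pages();

        // page.goto() already waits for navigation, so no extra waitForNavigation()
        await page.goto(
          'https://highschoolsports.nj.com/school/livingston-newark-academy/girlsbasketball/season/2022-2023/stats',
          { waitUntil: 'domcontentloaded', timeout: 60000 }
        );

        const firstPlayerName = await page.$eval(
          'tbody tr:first-child .text-left a',
          player => player.textContent.trim()
        );
        res.json({ firstPlayerName });
      } catch (err) {
        console.error('Error during scraping:', err);
        res.status(500).json({ error: 'Internal Server Error' });
      } finally {
        await browser?.close();
      }
    });
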

    Going a step further, since the data is baked into the static HTML, you can block all resource requests and disable JS. Here's an optimized script:

    const puppeteer = require("puppeteer"); // ^21.4.1
    
    const url = "<Your URL>";
    
    let browser;
    (async () => {
      browser = await puppeteer.launch({headless: "new"});
      const [page] = await browser.pages();
      await page.setJavaScriptEnabled(false);
      await page.setRequestInterception(true);
      page.on("request", req =>
        req.url() === url ? req.continue() : req.abort()
      );
      await page.goto(url, {waitUntil: "domcontentloaded"});
      const firstPlayerName = await page.$eval(
        "td.text-left a",
        player => player.textContent.trim()
      );
      console.log("Scraping successful:", firstPlayerName);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
    

    Going yet another step further, you may not even need Puppeteer. You can make a request with fetch, native in Node 18+, and parse the data you want from the response with a lightweight library like Cheerio.

    const cheerio = require("cheerio"); // ^1.0.0-rc.12
    
    const url = "<Your URL>";
    
    fetch(url)
      .then(res => {
        if (!res.ok) {
          throw Error(res.statusText);
        }
    
        return res.text();
      })
      .then(html => {
        const $ = cheerio.load(html);
        const firstPlayerName = $("td.text-left a").first().text();
        console.log(firstPlayerName); // => Maddie Bulbulia
      })
      .catch(err => console.error(err));
    

    Here are some quick benchmarks.

    Unoptimized Puppeteer (only using "domcontentloaded"):

    real 0m2.974s
    user 0m1.004s
    sys  0m0.271s
    

    Optimized Puppeteer (using DCL, plus disabling JS and blocking resources):

    real 0m1.190s
    user 0m0.510s
    sys  0m0.114s
    

    Fetch/Cheerio:

    real 0m0.998s
    user 0m0.261s
    sys  0m0.049s
    

    If the scraped data doesn't change often, you might consider caching the results of the scrape periodically so you can serve it up to your users instantly and more reliably.
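
    For example, here's a rough sketch of what that could look like with a simple in-memory cache wired into the Express route, reusing the fetch/Cheerio approach above. The 10-minute TTL is an arbitrary assumption you'd tune to how often the stats actually change, and `app` is the Express app from your original code:

    const cheerio = require("cheerio");

    const url = "<Your URL>";

    // Hypothetical in-memory cache; for anything serious you might persist this instead
    let cache = { value: null, fetchedAt: 0 };
    const TTL_MS = 10 * 60 * 1000; // re-scrape at most every 10 minutes (arbitrary)

    app.get("/scrape", async (req, res) => {
      try {
        // Serve the cached value instantly if it's still fresh
        if (cache.value && Date.now() - cache.fetchedAt < TTL_MS) {
          return res.json({ firstPlayerName: cache.value });
        }

        const response = await fetch(url);
        if (!response.ok) {
          throw Error(response.statusText);
        }
        const $ = cheerio.load(await response.text());
        const firstPlayerName = $("td.text-left a").first().text();

        cache = { value: firstPlayerName, fetchedAt: Date.now() };
        res.json({ firstPlayerName });
      } catch (err) {
        console.error(err);
        res.status(500).json({ error: "Internal Server Error" });
      }
    });
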

    Disclosure: I'm the author of the linked blog post.