javascript, web-scraping, puppeteer

Puppeteer is not scraping the last page


I'm scraping news articles for my project using Puppeteer, but I'm unable to scrape the last page. There are 10 pages and each page has 10 links (100 articles in total). However, I've noticed that sometimes it only scrapes data from 39 articles, and other times it scrapes around 90. I'm not sure why this happens.

Here's the code I'm using:

    await page.goto(url, { timeout: 90000 })

    const result: Result[] = []

    await page.waitForSelector('div.gsc-cursor-page', { timeout: 90000 })

    let pageElements = await page.$$("div.gsc-cursor-page")

    await page.waitForSelector("div.gsc-resultsbox-visible", { timeout: 90000 })
    
    for(let i = 0; i < pageElements.length; i++){
        const pageElement = pageElements[i]

        // Click page element only if it's not the first page

        if(i !== 0){
            await page.evaluate((el) => {
                el.click()
            }, pageElement)

           // Wait for content to load after page navigation
            
            await page.waitForSelector("div.gsc-resultsbox-visible", { timeout: 90000 }).catch(err => {
                return
            })
        }


        // Re-fetch the page elements after navigation

        pageElements = await page.$$("div.gsc-cursor-page")   

        // Extract article links from the page

        let elements: ElementHandle<HTMLAnchorElement>[] = await page.$$("div.gsc-resultsbox-visible > div > div div > div.gsc-thumbnail-inside > div > a")

        for (const element of elements) {
            try {
                const link = await page.evaluate((el: HTMLAnchorElement) => el.href, element)
    

                // Open article page and scrape data

                const articlePage = await browser.newPage()
                await articlePage.goto(link, { waitUntil: 'load', timeout: 90000 })
                await articlePage.waitForSelector("h1.title", { timeout: 90000 }).catch(err => {
                    return
                })
    
                const title = await articlePage.$eval("h1.title", (element) => element.textContent.trim())
                const body = await articlePage.$$eval("div.articlebodycontent p", (elements) =>
                    elements.map((p) => p.textContent.replace(/\n/g, " ").replace(/\s+/g, " "))
                )
                result.push({ title, content: body.join(" ") })
                await articlePage.close()
            }
            catch (error) {
                console.log("Error extracting article:", error)
            }
        }
    }

    return { searchQuery, length: result.length, result }
}

How can I fix this issue?


Solution

  • If you examine the site's network requests, you'll see it makes a call to a third-party Google search API:

    Screenshot of network tab showing network search results for article title

    If you make that same request as the website, you can avoid the pain of automating it entirely:

    const makeUrl = offset =>
      `https://cse.google.com/cse/element/v1?rsz=filtered_cse&num=20&start=${offset}&hl=en&source=gcsc&cselibv=75c56d121cde450a&cx=264d7caeb1ba04bfc&q=${encodeURIComponent(searchQuery)}&safe=active&cse_tok=AB-tC_51w3gnpTUdkduvVczddH5_%3A1743226475620&lr=&cr=&gl=&filter=0&sort=&as_oq=&as_sitesearch=&exp=cc&callback=google.search.cse.api15447&rurl=https%3A%2F%2Fwww.thehindu.com%2Fsearch%2F%23gsc.tab%3D0%26gsc.q%3D${encodeURIComponent(searchQuery)}%26gsc.sort%3D`;
    
    (async () => {
      const results = [];
      const userAgent =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36";
    
      for (let page = 0; page < 10; page++) {
        const res = await fetch(makeUrl(page * 20), {
          headers: {"User-Agent": userAgent},
        });
    
        if (!res.ok) {
          break;
        }
    
        const text = await res.text();
        const startIndex = text.indexOf("{");
        const endIndex = text.lastIndexOf(")");
        const json = text.slice(startIndex, endIndex);
        results.push(...JSON.parse(json).results);
      }
    
      console.log(results.map(e => e.title));
    })();
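
    A couple of notes on the snippet above: searchQuery (here and in the Puppeteer snippet below) is the same search term used in the question's code, and the endpoint returns JSONP (note the callback=google.search.cse.api15447 parameter), so slicing between the first "{" and the last ")" strips the callback wrapper before JSON.parse. The num and start query parameters control the page size and result offset.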
    

    But be careful with this, since Google can block you pretty easily: you'll probably want to add a user agent, a proxy, or request throttling (a minimal throttling sketch is included at the end of this answer). If you are detected, Puppeteer becomes useful again, but you can still keep the philosophy of avoiding the DOM and focus on intercepting those responses:

    const fs = require("node:fs/promises");
    const puppeteer = require("puppeteer"); // ^24.4.0
    
    let browser;
    (async () => {
      browser = await puppeteer.launch({headless: false});
      const [page] = await browser.pages();
      const ua =
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
      await page.setUserAgent(ua);
      const results = [];
    
      for (let pg = 1; pg <= 10; pg++) {
        try {
          const responseArrived = page.waitForResponse(async res =>
            res.request().url().includes("google.search.cse") &&
            (await res.text()).includes("total_results")
          );
          await page.goto(
            `https://www.thehindu.com/search/#gsc.tab=0&gsc.q=${encodeURIComponent(searchQuery)}&gsc.sort=&gsc.page=${pg}`,
            {waitUntil: "domcontentloaded"}
          );
          const response = await responseArrived;
          const text = await response.text();
          const startIndex = text.indexOf("{");
          const endIndex = text.lastIndexOf(")");
          const json = text.slice(startIndex, endIndex);
          results.push(...JSON.parse(json).results);
        }
        catch {
          break;
        }
      }
    
      await fs.writeFile(
        "results.json",
        JSON.stringify(results, null, 2)
      );
      console.log(results.map(e => e.title));
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
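
    Finally, here's a minimal sketch of the "throttle requests" suggestion from earlier. It reuses makeUrl from the first snippet and simply inserts a fixed delay between calls; the sleep helper and the 2-second interval are arbitrary choices for illustration, not something the API requires:

    // Resolves after the given number of milliseconds.
    const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

    (async () => {
      const results = [];
      const userAgent =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36";

      for (let page = 0; page < 10; page++) {
        // Pause between requests so they don't arrive in a rapid burst.
        if (page > 0) {
          await sleep(2000);
        }

        const res = await fetch(makeUrl(page * 20), {
          headers: {"User-Agent": userAgent},
        });

        if (!res.ok) {
          break;
        }

        // Strip the JSONP callback wrapper, then parse as before.
        const text = await res.text();
        const json = text.slice(text.indexOf("{"), text.lastIndexOf(")"));
        results.push(...JSON.parse(json).results);
      }

      console.log(results.map(e => e.title));
    })();

    A fixed delay is the simplest option; randomizing the interval or rotating proxies are the usual next steps if you still get blocked.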