google-apps-script · web-scraping · css-selectors · cheerio

Can't fetch the next page link using a CSS selector within Apps Script


I'm trying to scrape the next page link from this webpage using a CSS selector within Apps Script, but I always get undefined as a result, even though my selector appears to be correct.

function fetchInformation() {
  const url = 'https://www.yellowpages.ca/search/si/1/window/Vancouver+BC';
  const userAgent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36';

  const getOptions = {
    'method': 'GET',
    'headers': {
      'User-Agent': userAgent
    },
    'muteHttpExceptions': true,
  };

  const response = UrlFetchApp.fetch(url, getOptions);
  console.log(response.getResponseCode());
  const $ = Cheerio.load(response.getContentText()); // Cheerio library is added to the project
  const nextPage = $("a[data-analytics*='load_more'].pageButton").first().attr('href');
  console.log(nextPage); // always logs undefined
}

For comparison, this Python script built on the requests module gets the link without any issue:

import requests
from bs4 import BeautifulSoup

link = 'https://www.yellowpages.ca/search/si/1/window/Vancouver+BC'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    print(soup.select_one("a[data-analytics*='load_more'].pageButton")['href'])

Output:

/search/si/2/window/Vancouver+BC

How can I get the next page link using a CSS selector within Apps Script?


Solution

  • For whatever reason, the element your selector targets isn't present in the response returned to GAS. The server apparently serves up different versions of the site depending on whether the request comes from Python/Node on the one hand, or from GAS on the other.

    I'm not sure what makes GAS requests special, but GAS is unusual in a number of respects, so this doesn't come as a surprise. I'm not deeply familiar with GAS myself (I only get pulled into it here and there for web scraping questions), so perhaps a GAS expert can explain why it receives a different response than a standard Node/Python script.
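
    One quick way to confirm this (a minimal diagnostic sketch; checkResponse is a hypothetical name, not part of the original workflow) is to test whether the substring your selector relies on is present in the HTML that GAS actually receives:

    function checkResponse() { // hypothetical helper, just for debugging
      const url = "https://www.yellowpages.ca/search/si/1/window/Vancouver+BC";
      const response = UrlFetchApp.fetch(url, { muteHttpExceptions: true });
      // Logs false if the markup served to GAS doesn't contain the element that
      // a[data-analytics*='load_more'].pageButton is supposed to match.
      console.log(response.getContentText().includes("load_more"));
    }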

    In any case, I used DriveApp.createFile("test.html", response.getContentText()); to dump the response HTML to a file (to get around log truncation). From that dump I noticed that the load_more substring wasn't present, but [data-ajaxpage] was, so I used that instead:

    function myFunction() { // default GAS function name
      const url = "https://www.yellowpages.ca/search/si/1/window/Vancouver+BC";
      const response = UrlFetchApp.fetch(url);
      // DriveApp.createFile("test.html", response.getContentText()); // to debug
      const $ = Cheerio.load(response.getContentText());
      const nextPage = $("[data-ajaxpage]").first().attr("href");
      console.log(nextPage);
    }
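
    Note that, just as in the Python output, the href comes back relative (e.g. /search/si/2/window/Vancouver+BC), so to actually fetch that page you'd prepend the site origin first. A minimal sketch, assuming the origin stays https://www.yellowpages.ca (fetchNextPage is a hypothetical name):

    function fetchNextPage() {
      const origin = "https://www.yellowpages.ca"; // assumed constant for this site
      const response = UrlFetchApp.fetch(origin + "/search/si/1/window/Vancouver+BC");
      const $ = Cheerio.load(response.getContentText());
      const nextPath = $("[data-ajaxpage]").first().attr("href"); // e.g. /search/si/2/window/Vancouver+BC
      if (nextPath) {
        const nextResponse = UrlFetchApp.fetch(origin + nextPath); // resolve the relative href
        console.log(nextResponse.getResponseCode());
      }
    }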
    

    That said, the URL pattern appears to be https://www.yellowpages.ca/search/si/{page}/window/Vancouver+BC, so rather than dealing with scraping hrefs, you could just use a loop and interpolate the page number ${i} into each URL:

    for (let i = 1; i <= 10; i++) { // pages appear to start at 1 (si/1, si/2, ...)
      const url = `https://www.yellowpages.ca/search/si/${i}/window/Vancouver+BC`;
      // make your request with `url` and scrape it
    }
    

    I picked 10 arbitrarily, but you can loop as long as you want; when a request returns a 404 or a page with no results (depending on the site's specific behavior), you've reached the end.
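
    Putting that together, here's a minimal sketch of such a loop (scrapeAllPages is a hypothetical name; it assumes a non-200 status or a missing [data-ajaxpage] element means you've run out of pages, so adjust the stop condition to whatever the site actually does):

    function scrapeAllPages() {
      for (let i = 1; i <= 50; i++) { // arbitrary safety cap
        const url = `https://www.yellowpages.ca/search/si/${i}/window/Vancouver+BC`;
        const response = UrlFetchApp.fetch(url, { muteHttpExceptions: true });
        if (response.getResponseCode() !== 200) break; // e.g. a 404 past the last page
        const $ = Cheerio.load(response.getContentText());
        // ... scrape whatever you need from this page here ...
        if (!$("[data-ajaxpage]").length) break; // no next-page marker: this was the last page
      }
    }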