Tags: javascript, node.js, web-scraping, cheerio

Empty result scraping site with Fetch and Cheerio


Out of curiosity, I decided to collect some data from the site for myself (hotel name, price per night, rating), but I've run into something I don't understand: I get nothing in the output. I rewrote it with other libraries, but people say this one is better.

const cheerio = require("cheerio"); 
let fs = require('fs');
const base = "https://ostrovok.ru/hotel/russia/adler/";

(async () => {
  let url = "?page=1";
  const data = [];

  for (let i = 0; i < 176; i++) {
    try {
      console.log(base + url);
      const res = await fetch(base + url);

      if (!res.ok) {
        break;
      }

      const $ = cheerio.load(await res.text());
      const chunk = [...$("")].map(e =>
        $(e).text().trim()
      );
      data.push(chunk);
      url = $("#__next > div > div:nth-child(2) > div > div > div.Layout_content__9ap_g > div:nth-child(3) > div > div.HotelCard_headerArea__hlQPk > div > div.HotelCard_mainInfo__pNKYU > div.HotelCard_wrapTitle__t742O > h2 > a").attr("TEXT");
    }
    catch (err) {
      console.error(err);
      break;
    }
  }

  console.log(JSON.stringify(data, null, 2));

  fs.writeFile('numbers.txt', data.join('\n'), function(err) {
    if (err) {
      console.log(err);
    }
  });

})();

I was expecting to see a list of data, but I got [].


Solution

  • base + url always uses "?page=1". Try interpolating the index variable in: `${base}?page=${i}`.

    .attr("TEXT") looks incorrect. I assume you want all 20 hotel names on each page, so use [...$("...")].map(e => $(e).text()) to collect each name as a separate array element.

  • As for the selector: long, browser-generated, ultra-rigid selectors are prone to error; if any assumption in that chain changes, the whole thing breaks. It's safer to use ".HotelCard_title__cpfvk", which is exactly enough to identify the element you want, nothing more.

  • !res.ok isn't enough to determine when the pagination ends. Break when the result list is empty.

    Putting it together:

    const cheerio = require("cheerio"); // ^1.0.0-rc.12
    const {writeFile} = require("node:fs/promises");

    const url = "<Your URL>";

    (async () => {
      const data = [];

      // 1000 is an arbitrary upper bound; the loop breaks as soon as a
      // request fails or a page comes back empty
      for (let i = 1; i <= 1000; i++) {
        const res = await fetch(`${url}?page=${i}`);

        if (!res.ok) {
          break;
        }

        const $ = cheerio.load(await res.text());
        const chunk = [...$(".HotelCard_title__cpfvk")]
          .map(e => $(e).text());

        // an empty page means the pagination has run out
        if (!chunk.length) {
          break;
        }

        data.push(...chunk);
      }

      console.log(data);
      await writeFile("numbers.txt", JSON.stringify(data));
    })();
    

    This takes a while to run, so you could parallelize requests (at the risk of angering the server), as sketched below, or simply add some logs to ensure each chunk is coming through OK.
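
    For example, here's a minimal batched sketch (not a drop-in: BATCH_SIZE, the 1000-page cap, and the log line are illustrative, and concurrent requests may get you rate-limited or blocked):

    const cheerio = require("cheerio"); // ^1.0.0-rc.12

    const url = "<Your URL>";
    const BATCH_SIZE = 5; // keep small to avoid hammering the server

    (async () => {
      const data = [];

      for (let page = 1; page <= 1000; page += BATCH_SIZE) {
        // request a small window of pages concurrently
        const batch = await Promise.all(
          [...Array(BATCH_SIZE)].map(async (_, i) => {
            const res = await fetch(`${url}?page=${page + i}`);

            if (!res.ok) {
              return [];
            }

            const $ = cheerio.load(await res.text());
            return [...$(".HotelCard_title__cpfvk")].map(e => $(e).text());
          })
        );
        const chunk = batch.flat();
        console.log(`pages ${page}-${page + BATCH_SIZE - 1}: ${chunk.length} results`);

        // stop once an entire batch comes back empty
        if (!chunk.length) {
          break;
        }

        data.push(...chunk);
      }

      console.log(data.length);
    })();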

    To get the other fields you want, you can modify the script as follows:

    const chunk = [...$('[data-testid="serp-hotelcard"]')]
      .map(e => ({
        name: $(e).find('[class*="HotelCard_title"]').text(),
        price: $(e).find('[class*="HotelCard_ratePriceValue"]').text(),
        // the rating isn't rendered as text; it's encoded in a class name
        // containing e.g. "_value_45_", so extract the digits and dot
        // them into "4.5"
        rating: $(e).find('[class*="TripAdvisor_tripAdvisor_value"]')
          .first()
          .attr("class")
          ?.split(/\s+/)
          .find(cls => cls.includes("TripAdvisor_tripAdvisor_value"))
          .match(/_value_(\d+)_/)[1]
          .split("")
          .join("."),
      }));
    

    Note that I've loosened some selectors to match substrings, which avoids breakage if a generated-looking suffix like "cpfvk" in ".HotelCard_title__cpfvk" ever changes.
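
    If it helps, here's that object-based chunk dropped into the same pagination loop (a sketch: "hotels.json" is just an example filename, and I've kept only name and price for brevity; the rating extraction above slots in the same way):

    const cheerio = require("cheerio"); // ^1.0.0-rc.12
    const {writeFile} = require("node:fs/promises");

    const url = "<Your URL>";

    (async () => {
      const data = [];

      for (let i = 1; i <= 1000; i++) {
        const res = await fetch(`${url}?page=${i}`);

        if (!res.ok) {
          break;
        }

        const $ = cheerio.load(await res.text());
        const chunk = [...$('[data-testid="serp-hotelcard"]')]
          .map(e => ({
            name: $(e).find('[class*="HotelCard_title"]').text(),
            price: $(e).find('[class*="HotelCard_ratePriceValue"]').text(),
          }));

        if (!chunk.length) {
          break;
        }

        data.push(...chunk);
      }

      // data holds objects now, so pretty-printed JSON is a natural format
      await writeFile("hotels.json", JSON.stringify(data, null, 2));
    })();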

    Disclosure: I'm the author of the linked blog post.