javascript web-scraping puppeteer skyscanner

JavaScript Iterating List of Objects


I'm writing a scraper for Skyscanner just for fun. I'm trying to iterate through all the listings and, for each listing, extract the URL.


So far I can select the listings container with $("div[class^='FlightsResults_dayViewItems']"), but I'm not sure how to iterate through the returned object and get the URL (/transport/flight/bos...) for each listing. The pseudocode I have is:

for(listings in $("div[class^='FlightsResults_dayViewItems']")) {
     go to class^='EcoTicketWrapper_itineraryContainer' 
          go to class^='FlightsTicket_container'
               go to class^='FlightsTicket_link' and get the href and save in an array
}

How would I go about doing this? Side-note, I'm using cheerio and jquery.

Update: I figured out the CSS selector is

$("div[class^='FlightsResults_dayViewItems'] > div:nth-child(at_index_i) > div[class^='EcoTicketWrapper_itineraryContainer'] > div[class^='FlightsTicket_container'] > a[class^='FlightsTicket_link']").href

Now, I'm trying to figure out how to loop through the listing and apply the selector for each listing in the loop.

Also, it seems that leaving out div:nth-child(at_index_i) doesn't work. Is there a way around this? With the index the selector returns an href; without it the result is undefined:

$("div[class^='FlightsResults_dayViewItems'] > div:nth-child(3) > div[class^='EcoTicketWrapper_itineraryContainer'] > div[class^='FlightsTicket_container'] > [class^='FlightsTicket_link']").attr("href")

"/transport/flights/bos/cun/210301/210331/config/10081-2103010815--32733-0-10803-2103011250|10803-2103311225--31722-1-10081-2103312125?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&destinationentityid=27540602&inboundaltsenabled=false&infants=0&originentityid=27539525&outboundaltsenabled=false&preferdirects=false&preferflexible=false&ref=home&rtn=1"


$("div[class^='FlightsResults_dayViewItems'] > div[class^='EcoTicketWrapper_itineraryContainer'] > div[class^='FlightsTicket_container'] > [class^='FlightsTicket_link']").attr("href")

undefined

Here's the function to iterate the listings and grab the URLs for each listing.

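const fs = require("fs");
const cheerio = require("cheerio");

// Assumes `browser` was created elsewhere with puppeteer.launch().
async function scrapeListingUrl(listingURL) {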
  try {
    const page = await browser.newPage();
    await page.goto(listingURL, { waitUntil: "networkidle2" });
    // await page.waitForNavigation({ waitUntil: "networkidle2" }); // Wait until page is finished loading before navigating
    console.log("Finished loading page.");

    const html = await page.evaluate(() => document.body.innerHTML);
    fs.writeFileSync("./listing.html", html);

    const $ = cheerio.load(html); // Load the HTML into cheerio for jQuery-style querying (load is synchronous)

    // Collect the href of every flight link.
    // Note: [class*="..."] is a substring attribute selector (not a regex); it is needed because the actual class name has a generated string appended to the end.
    const bookingURLs = $('a[class*="FlightsTicket_link"]')
      .map((i, elem) => $(elem).attr("href")) // cheerio nodes have no .href property, so read the attribute
      .get();

    console.log(bookingURLs);
    return bookingURLs;
  } catch (error) {
    console.log("Scrape flight url failed.");
    console.log(error);
  }
}

Solution

  • Using map()

    const hrefs = $(selector).map((i, elem) => elem.href).get()
    

    Looking at your code, you are not running jQuery in the page, so the snippet above does not work as is. Instead, use a basic attribute selector that matches part of the class name with querySelectorAll, then use map to collect the hrefs.

    const links = [...document.querySelectorAll('a[class*="FlightsTicket_link"]')]
        .map(l=>l.href)
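
Note that `document` only exists in the browser, so the querySelectorAll snippet has to run in the page context rather than in Node. A minimal sketch of one way to do that with Puppeteer's `page.$$eval`, assuming `page` is a Puppeteer page that has already loaded the results; the function name `collectBookingUrls` is hypothetical.

```javascript
// `page` is assumed to be a Puppeteer Page that has already navigated
// to the results (for example via page.goto in the question's function).
async function collectBookingUrls(page) {
  // $$eval runs querySelectorAll in the page and passes the matched
  // elements to the callback, which also executes inside the browser.
  // There, each anchor's `href` property is the resolved absolute URL.
  return page.$$eval('a[class*="FlightsTicket_link"]', (anchors) =>
    anchors.map((a) => a.href)
  );
}
```

Inside the question's function this would replace the cheerio step, for example `const bookingURLs = await collectBookingUrls(page);`.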