Tags: javascript, web-scraping, puppeteer

How to automatically scrape table rows for specific columns with Puppeteer?


I'm working on a personal project that gathers text and numeric data from Wikipedia with a scraper, stores it in a database, and then compares the gathered values to build a visual representation.

But I'm having trouble selecting the values I want to gather, specifically from HTML tables. I'd like to select all of the data rows, but only for specific columns.

For example I have a table like this:

Column1       column2    column3
rowdata1      rowdata1   rowdata1
rowdata2      rowdata2   rowdata2

And I want to make it look like this:

Column1       column3 
rowdata1      rowdata1   
rowdata2      rowdata2   

Without the second column and its cells, for example. Is there a simple, straightforward way to do this? Manually picking names and numbers with XPath is going to take ages. Here is an example of my current code:

const puppeteer = require('puppeteer');

async function scrapewiki(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);

    const [el] = await page.$x('/html/body/div[3]/div[3]/div[5]/div[1]/table/tbody/tr[2]/td[1]/a');
    const txt = await el.getProperty('textContent');
    const country = await txt.jsonValue();

    const [el2] = await page.$x('/html/body/div[3]/div[3]/div[5]/div[1]/table/tbody/tr[2]/td[3]');
    const txt2 = await el2.getProperty('textContent');
    const population = await txt2.jsonValue();

    console.log(country, population);
    await browser.close();
}

Solution

  • First of all, please avoid using browser-generated CSS selectors and XPaths. They're handy from time to time, but almost always suboptimal. Looking at the table, there's a much cleaner way to identify the data you want: <table class="wikitable sortable">. The table has <tr> elements for each row, with <td> cells in each row: a standard tabular setup.

    The CSS selector table.wikitable.sortable tr (I'm assuming it's the only such table on the page) gives all of the rows; for each row, extract the cells with the selector td, giving a two-dimensional array. .slice(2) strips out the header rows.

    Secondly, and this is somewhat a matter of personal opinion: I almost always use CSS selectors unless I have to use XPath, or it's a special case where XPath is cleaner. The XPath syntax is clunkier than CSS.

    const puppeteer = require("puppeteer"); // 15.4.0
    
    let browser;
    (async () => {
      browser = await puppeteer.launch({headless: true});
      const [page] = await browser.pages();
      const url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population";
      await page.goto(url, {waitUntil: "domcontentloaded"});
      const tableSel = "table.wikitable.sortable tr";
      const data = await page.$$eval(tableSel, els =>
        els.slice(2).map(el =>
          [...el.querySelectorAll("td")]
            .map(e => e.textContent.trim())
        )
      );
      const nameAndPop = data.map(e => [e[0], e[2]]);
      console.table(nameAndPop.slice(0, 10));
      console.log("total rows", data.length);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close())
    ;
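    The column-picking step above (grabbing indices 0 and 2 from each row) generalizes to a small helper. A minimal sketch, where pickColumns is a hypothetical name of mine, not a library API:

```javascript
// Select only the given column indices from a 2-d array of rows.
// `indices` uses 0-based column positions (hypothetical helper, not part of Puppeteer).
function pickColumns(rows, indices) {
  return rows.map(row => indices.map(i => row[i]));
}

const rows = [
  ["China", "Asia", "1,412,600,000"],
  ["India", "Asia", "1,375,586,000"],
];
console.log(pickColumns(rows, [0, 2]));
// [["China", "1,412,600,000"], ["India", "1,375,586,000"]]
```

    With this, the nameAndPop line becomes pickColumns(data, [0, 2]), and swapping which columns you keep is a one-argument change.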
    

    Finally, you don't need Puppeteer for this. Prefer the Wikipedia API, or else a simple, lightweight HTTP request (fetch or axios) combined with an HTML-parsing library like Cheerio. Consider the Puppeteer code above for educational purposes only; it's not the best way to scrape a static page.
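    As a sketch of the API route: the MediaWiki action API's action=parse endpoint returns the rendered page HTML, which you can then feed to Cheerio exactly like the fetched page below. The buildParseUrl helper name is mine; the query parameters are standard MediaWiki ones:

```javascript
// Build a MediaWiki action API URL that returns the rendered HTML of a page.
// formatversion=2 gives a cleaner JSON shape: {parse: {text: "<html...>"}}.
function buildParseUrl(title) {
  const params = new URLSearchParams({
    action: "parse",
    page: title,
    prop: "text",
    format: "json",
    formatversion: "2",
  });
  return `https://en.wikipedia.org/w/api.php?${params}`;
}

// Usage with Node 18's global fetch (not executed here):
// const res = await fetch(buildParseUrl("List_of_countries_and_dependencies_by_population"));
// const {parse: {text}} = await res.json();
// const $ = cheerio.load(text); // then select table.wikitable.sortable tr as below
```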

    Consider Cheerio with fetch on Node 18:

    const cheerio = require("cheerio"); // 1.0.0-rc.12
    
    fetch("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")
      .then(res => res.text())
      .then(text => {
        const $ = cheerio.load(text);
        const rows = [];
        $("table.wikitable.sortable tr").slice(2).each(function (i, e) {
          const row = [];
          rows.push(row);
          $(this).find("td").each(function (i, e) {
            row.push($(this).text().trim());
          });
        });
        const nameAndPop = rows.map(e => [e[0], e[2]]);
        console.table(nameAndPop);
        console.log("total rows", rows.length);
      })
    ;
    

    On my slow Windows 10 netbook, the Cheerio script runs in 3 seconds, while Puppeteer with a cold cache takes 34 seconds (timed with Measure-Command).

    Disabling JS and blocking images and other resources is a good idea with Puppeteer, but I didn't bother; most of the overhead is launching the browser.
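    If you do want to trim Puppeteer's per-page overhead, request interception is the standard mechanism. A sketch; the shouldBlock helper and the list of blocked resource types are my choices, not Puppeteer defaults:

```javascript
// Decide whether to abort a request based on its resource type.
// Blocking these types is a common choice for text scraping (my assumption);
// the document HTML itself still loads.
const BLOCKED = new Set(["image", "stylesheet", "font", "media"]);

function shouldBlock(resourceType) {
  return BLOCKED.has(resourceType);
}

// Wiring it onto a Puppeteer page (sketch; assumes an already-launched browser):
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on("request", req => {
    if (shouldBlock(req.resourceType())) req.abort();
    else req.continue();
  });
}
```

    Call blockHeavyResources(page) before page.goto; note that the browser launch itself, not page loading, is still the dominant cost.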

    See also:

    Disclosure: I'm the author of the linked blog posts.