Tags: javascript, html, node.js, puppeteer, headless-browser

Want to scrape a table using Puppeteer. How can I get all rows, iterate through the rows, and then get the "td's" for each row?


I have Puppeteer set up, and I was able to get all of the rows using:

let rows = await page.$$eval('#myTable tr', row => row);

Now, for each row, I want to get the "td's" and then get the innerText from those.

Basically I want to do this:

var tds = myRow.querySelectorAll("td");

Here myRow is a table row, and I want to do the equivalent with Puppeteer.


Solution

  • One way to achieve this is to use evaluate with a function that first gets an array of all the TDs and then returns the innerText of each TD:

    const puppeteer = require('puppeteer');
    
    const html = `
    <html>
        <body>
          <table>
          <tr><td>One</td><td>Two</td></tr>
          <tr><td>Three</td><td>Four</td></tr>
          </table>
        </body>
    </html>`;
    
    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(`data:text/html,${html}`);
    
      const data = await page.evaluate(() => {
        const tds = Array.from(document.querySelectorAll('table tr td'))
        return tds.map(td => td.innerText)
      });
    
      //You will now have an array of strings
      //[ 'One', 'Two', 'Three', 'Four' ]
      console.log(data);
      //One
      console.log(data[0]);
      await browser.close();
    })();
    

    You could also use something like:

    const data = await page.$$eval('table tr td', tds => tds.map((td) => {
      return td.innerText;
    }));
    
    //[ 'One', 'Two', 'Three', 'Four' ]
    console.log(data);
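
Both snippets above return one flat array, so the row boundaries are lost. If you need the cells grouped per row, as the question asks, the following is a minimal sketch of two ways to do it. The helper names (extractRows, scrapeTable, scrapeRowByRow) and the default 'table tr' selector are made up for illustration and are not part of Puppeteer; only page.$$eval, page.$$ and elementHandle.$$eval are Puppeteer calls. Note that whatever the callback returns has to be serializable, which is why the code returns strings rather than the row elements themselves.

```javascript
// Runs inside the page: turns each <tr> into an array of its cells' text.
// It must not use variables from the Node.js scope, because Puppeteer
// serializes the function and evaluates it in the browser.
function extractRows(rows) {
  return rows.map(row =>
    Array.from(row.querySelectorAll('td'), td => td.innerText.trim())
  );
}

// Hypothetical helper (name is an assumption): one round trip to the browser.
// `page` is a Puppeteer Page; `selector` must match the table rows.
async function scrapeTable(page, selector = 'table tr') {
  return page.$$eval(selector, extractRows);
}

// Hypothetical alternative: keep a handle per row and query each one,
// which mirrors myRow.querySelectorAll("td") from the question.
// This makes one browser call per row, so it is slower on large tables.
async function scrapeRowByRow(page, selector = 'table tr') {
  const rowHandles = await page.$$(selector);
  const result = [];
  for (const row of rowHandles) {
    const cells = await row.$$eval('td', tds =>
      tds.map(td => td.innerText.trim())
    );
    result.push(cells);
  }
  return result;
}
```

With the example table from the answer, `await scrapeTable(page)` should give [ [ 'One', 'Two' ], [ 'Three', 'Four' ] ]. Rows that contain only th cells come back as empty arrays, so filter those out if your table has a header row.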