
How to pull data from PostgreSQL, process it, then store it back using JavaScript?


I'm not very familiar with advanced JavaScript and am looking for some guidance. I want to store webpage content in a database using puppeteer-cluster. Here's a starting example:

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.content();
    // Store content, do something else
  });

  cluster.queue('http://www.google.com/');
  cluster.queue('http://www.wikipedia.org/');
  // many more pages

  await cluster.idle();
  await cluster.close();
})();

It looks like I may have to use the pg package to connect to the database. What would be the recommended approach?

Here's my table:

+----+-----------------------------------------------------+---------+
| id | url                                                 | content |
+----+-----------------------------------------------------+---------+
| 1  | https://www.npmjs.com/package/pg                    |         |
+----+-----------------------------------------------------+---------+
| 2  | https://github.com/thomasdondorf/puppeteer-cluster/ |         |
+----+-----------------------------------------------------+---------+

I believe I'd have to pull the data (id and url) into an array and, each time content is received, store it in the DB (by id and content).


Solution

  • You should create a database connection outside of the task function:

    const { Client } = require('pg');
    const client = new Client(/* ... */);
    await client.connect();
    

    Then you query the data and queue it (with the ID to be able to save it in the database later on):

    // pg resolves to a result object; the row array is in its `rows` property
    const { rows } = await client.query('SELECT id, url FROM your_table WHERE ...');
    rows.forEach(row => cluster.queue({ id: row.id, url: row.url }));
    
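    The query-and-queue step can also be wrapped in a small function, which makes it easy to try out without a real database. This is only a sketch: `queueRows` is a hypothetical helper, not part of pg or puppeteer-cluster, and it assumes `db.query()` resolves to an object with a `rows` array, as pg's `client.query()` does.

    ```javascript
    // Hypothetical helper (not part of pg or puppeteer-cluster).
    // `db` only needs a pg-style query() that resolves to { rows: [...] };
    // `cluster` only needs a queue() method.
    async function queueRows(db, cluster) {
      const { rows } = await db.query('SELECT id, url FROM your_table');
      for (const { id, url } of rows) {
        // Keep the id with the URL so the task can update the right row later
        cluster.queue({ id, url });
      }
      return rows.length; // number of jobs queued
    }
    ```

    With the connection from above, it would be called as `await queueRows(client, cluster);`.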

    And then, at the end of your task function, you update the table row.

    await cluster.task(async ({ page, data: { id, url } }) => {
        // ... run puppeteer and save results in content variable
        await client.query('UPDATE your_table SET content=$1 WHERE id=$2', [content, id]);
    });
    
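    If you would rather not rely on a variable from the outer scope, the task can be built by a factory that receives the database connection as an argument. A minimal sketch, assuming `db` exposes a pg-style `query(text, values)` method; `makeSaveContentTask` is a made-up name for illustration.

    ```javascript
    // Hypothetical factory: returns a task function for cluster.task().
    // `db` is assumed to expose a pg-style query(text, values) method.
    function makeSaveContentTask(db) {
      return async ({ page, data: { id, url } }) => {
        await page.goto(url);
        const content = await page.content(); // full HTML of the loaded page
        // Parameterized query: $1 and $2 are filled from the values array
        await db.query('UPDATE your_table SET content=$1 WHERE id=$2', [content, id]);
      };
    }
    ```

    It would then be registered with `await cluster.task(makeSaveContentTask(client));`.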

    In total, your code should look like this (be aware that I have not tested the code myself):

    const { Cluster } = require('puppeteer-cluster');
    const { Client } = require('pg');
    
    (async () => {
        const client = new Client(/* ... */);
        await client.connect();
    
        const cluster = await Cluster.launch({
            concurrency: Cluster.CONCURRENCY_CONTEXT,
            maxConcurrency: 2,
        });
    
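        await cluster.task(async ({ page, data: { id, url } }) => {
            await page.goto(url);
            const content = await page.content();
            // Save the page HTML to the row this job was queued from
            await client.query('UPDATE your_table SET content=$1 WHERE id=$2', [content, id]);
        });
    
        // pg resolves to a result object; the row array is in its `rows` property
        const { rows } = await client.query('SELECT id, url FROM your_table');
        rows.forEach(row => cluster.queue({ id: row.id, url: row.url }));
    
        await cluster.idle();
        await cluster.close();
        await client.end(); // close the database connection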
    })();
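
    One thing the script above does not handle is cleanup when something throws before the end is reached. A possible way to make sure the browser and the database connection are always closed is a `try`/`finally` wrapper. This is a sketch under assumptions: `withCleanup` is a hypothetical helper, `cluster` is assumed to have `idle()` and `close()` as in puppeteer-cluster, and `db` is assumed to have an `end()` method like a pg client.

    ```javascript
    // Hypothetical helper: runs `work`, waits for the queue to drain,
    // and always closes the cluster and the database connection afterwards.
    async function withCleanup(cluster, db, work) {
      try {
        await work();         // e.g. query the rows and queue the jobs
        await cluster.idle(); // wait until all queued jobs have finished
      } finally {
        try {
          await cluster.close();
        } finally {
          await db.end();     // runs even if closing the cluster fails
        }
      }
    }
    ```

    Any error thrown by `work` still propagates to the caller; the wrapper only guarantees that both resources are released first.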