node.js express web-scraping puppeteer stock

Puppeteer - Returning null content to my scrape


I am scraping Google search results with Node.js and Puppeteer. The user enters a stock ticker, I append it to the Google search URL, and then I scrape that stock's current variation. Sometimes it works, but sometimes I receive the error: Error: Evaluation failed: TypeError: Cannot read property 'textContent' of null.

I've already tried the waitForSelector function, which timed out, and using waitUntil: "domcontentloaded" didn't work either. What should I do?

Here's the part of my code that is not working. (There are 3 possible elements, depending on whether the variation is up, down, or zero; that's why there are 2 conditionals.)

const browser = await puppeteer.launch({ args: ["--no-sandbox"] });
const page = await browser.newPage();
const ticker = fundParser(fund);
const url = "https://www.google.com/search?q=" + ticker.ticker; //Ticker value could be rztr11, arct11 or rzak11
await page.goto(url,{ waitUntil: "networkidle2"});
console.log("Visiting " + url);

 // scrapes variation text. If positive or zero, the first scrape will be null, so there's a conditional for changing its value to the correct one
var variation = await page.$(
    "#knowledge-finance-wholepage__entity-summary > div > g-card-section > div > g-card-section > div.wGt0Bc > div:nth-child(1) > span.WlRRw.IsqQVc.fw-price-dn > span:nth-child(1)"
);
if (variation == null) {
  variation = await page.$(
    "#knowledge-finance-wholepage__entity-summary > div > g-card-section > div > g-card-section > div.wGt0Bc > div:nth-child(1) > span.WlRRw.IsqQVc.fw-price-up > span:nth-child(1)"
  );
  if (variation == null) {
    variation = await page.$(
    "#knowledge-finance-wholepage__entity-summary > div > g-card-section > div > g-card-section > div.wGt0Bc > div:nth-child(1) > span.WlRRw.IsqQVc.fw-price-nc > span:nth-child(1)"
    );
}}
console.log("Extracting fund variation");
const variationText = await page.evaluate(
  (variation1) => variation1.textContent,
  variation
);
console.log("Extracted:" + variationText);

Solution

  • A couple of things:

    1. Your selectors are far too brittle: if Google changes anything the selector depends on, the whole thing breaks. You need to simplify the selector.
    2. You don't need repeated null checks; you can match any of several elements by joining their selectors with a comma (,).
    3. At the end of the day, if your selector doesn't match anything, you need to decide how your application is going to handle that error state.

    Item 1 - Brittle Selector

    For the first item, you definitely don't want the mangled class names anywhere in your selector. These are classes like .wGt0Bc, .WlRRw, .IsqQVc, etc. I don't know what tech Google is using under the hood, but it looks like some CSS-in-JS solution, which means these odd class names are completely generated and will likely change over time. If you use them in selectors, your Puppeteer script will constantly need updating; if you avoid them, it will keep working for longer.

    I'd recommend the following selector:

    #knowledge-finance-wholepage__entity-summary .fw-price-dn > span:first-child,
    #knowledge-finance-wholepage__entity-summary .fw-price-up > span:first-child,
    #knowledge-finance-wholepage__entity-summary .fw-price-nc > span:first-child
    

    Since these aren't generated, my guess is these classnames will stay the same for longer.

    Item 2 - using a single selector

    As mentioned above, you don't need repeated calls to page.$(), you can just create a selector that can match multiple elements the same way you would in CSS.
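
    As a minimal sketch of that idea, the selector list can be built from the three state classes instead of being written out by hand. The helper name buildVariationSelector is hypothetical (it is not part of Puppeteer); the ID and class names are the ones from the recommended selector above.

    ```javascript
    // Hypothetical helper (not a Puppeteer API): joins one selector per state
    // class into a single comma-separated selector list.
    function buildVariationSelector(root, stateClasses) {
        return stateClasses
            .map((cls) => `${root} .${cls} > span:first-child`)
            .join(', ');
    }

    const variationSelector = buildVariationSelector(
        '#knowledge-finance-wholepage__entity-summary',
        ['fw-price-dn', 'fw-price-up', 'fw-price-nc']
    );

    // Passing this string to a single page.$() call would replace the three
    // nested lookups: it resolves to the first matching element in document
    // order, or to null if none of the three selectors match.
    console.log(variationSelector);
    ```

    Because a selector list matches an element if any of its parts match, it does not matter which of the three classes Google happens to render for a given ticker.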

    Item 3 - Error handling

    Ultimately, your code fails because it doesn't handle the case where nothing matches: when all three page.$() calls return null, page.evaluate() is handed null and variation1.textContent throws. It is up to you to decide how to handle this case. In your example code you are just logging things, so maybe you just want to log that you were unable to get the price change for this stock ticker.

    Putting it all together

    The page.waitForSelector() method resolves to the ElementHandle once a matching element appears; if none appears before the timeout, it throws an error. As such, we can use this directly rather than page.$().

    Here is some code I was able to test locally that seems to work.

    const puppeteer = require('puppeteer');
    
    (async () => {
        const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
        const page = await browser.newPage();
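        // fundParser and fund come from the question's application code and are not defined in this snippet
        const ticker = fundParser(fund);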
        const url = 'https://www.google.com/search?q=' + ticker.ticker; //Ticker value could be rztr11, arct11 or rzak11
    
        console.log('Visiting ' + url);
        await page.goto(url, { waitUntil: 'networkidle2' });
    
        const variation_selector = `#knowledge-finance-wholepage__entity-summary .fw-price-dn > span:first-child,
        #knowledge-finance-wholepage__entity-summary .fw-price-up > span:first-child,
        #knowledge-finance-wholepage__entity-summary .fw-price-nc > span:first-child`;
    
        try {
            console.log('Extracting fund variation');
    
            // By default this has a 30 second (30000 ms) timeout. If no element is found after then, an error is thrown.
            const variation = await page.waitForSelector(variation_selector, { timeout: 30000 });
    
            const variationText = await page.evaluate(
                (variation1) => variation1.textContent,
                variation
            );
            console.log('Extracted: ' + variationText);
        } catch (err) {
            console.error('No variation element could be found.');
        }
    
        await browser.close();
    })();
    

    Alternatively, you could grab the entire text of a larger piece of content and parse that separately, rather than picking individual pieces out of the DOM.

    For example:

    const knowledge_summary_selector = '#knowledge-finance-wholepage__entity-summary > div > g-card-section';
    let knowledge_summary_inner_text;
    try {
        const knowledge_summary = await page.waitForSelector(knowledge_summary_selector);
        
        /**
         * Example value for `knowledge_summary_inner_text`:
         * "Market Summary > FI Imobiliario Riza Terrax unica\n106.05 BRL\n0.00 (0.00%)\nFeb 12, 6:06 PM GMT-3 ·Disclaimer\nBVMF: RZTR11\nFollow"
         */
        knowledge_summary_inner_text = await page.evaluate(
            (element) => element.innerText.toString().trim(),
            knowledge_summary
        );
    
        // Now, parse your `knowledge_summary_inner_text` via some means
        const knowledge_summary_pieces = knowledge_summary_inner_text.split('\n');
        // etc...
    } catch (err) {
        console.error('...');
    }
    

    Here, knowledge_summary_inner_text looks like:

    Market Summary > FI Imobiliario Riza Terrax unica
    106.05 BRL
    0.00 (0.00%)
    Feb 12, 6:06 PM GMT-3 ·Disclaimer
    BVMF: RZTR11
    Follow
    

    Now this content might be easier to parse, say, after a .split('\n') and some regular expression matching.
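
    A rough sketch of that parsing step, assuming the text keeps the line layout of the sample above (a "price currency" line followed by a "change (percent%)" line). The function name parseKnowledgeSummary and the shape of the returned object are hypothetical, and the regular expressions only cover plain decimal numbers with an optional ASCII sign; thousands separators or other number formats would need extra handling.

    ```javascript
    // Hypothetical parser: the expected line layout is an assumption taken
    // from the sample text, not something Google guarantees.
    function parseKnowledgeSummary(text) {
        const result = { price: null, currency: null, change: null, changePercent: null };

        for (const rawLine of text.split('\n')) {
            const line = rawLine.trim();

            // Price line, e.g. "106.05 BRL"
            const priceMatch = line.match(/^(\d+(?:\.\d+)?)\s+([A-Z]{3})$/);
            if (priceMatch && result.price === null) {
                result.price = Number(priceMatch[1]);
                result.currency = priceMatch[2];
                continue;
            }

            // Change line, e.g. "0.00 (0.00%)"
            const changeMatch = line.match(/^([+-]?\d+(?:\.\d+)?)\s+\(([+-]?\d+(?:\.\d+)?)%\)/);
            if (changeMatch && result.change === null) {
                result.change = Number(changeMatch[1]);
                result.changePercent = Number(changeMatch[2]);
            }
        }

        // Fields stay null when the corresponding line was not found.
        return result;
    }

    const sample =
        'Market Summary > FI Imobiliario Riza Terrax unica\n106.05 BRL\n0.00 (0.00%)\n' +
        'Feb 12, 6:06 PM GMT-3 ·Disclaimer\nBVMF: RZTR11\nFollow';
    const parsed = parseKnowledgeSummary(sample);
    console.log(parsed);
    ```

    Leaving unmatched fields as null keeps the "nothing found" case explicit, so the caller can decide how to report it instead of hitting another TypeError.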