javascript, node.js, web-scraping, bots

Node.js Scraping Data Click Event


I have a repetitive task that I have to do at regular intervals. Basically, I need to open a website, read some values from several tables, and write them to a spreadsheet. I then use these values to make some calculations, prepare a report, and so on.

I would like to create a helper bot, because this is a straightforward task. I can already get the information by opening the browser console while I am on the relevant page and fetching the data with DOM methods or jQuery.

I would like to take it a step further and create a Node.js application: instead of opening the website myself, I would send my bot to the page and have it perform the same actions I do in the console. I started writing something with cheerio. However, at some point my bot needs to click a button in order to switch to a different table. I searched but couldn't find a way to do this.

My question is: is it possible to click a button on the server side (to change the table) and then fetch data from that table?

If you know a better way to create this kind of bot, please suggest it.

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app = express();

app.get('/scrape', (req, res) => {
    var url = 'http://www.imdb.com/title/tt1229340/';

    request(url, function (error, response, html) {
        var json = { title: "", release: "" };

        if (!error) {
            var $ = cheerio.load(html);

            // A regular function is needed here: an arrow function
            // would not bind `this` to the matched element.
            $('.header').filter(function () {
                var data = $(this);
                json.title = data.children().first().text();
                json.release = data.children().last().children().text();
            });

            // This is not possible with cheerio:
            // $("#target").click(function () {
            //     alert("Handler for .click() called.");
            // });
        }

        fs.writeFile('output.json', JSON.stringify(json, null, 4), (err) => {
            if (err) { return console.error(err); }
            console.log('File successfully written!');
        });

        res.send('Check your console!');
    });
});

app.listen(8080);

Edit: The answer to this question is "Use Zombie".

Now I have another question related to this one. I am trying to learn & use zombie. I could

However, with this method I only get one badly jumbled string: the text of all the tds is printed without any whitespace between them, so there is no way to clean it up. Basically, I want to put all the tds in an array. How can I do that?

browser.visit(url, () => {
    var result = browser.text('table > tbody.bodyName td');
    console.log(result);
});

Solution

  • I'd suggest using a headless browser such as PhantomJS or Zombie for this purpose. What you're trying to do above is assign a click handler to an element in Cheerio, which won't work: Cheerio only parses the HTML it is given and never runs the page's scripts.

    You should be able to click a button based on the element selector in Zombie.js.

    There's a browser.pressButton command in Zombie.js for this purpose.

    Here's some sample code using Zombie.js, in this case clicking a link:

    const Browser = require('zombie');
    const url = 'http://www.imdb.com/title/tt1229340/';

    const browser = new Browser();
    browser.visit(url)
        .then(() => {
            console.log(`Visited ${url}..`);
            // Return the promise so the catch below also sees click errors
            return browser.clickLink("FULL CAST AND CREW");
        })
        .then(() => {
            console.log('Clicked link..');
            // Dump the browser state to the console for debugging
            browser.dump();
        })
        .catch(error => {
            console.error(`Error occurred while scraping ${url}`, error);
        });
    

    As for the next part of the question, we can select elements using zombie.js and get an array of their text content:

    const Browser = require('zombie');
    const url = 'http://www.imdb.com/title/tt1229340/';

    const browser = new Browser();
    browser.visit(url).then(() => {
        console.log(`Visited ${url}..`);
        // queryAll returns an array of the matching DOM elements
        const cells = browser.queryAll('.cast_list td');
        // Keep each cell's trimmed text, dropping empty and very short cells
        const cellTextArray = cells
            .map(cell => cell.textContent.trim())
            .filter(text => text.length > 3);

        console.log(cellTextArray);
    }).catch(error => {
        console.error(`Error occurred visiting ${url}`, error);
    });
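
Putting the two parts together, a minimal sketch of pressing a button and then reading the table row by row might look like the following. The function name `scrapeTableAfterClick`, its parameters, and the `#next` selector in the usage comment are made up for illustration. The browser is passed in as an argument, so the function relies only on `visit`, `pressButton` and `queryAll`. It also assumes the page has finished updating the table by the time `pressButton` resolves, which may not hold for every site.

```javascript
// Sketch only: `scrapeTableAfterClick` and its parameters are hypothetical
// names, not part of Zombie.js.
async function scrapeTableAfterClick(browser, url, buttonSelector, rowSelector) {
    await browser.visit(url);

    // Click the button that switches the table, then wait for the promise
    await browser.pressButton(buttonSelector);

    // Build one inner array per table row, one trimmed string per <td>
    return browser.queryAll(rowSelector)
        .map(row => Array.from(row.querySelectorAll('td'))
            .map(cell => cell.textContent.trim()))
        // Rows without any <td> (for example header rows) are dropped
        .filter(cells => cells.length > 0);
}

// Usage, assuming the `zombie` package is installed and the page has a
// button matching '#next' (a hypothetical selector):
//
//   const Browser = require('zombie');
//   const url = 'http://www.imdb.com/title/tt1229340/';
//   scrapeTableAfterClick(new Browser(), url, '#next', 'table > tbody.bodyName tr')
//       .then(rows => console.log(rows))
//       .catch(error => console.error(error));
```

Each inner array then maps directly onto one spreadsheet row. If the control that switches the table is a link rather than a button, `clickLink` from the first sample could be used in place of `pressButton`.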