javascript, node.js, cheerio

Using Cheerio for scraping many sites


I am using cheerio to scrape about 800 websites, just to get each site's title. The first issue is that I sometimes get an error message saying "We’ve encountered an error: Error: socket hang up". Secondly, maybe because of cheerio's asynchronous nature, when I log the created objects they all have the address of the last URL in the array. Finally, I log the array that I have been pushing the objects into, but it is logged immediately as [], because that line runs before any of the requests complete. How can I fix these three issues?

var tempArr = [];
var completedLinks = ["http://www.example.com/page1", "http://www.example.com/page2", "http://www.example.com/page3"...];

for (var foundLink in completedLinks){

  if(ValidURL(completedLinks[foundLink])){

    request(completedLinks[foundLink], function (error, response, body) {

      if (!error) {
        var $ = cheerio.load(body);
        var titles = $("title").text();

        var tempObj = {};
        tempObj.title = titles;
        tempObj.address = completedLinks[foundLink]

        tempArr.push(tempObj);

        console.log(tempObj)
      }else{
        console.log("We’ve encountered an error: " + error);
      }    

    });

  }
}
console.log(tempArr);

Solution

  • Your "socket hang up" errors are probably caused by rate limiting, which many sites implement. I'd guess that these errors tend to happen on the second or later request to the same site. One thing you might want to do is group your links by host and use setTimeout to space out every call after the first to that host.
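
The per-host throttling suggested above could be sketched roughly as follows. This is only an illustration: the helper names groupByHost, buildSchedule and runThrottled are hypothetical, the delay value is an assumption you would tune per site, and processLink stands for whatever function issues a single request.

```javascript
// Group links by hostname so each host can be throttled independently.
// Assumes every link is already a valid absolute URL (new URL throws otherwise).
function groupByHost(links) {
  var groups = Object.create(null);
  links.forEach(function (link) {
    var host = new URL(link).hostname;
    (groups[host] = groups[host] || []).push(link);
  });
  return groups;
}

// Give each link a start delay: the n-th link for a host waits n * delayMs,
// so the first request to every host starts immediately.
function buildSchedule(links, delayMs) {
  var groups = groupByHost(links);
  var schedule = [];
  Object.keys(groups).forEach(function (host) {
    groups[host].forEach(function (link, n) {
      schedule.push({ link: link, delayMs: n * delayMs });
    });
  });
  return schedule;
}

// Start each request at its scheduled time using setTimeout.
function runThrottled(links, delayMs, processLink) {
  buildSchedule(links, delayMs).forEach(function (entry) {
    setTimeout(function () { processLink(entry.link); }, entry.delayMs);
  });
}
```

For example, runThrottled(completedLinks, 1000, processLink) would start requests to different hosts right away while spacing requests to the same host one second apart.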

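    Your "last web address" issue is a classic JavaScript gotcha regarding scope. A variable declared with var is scoped to the enclosing function, not to the loop iteration, so every callback shares the same foundLink. By the time the responses arrive the loop has already finished, and foundLink holds the last key. At the very least you should process each request in its own function, so that each call gets its own copy of the link: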

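    function processLink(link){
      if (!ValidURL(link)) return;
      request(link, function (error, response, body) {
        // same callback body as before, but with tempObj.address = link
      });
    }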
    

    then

    for (var foundLink in completedLinks){ 
      processLink(completedLinks[foundLink]);
    }
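
The scoping problem can be reproduced without any network calls. The sketch below is a hypothetical demonstration (the function names are made up for illustration): the callbacks are stored and only run after the loop has finished, which is what happens with request's callbacks too.

```javascript
// Callbacks close over the single shared loop variable `i`.
function collectWithSharedVar(links) {
  var callbacks = [];
  for (var i in links) {
    callbacks.push(function () { return links[i]; });
  }
  // The loop is done, so `i` now holds the last key for every callback.
  return callbacks.map(function (cb) { return cb(); });
}

// Passing the value into a function gives each callback its own `link`.
function collectWithParameter(links) {
  var callbacks = [];
  function makeCallback(link) {
    return function () { return link; };
  }
  for (var i in links) {
    callbacks.push(makeCallback(links[i]));
  }
  return callbacks.map(function (cb) { return cb(); });
}

console.log(collectWithSharedVar(["a", "b", "c"])); // [ 'c', 'c', 'c' ]
console.log(collectWithParameter(["a", "b", "c"])); // [ 'a', 'b', 'c' ]
```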
    

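    Finally, to wait until all requests are done before logging the results, you should consider Promises: wrap each request in a Promise and use Promise.all to run code once every one of them has settled.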

    Ignoring the throttling issue:

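    function processLink(link){
      return new Promise(function(resolve, reject) {
        request(link, function (error, response, body) {
          if (error) {
            // Resolve rather than reject, so one failed site does not make
            // Promise.all reject; a promise that never settles would hang it.
            return resolve({ address: link, error: String(error) });
          }
          var $ = cheerio.load(body);
          // Build the result from the link parameter, not the loop variable.
          resolve({ title: $("title").text(), address: link });
        });
      });
    }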
    
    var promises = [];
    for (var foundLink in completedLinks){ 
      promises.push(processLink(completedLinks[foundLink]));
    }
    Promise.all(promises).then(function(tempObjArr){...});
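
What the final .then callback might do with its results can be sketched as below. This assumes, purely for illustration, that each promise resolves to either { title, address } on success or { address, error } on failure; fakeProcessLink is a hypothetical stand-in for the real request so the example runs without network access, and splitResults and scrapeAll are likewise made-up names.

```javascript
// Hypothetical stand-in for processLink: resolves asynchronously and
// reports links containing "bad" as failures.
function fakeProcessLink(link) {
  return new Promise(function (resolve) {
    setTimeout(function () {
      if (link.indexOf("bad") !== -1) {
        resolve({ address: link, error: "socket hang up" });
      } else {
        resolve({ title: "Title of " + link, address: link });
      }
    }, 0);
  });
}

// Separate successful results from failed ones.
function splitResults(results) {
  return {
    ok: results.filter(function (r) { return !r.error; }),
    failed: results.filter(function (r) { return Boolean(r.error); })
  };
}

// Run the worker on every link; Promise.all keeps results in input order.
function scrapeAll(links, worker) {
  var promises = links.map(function (link) { return worker(link); });
  return Promise.all(promises).then(splitResults);
}

scrapeAll(["http://ok.example/", "http://bad.example/"], fakeProcessLink)
  .then(function (summary) {
    console.log(summary.ok.length + " succeeded, " + summary.failed.length + " failed");
  });
```

Because the logging happens inside .then, it only runs after every request has completed, which avoids the empty [] seen in the question.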