node.js · web-scraping · axios · cheerio

How can I achieve multiple-page scraping with axios and cheerio


Hello, I am using axios with cheerio to scrape some data. I want to scrape multiple pages; the URL structure is like example.com/?page=1. How can I scrape every single page with a counter?

const axios = require("axios");
const cheerio = require("cheerio");

axios({
    method: "get",
    url: "https://example.com/?page=",
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
    }
  }).then(res => {
    const $ = cheerio.load(res.data);
    // ... this scrapes a single page; how do I repeat it for every page?
  });

Solution

  • I believe there are multiple ways to achieve this, but basically you need to dispatch all of the axios requests and parse each response with Cheerio programmatically.

    If you know how many pages you want to scrape

    You can create a simple for loop and push all of the axios calls into an array one by one with the generated URLs. Then you can resolve them together with Promise.all.

    const promises = [];
    
    for (let page = 1; page <= 5; page++) {
         promises.push(
              axios({ method: "get", url: `https://example.com?page=${page}` })
              .then(res => {
                  // Parse this page's result with Cheerio (or whatever you like) here
              })
         );
    }
    
    // You can collect the parsed results in this resolve if you want.
    Promise.all(promises).then(...)
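
    Put together, a self-contained version of that loop might look like the sketch below; the `.item-title` selector and the fixed range of 5 pages are just placeholders for whatever the target site actually uses.
    
    const axios = require("axios");
    const cheerio = require("cheerio");
    
    const promises = [];
    for (let page = 1; page <= 5; page++) {
         promises.push(
              axios({ method: "get", url: `https://example.com?page=${page}` })
                   .then(res => {
                        // Load the returned HTML and pull out the bits you need from this page.
                        const $ = cheerio.load(res.data);
                        return $(".item-title").map((i, el) => $(el).text().trim()).get();
                   })
         );
    }
    
    // Each entry of `pages` holds the titles scraped from one page.
    Promise.all(promises).then(pages => {
         console.log(pages.flat());
    });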
    

    If you are scraping a list page and the total number of pages is unknown

    Then you can create an async, recursive function that dispatches the request with axios and conditionally iterates. That way you can also reduce the maximum memory usage compared with the solution above, but it will be slower because the requests will not run in parallel.

    // The function below is still a sketch: adapt the "has next page" check to the target site's markup.
    const dispatchRequest = async (page) => {
         const response = await axios({ method: "get", url: `https://example.com?page=${page}` });
         // Ex: parse the response with Cheerio and check whether pagination is still enabled
         const $ = cheerio.load(response.data);
         const hasNextPage = $(".pagination .next").not(".disabled").length > 0; // example selector
         if (hasNextPage) {
              return dispatchRequest(page + 1);
         }
         return response;
    }
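
    If you also want to keep the data from every page instead of only the final response, a small variation of the same idea (again just a sketch, with a made-up `.item` selector) can accumulate the results as it recurses:
    
    const scrapeAllPages = async (page = 1, results = []) => {
         const response = await axios({ method: "get", url: `https://example.com?page=${page}` });
         const $ = cheerio.load(response.data);
         // Collect whatever you need from the current page (the selector is a placeholder).
         $(".item").each((i, el) => results.push($(el).text().trim()));
         // Keep recursing while the "next" link on the page is still enabled.
         const hasNextPage = $(".pagination .next").not(".disabled").length > 0;
         return hasNextPage ? scrapeAllPages(page + 1, results) : results;
    };
    
    scrapeAllPages().then(items => console.log(`Scraped ${items.length} items`));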
    

    The solutions above have downsides, of course. If you get blocked by the target website or your request somehow fails, you have no chance to retry the same request or rotate your proxies to bypass the target website's security.

    I'd suggest implementing a queue and putting all of the request-dispatch functions there. That way you can detect failures and enqueue the failed requests again. You can also implement both of the solutions above with queue support, run them in parallel, and manage your memory/CPU consumption much better.
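
    For illustration, a simple in-memory queue with retries and a concurrency limit could look like the sketch below; the page range, retry count, and concurrency values are arbitrary, and proxy rotation is only indicated as a comment.
    
    const axios = require("axios");
    const cheerio = require("cheerio");
    
    const queue = [];          // pending jobs: { page, retries }
    const maxRetries = 3;
    const concurrency = 3;     // how many requests run at the same time
    
    const worker = async () => {
         while (queue.length > 0) {
              const { page, retries } = queue.shift();
              try {
                   const res = await axios({ method: "get", url: `https://example.com?page=${page}` });
                   const $ = cheerio.load(res.data);
                   // Parse the page here.
              } catch (err) {
                   // The request failed: re-enqueue it (and optionally rotate your proxy) until the retries run out.
                   if (retries < maxRetries) {
                        queue.push({ page, retries: retries + 1 });
                   }
              }
         }
    };
    
    // Seed the queue with the pages you want and start a few workers in parallel.
    for (let page = 1; page <= 20; page++) {
         queue.push({ page, retries: 0 });
    }
    
    Promise.all(Array.from({ length: concurrency }, () => worker())).then(() => console.log("done"));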

    You can also use SDKs. I've seen a couple of scraping SDKs that provide this whole toolset, so you won't have to reinvent the wheel.