node.js web-scraping apify

How do I pass parameters to Apify BasicCrawler handleRequestFunction?


I'm trying to migrate an existing function to use it inside an Apify actor.

Originally, the function loads a given URL, reads its JSON response, and according to some supplied parameters, extracts some data and returns an object with results.

To be clear, it isn't scraping anything "final" at this point. Its results are temporary and will be used to build other URLs, which will then be scraped (with another crawler) for the actual, useful results.

The current function that executes the crawler is something like this:

let url = new URL('/content', someBaseURL);
url.searchParams.set('search', someKeyword);
const reqList = new apify.RequestList({
    sources: [ { url: url.toString() } ]
});
await reqList.initialize();
const crawler = new apify.BasicCrawler({
    requestList: reqList,
    handleRequestFunction: reqHandler
});
// How do I set the inputs for reqHandler() here ?
await crawler.run();
// How do I get the output from reqHandler() here ?

And the reqHandler code is something like this:

async function reqHandler(options) {
    const response = await apify.utils.requestAsBrowser({
        url: options.request.url
    });
    // How do I read parameters from the caller here ?
    let searchResults = JSON.parse(response.body);
    // ... result object creation logic goes here ...
    // How do I return a result to the caller here ?
}

I am pretty new to this Apify thing and lost in the documentation.

Thanks for your help.


Solution

  • handleRequestFunction doesn't take any external input or return any output to the caller. Use it as a closure that captures its inputs from the surrounding code, or wrap it in another function that supplies them.

    Normally we do it like this:

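    const context = {}; // put your inputs here
    
    const crawler = new apify.BasicCrawler({
        requestList: reqList,
        handleRequestFunction: async ({ request }) => {
            // context is captured from the enclosing scope; read your inputs from it here
            const results = {}; // ... build the results from request.url and context ...
            
            // output data to the default dataset
            await apify.pushData(results);
        }
    });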
    

    EDIT: I forgot to mention how to pass input along with an individual request. Attach it to the request's userData object when adding the request to a queue or a list.

    // userData works the same way for RequestList sources.
    await requestQueue.addRequest({
        url: 'https://example.com',
        userData: { myInput: 'any-data' }
    });
    
    // Then, in the crawler options:
    handleRequestFunction: async ({ request }) => {
        const { myInput } = request.userData;
        // ...
    }
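
Putting the two ideas together, here is a minimal sketch of the "wrap it in a different function" approach. It is not Apify code: makeReqHandler, fetchJson, fakeFetchJson, params.type and the items field are all hypothetical names made up for illustration, and the HTTP call is replaced by a stub so the sketch runs offline. The handler captures the caller's inputs through the closure, reads per-request input from request.userData, and hands results back by pushing into an array the caller owns.

```javascript
// Hypothetical factory: returns a handler that closes over the caller's inputs
// (params, fetchJson) and writes its output into a caller-owned results array.
function makeReqHandler({ params, results, fetchJson }) {
    return async ({ request }) => {
        // Per-request input travels with the request itself.
        const { myInput } = request.userData || {};

        // fetchJson is an assumed helper that loads a URL and parses the JSON body.
        const searchResults = await fetchJson(request.url);

        // Placeholder for the real result-building logic.
        const matching = searchResults.items.filter((item) => item.type === params.type);

        // "Return" to the caller by mutating the shared array.
        results.push({ url: request.url, myInput, count: matching.length });
    };
}

// Stand-in for the HTTP call, with made-up data, so the sketch needs no network.
async function fakeFetchJson(url) {
    return { items: [{ type: 'a' }, { type: 'b' }, { type: 'a' }] };
}

async function demo() {
    const results = [];
    const handler = makeReqHandler({
        params: { type: 'a' },
        results,
        fetchJson: fakeFetchJson
    });

    // A crawler would call the handler once per request; here it is called by hand.
    await handler({
        request: {
            url: 'https://example.com/content?search=foo',
            userData: { myInput: 'any-data' }
        }
    });

    return results;
}
```

In the actor, the same factory would presumably be passed as handleRequestFunction: makeReqHandler({ ... }), and the results array read after await crawler.run() resolves. If the results only need to be stored rather than returned to the caller, pushData inside the handler is simpler.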