I'm trying to scrape some web pages on the Tor network, using Puppeteer and the `tor` package (`apt install tor`).
Probably due to the nature of Tor connections, I sometimes get a timeout. In addition, I'm new to asynchronous programming in JavaScript.
Usually I have a try-catch construct like this:
```js
await Promise.all([
  page.goto(url),
  page.waitForNavigation({
    waitUntil: 'domcontentloaded'
  }),
]).catch((err) => { logMyErrors(err, true); });
```
or
```js
let langMenu = await page.waitForXPath('//*[contains(@class, ".customer_name")]/ancestor::li').catch((err) => { logMyErrors(err, true); });
```
But I think one or more retries would often help to eventually get the desired resource. Is there any best practice for implementing retries?
I would recommend this rather simple approach:
```js
async function retry(promiseFactory, retryCount) {
  try {
    return await promiseFactory();
  } catch (error) {
    if (retryCount <= 0) {
      throw error;
    }
    return await retry(promiseFactory, retryCount - 1);
  }
}
```
This function calls the `promiseFactory` and waits for the returned Promise to settle. If an error occurs, the process is repeated (recursively) until `retryCount` reaches `0`.
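One detail worth noting: `retry` takes a function that *creates* the Promise, not the Promise itself, because a Promise represents a single attempt that is already in flight; only calling the factory again starts a fresh attempt. A runnable sketch (the flaky `factory` here is a made-up stand-in for a Puppeteer call such as `() => page.goto(url)`):

```js
// The retry helper from above.
async function retry(promiseFactory, retryCount) {
  try {
    return await promiseFactory();
  } catch (error) {
    if (retryCount <= 0) {
      throw error;
    }
    return await retry(promiseFactory, retryCount - 1);
  }
}

// Stand-in for a real page call: fails on the first attempt, then succeeds.
let attempts = 0;
const factory = () => new Promise((resolve, reject) => {
  attempts += 1;
  if (attempts < 2) {
    reject(new Error('temporary failure'));
  } else {
    resolve('done');
  }
});

retry(factory, 3).then((result) => {
  console.log(result, attempts); // logs: done 2
});
```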
Code Sample
You can use the function like this:
```js
await retry(
  () => page.waitForXPath('//*[contains(@class, ".customer_name")]/ancestor::li'),
  5 // retry this 5 times
);
```
You can also pass any other function returning a Promise, like `Promise.all`:
```js
await retry(
  () => Promise.all([
    page.goto(url),
    page.waitForNavigation({ waitUntil: 'domcontentloaded' }),
  ]),
  1 // retry only once
);
```
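Since a bad Tor circuit can stay bad for a moment, it may also help to pause between attempts. A loop-based sketch with a configurable delay (the `retryWithDelay` name and the `delayMs` parameter are my additions, not part of the answer above):

```js
// Like retry, but iterative and with an optional pause between attempts.
async function retryWithDelay(promiseFactory, retryCount, delayMs = 0) {
  for (;;) {
    try {
      return await promiseFactory();
    } catch (error) {
      if (retryCount <= 0) {
        throw error; // out of retries: surface the last error
      }
      retryCount -= 1;
      if (delayMs > 0) {
        // Wait before starting the next attempt.
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
}
```

Usage is identical, e.g. `await retryWithDelay(() => page.goto(url), 5, 2000);` to wait two seconds between attempts.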
Don't combine `await` and `.catch`
One more piece of advice: you should not combine `await` with `.then` or `.catch`, as this can lead to unexpected problems. Either use `await` and surround your code with a `try..catch` block, or use `.then` and `.catch`. Otherwise your code might end up waiting for the result of a `catch` handler to finish, etc.
Instead, use `try..catch` like this:
```js
try {
  // ...
} catch (error) {
  logMyErrors(error);
}
```
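To see why the mixed style is risky: when the Promise rejects, `await expr.catch(handler)` resolves to whatever the handler returns (often `undefined`), and the surrounding code keeps running as if the call had succeeded. A small runnable illustration (the `fetchName` function is made up for the demo):

```js
// Simulates a Puppeteer call that times out.
async function fetchName() {
  throw new Error('timeout');
}

// Mixed style: the error is swallowed, and name is silently undefined.
async function mixedStyle() {
  const name = await fetchName().catch((err) => { /* logMyErrors(err) */ });
  return name; // undefined — later code may crash on it
}

// try..catch style: the failure path is explicit.
async function tryCatchStyle() {
  try {
    return await fetchName();
  } catch (error) {
    return 'fallback';
  }
}

mixedStyle().then((v) => console.log(v));    // logs: undefined
tryCatchStyle().then((v) => console.log(v)); // logs: fallback
```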