
Ad on page causes a "Bad Gateway" error when scraping


I am trying to save some webpages as PDFs with Python using pdfkit. It worked well last week, so I don't think it is a coding error on my side. Since then, the site has changed some of its ads, which now causes a "Bad Gateway" problem. I suspect a web beacon is responsible. When pdfkit tries to save the page as a PDF, I get the following error:

Warning: A slow script was stopped
Error: Failed to load https://sync-eu.connectad.io/syncer/1,
with network status code 301 and http status code 502 - Error
downloading https://sync-eu.connectad.io/syncer/1 - server
replied: Bad Gateway

The script in the page's source code that presumably causes the error:

<script type="text/javascript">
var ss=(function(){var pixelUrls=['https://sync-eu.connectad.io/syncer/1'];var MINS=60*1000;var SMARTSYNC_CALLBACK='serverbidCallBids';var SYNC_COOKIE_TTL=0*MINS;var SYNC_COOKIE='sb_ss';var pixelsInFlight=[];var inSecure=window.location.protocol.indexOf('s')<0;function createPixel(src){var p=document.createElement('iframe');p.setAttribute("height","0px");p.setAttribute("width","0px");p.setAttribute("border","0");p.setAttribute("frameBorder","0");p.setAttribute("style","position:absolute;");p.onerror=function(){return this.style.display="none";};p.setAttribute("src",src);if(window[SMARTSYNC_CALLBACK]){p.onload=function(){var i=pixelsInFlight.indexOf(src);if(i>=0){pixelsInFlight.splice(i,1);}
if(!pixelsInFlight.length){window[SMARTSYNC_CALLBACK]();}};pixelsInFlight.push(src);}
document.body.appendChild(p);}
function createCookie(){if(document.cookie.indexOf(SYNC_COOKIE)<0){var date=new Date();date.setTime(date.getTime()+ SYNC_COOKIE_TTL);return(document.cookie=SYNC_COOKIE+"; expires="+ date.toUTCString()+"; path=/"||1);}}
if(createCookie()){for(var i=0;i<pixelUrls.length;i++){var pixelUrl=pixelUrls[i];if(inSecure||pixelUrl.match(/^https:/)){createPixel(pixelUrl);}}}else if(window[SMARTSYNC_CALLBACK]){window[SMARTSYNC_CALLBACK]();}});var waitForDOM=function(evt){if(evt.target.readyState==="interactive"){ss();}}
document.addEventListener('readystatechange',waitForDOM,false);
</script>

The problem is that pdfkit still saves the PDF, but the error aborts my loop, so only one of around ten pages is saved.

It would be great if somebody could help me solve this problem.


Solution

  • OK, many similar problems with wkhtmltopdf have been reported on GitHub: https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2051

    My solution for now is simply to turn off JavaScript entirely. For many users this won't be a good solution, but it's worth a try if you are stuck like I am:

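    import pdfkit
    # Path to the wkhtmltopdf executable (replace with your own)
    path2wkhtmltopdf = r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"
    pdfconfig = pdfkit.configuration(wkhtmltopdf=path2wkhtmltopdf)
    # A value of None passes the option as a bare flag (--disable-javascript)
    options = {'encoding': "UTF-8", 'disable-javascript': None}
    # your_url and your_path2save are placeholders for the page URL and the output file
    pdfkit.from_url(your_url, your_path2save, configuration=pdfconfig, options=options)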
    

    Further wkhtmltopdf options are listed at: https://wkhtmltopdf.org/usage/wkhtmltopdf.txt
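If turning off JavaScript is too drastic, here is a rough sketch of an alternative. I have not tested it against this particular page. The wkhtmltopdf usage text lists `--load-error-handling` and `--load-media-error-handling` (values: abort, ignore, skip), which may or may not silence the ad request. Independently of that, catching the error pdfkit raises (an IOError, which is OSError in Python 3, as far as I can tell) keeps the loop going when one page fails. The names `render_with_pdfkit`, `save_pages`, and the `render` parameter are made up for this example.

```python
def render_with_pdfkit(url, out_path):
    """Render one URL to PDF. pdfkit is imported lazily so the loop below
    can be exercised without wkhtmltopdf installed."""
    import pdfkit  # needs the wkhtmltopdf binary on the system

    options = {
        "encoding": "UTF-8",
        # wkhtmltopdf flags; assumption: the failing ad request is not
        # needed for the content you want in the PDF
        "load-error-handling": "ignore",
        "load-media-error-handling": "ignore",
    }
    pdfkit.from_url(url, out_path, options=options)


def save_pages(jobs, render=render_with_pdfkit):
    """Render every (url, out_path) pair and collect failures
    instead of letting the first error abort the loop."""
    failed = []
    for url, out_path in jobs:
        try:
            render(url, out_path)
        except OSError as exc:  # pdfkit signals wkhtmltopdf errors this way
            failed.append((url, str(exc)))
    return failed
```

Usage would look like `failed = save_pages([(your_url, your_path2save), ...])`. Afterwards, `failed` holds the URLs that reported an error, so you can check whether their PDFs were still written.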