python-3.x wkhtmltopdf pdfkit windows-11 ubuntu-22.04

Python/wkhtmltopdf PDF generation in Ubuntu generates a file size 20 times larger than the same file on Windows


I'm using Python 3.10's 'pdfkit' (1.0.0), which calls 'wkhtmltopdf' (0.12.6) in the background, to generate a PDF using code-generated HTML as the source.

The HTML has a base64-encoded TTF font embedded, and each page contains six small (<1 KB) base64-encoded images along with text and divs. The PDF document is only 8 A4 pages.

The call to create the PDF is simply:

pdfkit.from_string(source_html, "filename.pdf")

On Windows 11 the file created is ~450 KB.

On Ubuntu 22.04 the file created is ~9 MB.

The files, when opened, are visually identical, so what is causing this file size discrepancy and how can I fix it?


Solution

  • I solved this today. My image-heavy PDF was 3 times bigger on Ubuntu than on Windows.

    First, I added the wkhtmltopdf option "image-quality" to the options dict I pass to from_string/from_file:

    pdfkit.from_file("templatised.html", "out.pdf", options=options)
    

    I regression-tested this on Windows and it made no noticeable difference. I set the quality to the same value I used for the constituent images. I dislike the thought of applying lossy encoding twice, but I chose wkhtmltopdf and I will stick with it.
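
As a minimal sketch of the options dict described above: the helper names `build_options` and `render` are hypothetical, and the quality value is left as a parameter because the answer does not state the value that was used. pdfkit passes each key through to wkhtmltopdf as a `--key value` switch.

```python
def build_options(image_quality: int) -> dict:
    """Return a pdfkit options dict that sets wkhtmltopdf's --image-quality."""
    # pdfkit turns each key into a "--key value" command-line switch.
    return {"image-quality": str(image_quality)}


def render(html_path: str, pdf_path: str, image_quality: int) -> None:
    """Render an HTML file to PDF with the given JPEG quality (hypothetical helper)."""
    # Imported lazily so build_options can be used without pdfkit installed.
    import pdfkit

    pdfkit.from_file(html_path, pdf_path, options=build_options(image_quality))
```

Calling `render("templatised.html", "out.pdf", 85)` would then correspond to the call shown above, with 85 standing in as a placeholder quality.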

    Now, on Linux the option is required, but the python-pdfkit repo warns:

    debian/ubuntu repos have reduced functionality (because it compiled without the wkhtmltopdf QT patches)

    The warning doesn't name image-quality as one of the missing features. The quality defaults to 94, which might explain the huge file sizes.

    Setting image-quality produces a warning when using the wkhtmltopdf package supplied by the OS repositories:

    The switch --image-quality, is not support using unpatched qt, and will be ignored
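
One way to check in advance whether the installed binary will ignore the switch is to inspect its version string. The sketch below assumes that patched builds report "(with patched qt)" in the output of `wkhtmltopdf --version`; that is an assumption about the output format rather than something stated above, and both function names are hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def is_patched_qt(version_output: str) -> bool:
    """True if the version string reports a patched-Qt build (assumed format)."""
    return "with patched qt" in version_output.lower()


def installed_version_output() -> Optional[str]:
    """Return the output of `wkhtmltopdf --version`, or None if it is not on PATH."""
    exe = shutil.which("wkhtmltopdf")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()
```

If `installed_version_output()` returns a string for which `is_patched_qt` is false, the image-quality option is likely to be ignored as in the warning above.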

    The python-pdfkit page mentioned above links to solutions for this, but they are out of date. I could not find a solution for Debian bookworm, so I had to revert to bullseye.

    Precompiled wkhtmltopdf binaries for each OS can be found here. The wkhtmltopdf repository was archived in January 2023 and wkhtmltopdf/packaging in August 2023, so I don't expect any updates.

    Here is the script from JazzCore to install the QT patched version:

    WKHTML2PDF_VERSION='0.12.6-1'
    
    sudo apt install -y build-essential xorg libssl-dev libxrender-dev wget
    wget "https://github.com/wkhtmltopdf/packaging/releases/download/${WKHTML2PDF_VERSION}/wkhtmltox_${WKHTML2PDF_VERSION}.bionic_amd64.deb"
    sudo apt install -y ./wkhtmltox_${WKHTML2PDF_VERSION}.bionic_amd64.deb
    

    That script targets bionic_amd64 and 0.12.6-1. I wanted Debian bullseye and 0.12.6.1-2, so I changed the version and the distribution name accordingly:

    https://github.com/wkhtmltopdf/packaging/releases/download/0.12.6.1-2/wkhtmltox_0.12.6.1-2.bullseye_amd64.deb
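
The two download URLs in this answer follow the same pattern, so the substitution can be sketched as a small helper. The function name `wkhtmltox_deb_url` is hypothetical, and the pattern is inferred only from the two URLs shown here; not every version and codename combination has a published package, so the releases page should be checked before relying on a generated URL.

```python
BASE_URL = "https://github.com/wkhtmltopdf/packaging/releases/download"


def wkhtmltox_deb_url(version: str, codename: str, arch: str = "amd64") -> str:
    """Build the .deb download URL for a release tag and distribution codename."""
    # Pattern inferred from the bionic and bullseye URLs in this answer.
    return f"{BASE_URL}/{version}/wkhtmltox_{version}.{codename}_{arch}.deb"
```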

    Several other SO posts validated my approach, such as:

    How to install wkhtmltopdf with patched qt?

    See also https://pythonspeed.com/articles/base-image-python-docker-images/ for advice on migrating from bullseye-slim Docker images to the more recent Ubuntu Jammy.