python, pdf-generation, pypdf, reportlab

Is there anything I can do with Python to make heavy PDF files lighter?


Many PDFs are very heavy and slow, maybe because they are loaded with huge graphical backgrounds, and there must be a way to make them lighter with a Python script.

I have downloaded many PDF ebooks that are very heavy on my Foxit (PDF viewer) and XP system (like this: https://ibb.co/2dN6V69), and I want to make them lighter, but I don't know how to do that. Can I use reportlab or pypdf (or any other Python library) to explore and analyze the layers or remove the background of the file? I just want the text and a white background behind it.


Solution

  • The PDF in the question is a complex mix of post-processing during/after scanning and archival.

    Many pages have different characteristics, since they were generally split into two images plus invisible OCR text. If you remove all the high-density images, you will lose the text that was not converted, and all the bad spellings will become apparent.
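
    You can confirm this structure yourself before deciding what to strip. The following is a minimal sketch using pypdf (version 3 or later, with Pillow installed for image access); the file name book.pdf is a placeholder. It lists the embedded images and the invisible OCR text on each page.

    from pypdf import PdfReader

    reader = PdfReader("book.pdf")                # placeholder file name
    for number, page in enumerate(reader.pages, start=1):
        ocr_text = page.extract_text() or ""      # the invisible OCR layer
        images = page.images                      # embedded image XObjects
        print(f"page {number}: {len(images)} image(s), "
              f"{len(ocr_text)} characters of OCR text")
        for img in images:
            print(f"  {img.name}: {len(img.data):,} bytes")
    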

    Removing just the background image destroys as much "readability" as removing the text masks.

    The excessive compression of dense textual images is causing pages to take about 1.5 seconds to render in my fast viewer. That can be improved drastically by making the file larger via decompression.

    Slow rendering: 1519.26 ms, page: 6 in 'Nadia Comaneci -- Miklowitz, Gloria D -- New York, 1977.pdf'
    Slow rendering: 1428.18 ms, page: 7 in 'Nadia Comaneci -- Miklowitz, Gloria D -- New York, 1977.pdf'
    Slow rendering: 1326.82 ms, page: 4 in 'Nadia Comaneci -- Miklowitz, Gloria D -- New York, 1977.pdf'
    Slow rendering: 1324.60 ms, page: 3 in 'Nadia Comaneci -- Miklowitz, Gloria D -- New York, 1977.pdf'
    Slow rendering: 1345.35 ms, page: 2 in 'Nadia Comaneci -- Miklowitz, Gloria D -- New York, 1977.pdf'
    Slow rendering: 1517.27 ms, page: 1 in 'Nadia Comaneci -- Miklowitz, Gloria D -- New York, 1977.pdf'
    Slow rendering: 1346.99 ms, page: 5 in 'Nadia Comaneci -- Miklowitz, Gloria D -- New York, 1977.pdf'
    

    To get a fine balance between rendering and decompression, the best solution is to rebuild the PDF using a PDF rewriter like Ghostscript.

    By increasing the file size and restructuring the images, we can get a well-rounded result: a somewhat larger file that renders much faster (optimisation for speed rather than size).

    The best approach is thus to write a loop that runs over all the files via a shell call (here I am using Windows). Adapt it to your own programming language; a Python version is sketched after the command below.

    gs -sDEVICE=pdfwrite -o"C:\output path\output.pdf" -dPDFSETTINGS=/screen -f "C:\input path\.pdf"
    
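
    Since the question asks for Python, here is a minimal sketch of that loop using subprocess. It assumes Ghostscript is installed and reachable as gswin64c (the usual Windows console binary; use gs on Linux), and the folder names are the same placeholders as in the command above.

    import subprocess
    from pathlib import Path

    GS = "gswin64c"                     # Ghostscript console binary on Windows; "gs" on Linux
    src_dir = Path(r"C:\input path")    # placeholder folders, as in the command above
    out_dir = Path(r"C:\output path")
    out_dir.mkdir(parents=True, exist_ok=True)

    for pdf in src_dir.glob("*.pdf"):
        out_file = out_dir / pdf.name
        subprocess.run(
            [GS, "-sDEVICE=pdfwrite", f"-o{out_file}",
             "-dPDFSETTINGS=/screen", "-f", str(pdf)],
            check=True,                 # stop if Ghostscript reports an error
        )
        print(f"{pdf.name}: {pdf.stat().st_size:,} -> {out_file.stat().st_size:,} bytes")
    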

    Rendering then becomes so fast that all 100 pages parse in less than a second. It may be slow to do that decompression once across all pages, but the PDF opens faster every time thereafter.

    Even if I let a lossless pso compressor very, very slowly reduce the file down to 8.48 MB (8,893,651 bytes), it will still be fast enough to skim-read 100 pages in a few seconds!

    LoadDocument: 33.43 ms, 100 pages for C:\demo\Nadia.pso.pdf'
    DisplayModel::BuildPagesInfo started
    DisplayModel::BuildPagesInfo took 0.02 ms
    

    With these settings and that file, the size increases from 7.29 MB (7,648,368 bytes) up to 11.3 MB (11,851,869 bytes). That is a very good result for only about 4 MB extra, compared to the "Source.pdf", which was stated as 203 MB when scanned.

    Both text images will be retained, since they are converted to much more efficient JPEGs at considerably reduced sizes.
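
    If you want to verify that programmatically, a small pypdf check of the image compression filters is sketched below (output.pdf is a placeholder for the rewritten file); /DCTDecode indicates a JPEG-compressed image stream.

    from collections import Counter
    from pypdf import PdfReader

    filters = Counter()
    reader = PdfReader("output.pdf")    # placeholder: the rewritten file
    for page in reader.pages:
        resources = page.get("/Resources")
        if resources is None:
            continue
        xobjects = resources.get_object().get("/XObject")
        if xobjects is None:
            continue
        for ref in xobjects.get_object().values():
            obj = ref.get_object()
            if obj.get("/Subtype") == "/Image":
                filters[str(obj.get("/Filter"))] += 1

    print(filters)                      # summary of image filters found
    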
