python pdf pdfplumber

How to save PDF after cropping from each page of PDF using pdfplumber?


I have a multi-page PDF with a table at the top of every page that I want to remove, so I crop each page just below that table.

What I don't know is how to combine the cropped pages and save them as a single PDF.

This is what I have tried:

import pdfplumber

path = r"file-tests.pdf"

with pdfplumber.open(path) as pdf:
    pages = pdf.pages

    # loop over each page
    for p in pages:
        print(p)

        # bbox of the first table on the page, in (x0, top, x1, bottom) format
        bbox_vals = p.find_tables()[0].bbox

        # bottom edge of the top table; everything below it is the part to keep
        y0_top_table = bbox_vals[3]
        print(y0_top_table)

        # crop the full width, from the bottom of the top table to the bottom of the page
        cropped_page = p.crop((0, y0_top_table, 590, 840))

Output:

<Page:1>
269.64727650000003
<Page:2>
269.64727650000003
<Page:3>
269.64727650000003
<Page:4>
269.64727650000003
<Page:5>
269.64727650000003
<Page:6>
269.64727650000003
<Page:7>
269.64727650000003
<Page:8>
269.64727650000003
<Page:9>
269.64727650000003
<Page:10>
269.64727650000003
<Page:11>
269.64727650000003
<Page:12>
269.64727650000003
<Page:13>
269.64727650000003
<Page:14>
269.64727650000003
<Page:15>
269.64727650000003
<Page:16>
269.64727650000003
<Page:17>
269.64727650000003
<Page:18>
269.64727650000003
<Page:19>
269.64727650000003
<Page:20>
269.64727650000003

How do I append these cropped pages and save them as one PDF?

Update:

It seems that it is not possible to write or save a PDF file using pdfplumber, as per this discussion link.

(Not sure why this question was downvoted. Whoever does that should also provide an answer or a link to where this has already been answered.)

Update 2:

from pdfrw import PdfWriter
output_pdf =  PdfWriter() 

with pdfplumber.open(path) as pdf:
    pages = pdf.pages
    for p in pages:
        print(p)
        bbox_vals = p.find_tables()[0].bbox
        y0_top_table = bbox_vals[3]
        print(y0_top_table)
        cropped_pdf = p.crop((0, y0_top_table, 590, 840))
        print(type(cropped_pdf))
        output_pdf.addpage(cropped_pdf)

output_pdf.write(r"tests_cropped_file.pdf")

Output & Error:

<Page:1>
269.64727650000003
<class 'pdfplumber.page.CroppedPage'>

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[219], line 13
     11 cropped_pdf = p.crop((0, y0_top_table, 590, 840))
     12 print(type(cropped_pdf))
---> 13 output_pdf.addpage(cropped_pdf)

File c:\Users\vinee\anaconda3\envs\llma_py_3_12\Lib\site-packages\pdfrw\pdfwriter.py:270, in PdfWriter.addpage(self, page)
    268 def addpage(self, page):
    269     self._trailer = None
--> 270     if page.Type != PdfName.Page:
    271         raise PdfOutputError('Bad /Type:  Expected %s, found %s'
    272                              % (PdfName.Page, page.Type))
    273     inheritable = page.inheritable  # searches for resources

AttributeError: 'CroppedPage' object has no attribute 'Type'

Update 3:

It seems this issue of cropping a PDF and saving it was also raised in 2018 but had no solution, as per this discussion link.

If anyone knows a workaround, please let me know. I would really appreciate it!


Solution

  • pdfplumber 0.11.4, Pillow 9.5.0

    Actually, it is possible to crop pages and save them as a PDF with pdfplumber, but only if you don't need to extract data from the result afterwards, because the pages are saved as images.

    Let's say you want to give someone a depersonalized medical document for visual reference, with no further processing of the data expected. In this case, you could crop the pages and save them as images in a PDF as follows (note that in your sample document, the personal info is located within the first rectangle on each page):

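    import pdfplumber

    source_path = '.../sample_report.pdf'
    destination_path = 'data.pdf'

    cropped_pages = []
    with pdfplumber.open(source_path) as pdf:
        for page in pdf.pages:
            # keep the full page width
            x0, x1 = 0, page.width
            # keep everything from the bottom of the first rectangle to the end of the page
            y0, y1 = page.rects[0]['bottom'], page.height
            # render the cropped page; .annotated is a Pillow Image
            cropped_pages.append(page.crop((x0, y0, x1, y1))
                                 .to_image(resolution=400)
                                 .annotated)

    # save the first image as a PDF and append the remaining images as extra pages
    cropped_pages[0].save(destination_path,
                          save_all=True,
                          append_images=cropped_pages[1:])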
    

    This works because page.to_image().annotated is a Pillow Image object, which can be saved as a PDF with additional images passed through the append_images parameter (save_all=True is required in this case).
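If the text has to stay extractable, the image route does not help, since every page ends up as a raster image. One possible alternative, which is not part of the original answer, is to let pdfplumber find the cut position and let a separate library such as pypdf shrink each page's boxes and write the file. The sketch below is untested against the sample document and rests on several assumptions: pypdf is installed, pages are not rotated, and pdfplumber's top coordinate is measured from the top edge of the page's MediaBox. The names box_below_cut and save_cropped_pdf are hypothetical helpers introduced for illustration. Note also that shrinking the boxes only hides the top table; its content remains inside the file, so this is not a way to depersonalize a document.

```python
def box_below_cut(mediabox, cut_from_top):
    """Return the (left, bottom, right, top) box that keeps everything below the cut.

    mediabox is (left, bottom, right, top) in PDF points, origin at the bottom-left.
    cut_from_top is a pdfplumber-style distance measured down from the top edge.
    """
    left, bottom, right, top = mediabox
    new_top = top - cut_from_top
    if not bottom < new_top <= top:
        raise ValueError(f"cut {cut_from_top} lies outside the page")
    return (left, bottom, right, new_top)


def save_cropped_pdf(source_path, destination_path):
    """Hypothetical helper: crop every page below its first table, save one PDF."""
    # Third-party imports live here so box_below_cut stays usable without them.
    import pdfplumber
    from pypdf import PdfReader, PdfWriter
    from pypdf.generic import RectangleObject

    reader = PdfReader(source_path)
    writer = PdfWriter()
    with pdfplumber.open(source_path) as pdf:
        for plumber_page, pdf_page in zip(pdf.pages, reader.pages):
            tables = plumber_page.find_tables()
            if tables:
                # bbox is (x0, top, x1, bottom); index 3 is the table's bottom edge
                cut = tables[0].bbox[3]
                current = tuple(float(v) for v in pdf_page.mediabox)
                new_box = box_below_cut(current, cut)
                # set both boxes so viewers that honour either one agree
                pdf_page.mediabox = RectangleObject(new_box)
                pdf_page.cropbox = RectangleObject(new_box)
            # pages without a detected table are copied unchanged
            writer.add_page(pdf_page)
    writer.write(destination_path)
```

With the paths from the question, the call would be save_cropped_pdf(r"file-tests.pdf", r"tests_cropped_file.pdf"). The coordinate flip is kept in its own function because pdfplumber measures from the top of the page while PDF boxes are measured from the bottom, and that conversion is the step most likely to go wrong.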