python python-3.x pdfplumber

How to stop pdfplumber from reading the header of every page?


I want pdfplumber to extract the text from an arbitrary PDF supplied by the user. The problem is that pdfplumber also extracts the header text (the title) from each page. How can I make pdfplumber skip the page headers (titles) and the page numbers (or the whole footer, if possible)?

Here is the code:

import pdfplumber

all_text = ""

pdf = pdfplumber.open(file)
for pdf_page in pdf.pages:
    one = pdf_page.extract_text()
    all_text = all_text + '\n' + str(one)
    print(all_text)

where file is the PDF document.


Solution

  • I don't think you can.

    However, you can crop each page with the crop method and extract the text from the cropped region only, which leaves out the headers and footers. Of course, this requires that you know the height of the headers and footers in advance.

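    Here is the explanation of the crop coordinates. In the code, each one is a fraction (between 0 and 1) of the page width or height:

    x0 = distance of the left edge of the crop box from the left side of the page (fraction of page width).
    top = distance of the top edge of the crop box from the top of the page (fraction of page height).
    x1 = distance of the right edge of the crop box from the left side of the page (fraction of page width).
    bottom = distance of the bottom edge of the crop box from the top of the page (fraction of page height).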
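As a quick check of the arithmetic, here is a minimal sketch of how such fractions scale to absolute page units. The helper name fractional_bbox and the 600 x 800 page size are made up for illustration; they are not part of pdfplumber.

```python
def fractional_bbox(page_width, page_height, fractions):
    """Scale (x0, top, x1, bottom) fractions in [0, 1] to absolute page units."""
    x0, top, x1, bottom = fractions
    if not (0 <= x0 < x1 <= 1 and 0 <= top < bottom <= 1):
        raise ValueError("need 0 <= x0 < x1 <= 1 and 0 <= top < bottom <= 1")
    width, height = float(page_width), float(page_height)
    return (x0 * width, top * height, x1 * width, bottom * height)


# Hypothetical 600 x 800 page: drop the top eighth and the bottom eighth.
print(fractional_bbox(600, 800, (0.0, 0.125, 1.0, 0.875)))
# -> (0.0, 100.0, 600.0, 700.0)
```

The resulting tuple has the (x0, top, x1, bottom) order that page.crop expects for its bounding box.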
    

    Here is the code:

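    import pdfplumber

    # Get text of whole document as string.
    # crop_coords holds fractions of the page size. These are example values
    # (drop the top 10% and bottom 10% of every page); adjust them to your PDF.
    crop_coords = [0.0, 0.1, 1.0, 0.9]  # [x0, top, x1, bottom]
    text = ''
    pages = []
    with pdfplumber.open(filename) as pdf:  # filename is the path to the PDF
        for page in pdf.pages:
            my_width = float(page.width)
            my_height = float(page.height)
            # Crop pages: convert the fractions to absolute page coordinates
            my_bbox = (crop_coords[0]*my_width, crop_coords[1]*my_height, crop_coords[2]*my_width, crop_coords[3]*my_height)
            page_crop = page.crop(bbox=my_bbox)
            # extract_text() can return None for a page without text
            text = text + (page_crop.extract_text() or '').lower()
            pages.append(page_crop)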
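
If the header and footer heights are not known in advance, one possible workaround is to post-process the extracted text instead of cropping. The sketch below is a plain-Python heuristic, not a pdfplumber feature: it assumes the header is the first line and the footer is the last line of each page's text, and it drops those lines when they repeat on most pages. The function name strip_repeated_edges and the min_share threshold are hypothetical choices for illustration.

```python
import re
from collections import Counter


def strip_repeated_edges(page_texts, min_share=0.6):
    """Drop first/last lines that repeat on at least min_share of the pages."""
    if len(page_texts) < 2:
        # Nothing to compare against, so leave a single page untouched.
        return list(page_texts)

    def norm(line):
        # Collapse digit runs so "Page 3" and "Page 4" count as the same line.
        return re.sub(r"\d+", "#", line.strip())

    pages = [text.splitlines() for text in page_texts]
    non_empty = [lines for lines in pages if lines]
    firsts = Counter(norm(lines[0]) for lines in non_empty)
    lasts = Counter(norm(lines[-1]) for lines in non_empty)
    threshold = min_share * len(pages)

    cleaned = []
    for lines in pages:
        if lines and firsts[norm(lines[0])] >= threshold:
            lines = lines[1:]   # looks like a repeated header
        if lines and lasts[norm(lines[-1])] >= threshold:
            lines = lines[:-1]  # looks like a repeated footer or page number
        cleaned.append("\n".join(lines))
    return cleaned


sample = [
    "My Report\nAlpha text\nPage 1",
    "My Report\nBeta text\nPage 2",
    "My Report\nGamma text\nPage 3",
]
print(strip_repeated_edges(sample))
# -> ['Alpha text', 'Beta text', 'Gamma text']
```

Here page_texts would be the list of per-page strings returned by extract_text(). The heuristic will miss headers that span several lines or that change from page to page, so cropping remains the more predictable option when the layout is known.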