I am trying to extract only the core text from a "rich" PDF document, meaning one that has a lot of tables, graphs, boxes, footers etc. that I am not interested in.
I tried some common Python packages like PyPDF2, pdfplumber and pdfreader. The problem is that they apparently extract all the text present in the PDF, including the parts listed above that I am not interested in.
As an example:
from PyPDF2 import PdfReader
reader = PdfReader("document.pdf")  # placeholder path to the PDF
page = reader.pages[10]  # zero-based index, so this is page 11
text = page.extract_text()
This code gets me the whole text of page 11, including footers, boxes, text from tables and the page number, while what I would like is only the core text.
Unfortunately, the only solution I have found so far is to copy and paste the core text into another file.
Is there any method/package that can automatically distinguish the main text from the other parts of the PDF and return only that?
Thank you for your help!!!
Per D.L's comment, please add some reproducible code and, preferably, a PDF to work with.
However, I think I can answer at least part of your question. jsvine's pdfplumber is an incredibly robust Python package for PDF processing. It has bounding box functionality that lets you extract text from within (.within_bbox(...)) or from outside (.outside_bbox(...)) a 'bounding box', i.e. a rectangular area delineated on the Page object. Every character object extracted from the page contains location information such as y1 (distance of top of character from bottom of page) and x0 (distance of left side of character from left side of page). If the majority of pages within the .pdf you are trying to extract text from contain footnotes, I would recommend extracting only the text above a fixed vertical cutoff. Given that footnotes typically sit near the bottom of a page (academic papers using Chicago-style citations are an exception), you should be able to set one standard bounding box for the text you want: either extract within a box that does not include the footnotes, or outside a box that contains them.
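The bounding-box idea above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the helper names body_bbox and extract_body_text and the default header/footer fractions are assumptions made up for this example and will need tuning per document. Note that pdfplumber bounding boxes are (x0, top, x1, bottom) tuples, with top and bottom measured from the top of the page.

```python
def body_bbox(page_bbox, header_frac=0.08, footer_frac=0.10):
    """Shrink a page bbox (x0, top, x1, bottom) by a header and a footer band.

    header_frac and footer_frac are hypothetical defaults: the share of the
    page height to discard at the top and at the bottom.
    """
    x0, top, x1, bottom = page_bbox
    height = bottom - top
    return (x0, top + height * header_frac, x1, bottom - height * footer_frac)


def extract_body_text(pdf_path, page_index, header_frac=0.08, footer_frac=0.10):
    """Extract only the text inside the body area of one page."""
    import pdfplumber  # imported lazily so body_bbox works without pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        bbox = body_bbox(page.bbox, header_frac, footer_frac)
        # within_bbox keeps only the objects that fall fully inside the box
        return page.within_bbox(bbox).extract_text() or ""
```

For example, extract_body_text("document.pdf", 10) would return the text of page 11 without the top 8% and bottom 10% of the page, where "document.pdf" is a placeholder path.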
To your question about tables: that is trickier. Tables are by far the trickiest thing to detect and/or extract from. pdfplumber offers, to my knowledge, the most robust open source table detection/extraction capabilities out there. To extract the text outside a table, I would call .find_tables(...) on each Page object, take the .bbox of each table it returns, and extract around those. However, this is not perfect: it is not always able to detect tables.
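One possible way to "extract around" the detected tables is to drop every object whose centre lies inside a table's bounding box before extracting text. The sketch below assumes a pdfplumber Page object is passed in; the helper names (center_in_bbox, outside_all, extract_text_without_tables) are invented for illustration, and the result is only as good as the table detection itself.

```python
def center_in_bbox(obj, bbox):
    """True if the centre of a page object lies inside bbox (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def outside_all(obj, bboxes):
    """True if the object's centre is outside every bbox in bboxes."""
    return not any(center_in_bbox(obj, bbox) for bbox in bboxes)


def extract_text_without_tables(page):
    """Extract the text of a pdfplumber Page, skipping detected tables."""
    # find_tables() returns Table objects; each exposes its area as .bbox
    table_bboxes = [table.bbox for table in page.find_tables()]
    # filter() keeps only the objects for which the test function is True
    kept = page.filter(lambda obj: outside_all(obj, table_bboxes))
    return kept.extract_text() or ""
```

Testing the centre of each character, rather than requiring it to lie fully outside the box, avoids losing characters that merely touch a table border.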
Regarding your 3rd question, how to exclude boxes, are you referring to text boxes? Please provide further clarification!
Finally -- to reiterate my first point -- pdfplumber is an incredibly robust package. That being said, extracting text from .pdf files is really tough. Good luck -- please provide more information and I will be happy to help as best I can.