Is there any package that can detect and extract the header, footer, or title page from a PDF document? I am new to text mining with Python, and I would like to know whether, for example, pdfminer.layout could help find text blocks in PDFs.
Apache Tika also does metadata extraction. Depending on what the PDF actually stores, you can extract author names, one or more titles, creation and modification dates, the number of pages, and many more fields.
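# Requires the tika package; it talks to a Java-based Tika server, so Java must be installed.
from tika import parser
filename = "your file name here"
# The variable name must match the one defined above (the original passed an undefined file_name).
parsedPDF = parser.from_file(filename)
print(parsedPDF['content'])   # extracted text as one string
print(parsedPDF['metadata'])  # metadata in dictionary format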
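On the pdfminer.layout part of the question: yes, the layout analysis in pdfminer.six exposes text blocks together with their bounding boxes, and position is one simple way to guess at headers and footers. The following is only a rough sketch, not a reliable detector. It assumes pdfminer.six is installed, that the page's coordinate origin is at the bottom-left corner (0, 0), and that anything lying entirely in the top or bottom 10% of the page is a header or footer. The names `classify_block`, `iter_blocks`, and the `margin` parameter are made up for this example.

```python
def classify_block(y0, y1, page_height, margin=0.1):
    """Label a text block by vertical position.

    y0 and y1 are the bottom and top edges of the block in PDF points,
    measured from the bottom of the page. margin is an assumed fraction
    of the page height treated as the header or footer band.
    """
    if y0 >= page_height * (1 - margin):
        return "header"
    if y1 <= page_height * margin:
        return "footer"
    return "body"


def iter_blocks(pdf_path):
    """Yield (page number, label, text) for every text block in the PDF."""
    # Imported here so classify_block can be tried without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_no, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                label = classify_block(element.y0, element.y1, page_layout.height)
                yield page_no, label, element.get_text().strip()
```

To use it, loop over `iter_blocks("your file name here")` and keep the blocks whose label is "header" or "footer". A block whose text repeats on most pages is a stronger sign of a real header or footer than position alone, so you may want to combine both checks. The first page's blocks are also a reasonable starting point for finding a title page.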