python pdf text-mining

Is there any way to extract the header, footer, and title page of a PDF document?


Is there a package that can detect and extract the header, footer, or title page from a PDF document? I am new to text mining with Python. For example, could pdfminer.layout help to find individual text blocks in a PDF?


Solution

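  • Apache Tika extracts both the text content and the metadata of a PDF. From the metadata you can read fields such as author names, the title (or multiple titles), creation date, number of pages, modification dates, and many more. Which fields are present depends on the document.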

    import tika
    from tika import parser

    # Path to the PDF you want to parse
    filename = "your file name here"
    parsedPDF = parser.from_file(filename)
    print(parsedPDF['content'])   # extracted text of the document
    print(parsedPDF['metadata'])  # metadata, returned as a dictionary
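
  • On the pdfminer.layout part of the question: the following is a minimal sketch, not a tested recipe, of how text blocks could be labelled by their position on the page using pdfminer.six. It assumes that headers sit in the top 10% of the page and footers in the bottom 10%; that margin, and the helper names classify_block and iter_blocks, are illustrative choices, not part of pdfminer. Real documents will probably need the threshold tuned, and treating page 1 as the title page is likewise only an assumption.

```python
def classify_block(y0, y1, page_height, margin=0.1):
    """Label a text block by its vertical position on the page.

    PDF coordinates start at the bottom-left corner, so large y values
    are near the top. `margin` is an assumed fraction of the page height.
    """
    if y0 >= page_height * (1 - margin):
        return "header"   # the whole block lies in the top strip
    if y1 <= page_height * margin:
        return "footer"   # the whole block lies in the bottom strip
    return "body"


def iter_blocks(pdf_path):
    """Yield (page_number, label, text) for each text block in the PDF."""
    # Imported here so classify_block works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                _, y0, _, y1 = element.bbox
                label = classify_block(y0, y1, page.height)
                yield page_number, label, element.get_text().strip()
```

    Calling iter_blocks("your file name here") and keeping the blocks labelled "header" or "footer" gives candidate headers and footers; blocks whose text repeats across many pages are the most likely real ones. The blocks with page_number equal to 1 can serve as a first guess at the title page.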