python-3.x pdf pypdf pdf-extraction

Extract author names from a PDF using Python


I have multiple PDF files, all in the same format, and I need to extract the author names from each one. The names should come only from the first page of each PDF; all other pages should be ignored.

Here is the link to the PDF file.

The first page of the PDF looks like this:

[image: first page of the PDF]

I need to extract the author names, which appear in bold. I am using the code below:

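import re
import PyPDF2
file = 'pdf_file'  # path to the PDF
reader = PyPDF2.PdfReader(file)
page = reader.pages[0]  # only the first page is needed
pdf_text_from_paper = page.extract_text()
# The e-mail addresses appear inside curly braces; capture the text between them.
emails_pattern = r"\{([^}]+)\}"
email_matches = re.findall(emails_pattern, pdf_text_from_paper)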

I am able to extract the emails but not the names. Can anyone tell me how to extract the names?


Solution

  • I am not certain that this will work for all of your PDFs, but it works for the one linked in your question, and if they all share the same format it may work for the others as well:

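    # Continues from the question's code, where page = reader.pages[0].
    text = page.extract_text()  # findall needs a string, not the page object
    pattern = re.compile(r'\s{4}(?!Introduction)(\w+\s\w*?\.?\s?\w*?)\s{2}')
    matches = pattern.findall(text)
    print(matches)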
    

    OUTPUT:

    ['Jiaheng Xie', 'Xiao Liu', 'Daniel Dajun Zeng', 'Xiao Fang']
    

    EDIT

    This pattern works on both PDFs you linked to.

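    import re
    import PyPDF2
    # First alternative: a name that follows a "1"; second: a name that follows a {...@...} block.
    pattern = re.compile(r'1\s+?(\w+\s\w*?\.?\s?\w*?)\s*?\n|\{\w+?.*?@.*?\}\s+?(\w+\s\w*?\.?\s?\w*?)\s*?\n')
    for doc in ["document1.pdf", "document2.pdf"]:
        reader = PyPDF2.PdfReader(doc)
        page = reader.pages[0]  # the author names are on the first page only
        text = page.extract_text()
        matches = pattern.findall(text)
        # findall returns one (group1, group2) tuple per match; keep the non-empty group
        print([j for i in matches for j in i if j])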
    

    OUTPUT:

    ['Jiaheng Xie', 'Xiao Liu', 'Daniel Dajun Zeng', 'Xiao Fang']
    ['Honglin Deng', 'Weiquan W ang', 'Siyuan Li', 'Kai H. Lim']
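
To see why the list comprehension in the edited answer is needed, here is a minimal sketch that runs the same pattern on a made-up string instead of a real PDF. The sample text, the names in it, and the helper `extract_author_names` are all hypothetical; the layout only imitates what the pattern seems to expect (a name after a literal "1", presumably a footnote marker on the title, and further names after a `{...@...}` e-mail block). The text that `extract_text()` returns for your files may well differ.

```python
import re

# Same regular expression as in the answer, split into its two alternatives.
PATTERN = re.compile(
    # Alternative 1: a name that follows a literal "1" and some whitespace.
    r'1\s+?(\w+\s\w*?\.?\s?\w*?)\s*?\n'
    # Alternative 2: a name that follows a {...@...} e-mail block.
    r'|\{\w+?.*?@.*?\}\s+?(\w+\s\w*?\.?\s?\w*?)\s*?\n'
)


def extract_author_names(text):
    """Hypothetical helper: return the non-empty capture group of every match."""
    # With two capture groups, findall yields (group1, group2) tuples in which
    # the group belonging to the alternative that did not match is ''.
    return [name for groups in PATTERN.findall(text) for name in groups if name]


# Made-up stand-in for page.extract_text(); not taken from any real PDF.
SAMPLE_TEXT = (
    "A Title1 \n"
    "Alice Smith \n"
    "{alice@example.com} \n"
    "Bob K. Jones \n"
    "{bob@example.com}\n"
)

print(PATTERN.findall(SAMPLE_TEXT))       # [('Alice Smith', ''), ('', 'Bob K. Jones')]
print(extract_author_names(SAMPLE_TEXT))  # ['Alice Smith', 'Bob K. Jones']
```

Each match fills only one of the two groups, so the raw `findall` result contains empty strings; filtering them out leaves a flat list of names. With a real file you would pass `reader.pages[0].extract_text()` to the helper instead of the sample string.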