I have multiple PDF files, all in the same format, from which I need to extract the author names. The names should come only from the first page of each PDF; all other pages should be ignored.
Here is the link to the PDF file.
Below is an image showing what the first page of the PDF looks like.
I need to extract the author names, which are shown in bold. I am using the code below to extract the text:
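import PyPDF2
import re
# Placeholder path to the PDF
file = 'pdf_file'
reader = PyPDF2.PdfReader(file)
# Only the first page is needed
page = reader.pages[0]
pdf_text_from_paper = page.extract_text()
# Capture whatever sits between curly braces, which is where the emails appear
emails_pattern = r"\{([^}]+)\}"
email_matches = re.findall(emails_pattern, pdf_text_from_paper)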
I was able to extract the emails but not the names. Can anyone tell me how to extract the names?
I am not certain that this will work for all of your PDFs, but it at least works for the one you linked to in your question. If they all share the same format, it could work on the others as well:
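# Four whitespace characters, then a name that is not the "Introduction" heading, then two more
pattern = re.compile(r'\s{4}(?!Introduction)(\w+\s\w*?\.?\s?\w*?)\s{2}')
# Search the extracted text (a string), not the page object itself
matches = pattern.findall(pdf_text_from_paper)
print(matches)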
Output:
['Jiaheng Xie', 'Xiao Liu', 'Daniel Dajun Zeng', 'Xiao Fang']
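To make the pieces of that pattern easier to follow, here is a minimal sketch that runs it on a made-up string instead of a real PDF. The layout of `sample_text` (four leading spaces before each name, two trailing spaces after it) is only an assumption about what `extract_text()` returns for this kind of page, and the names are invented.

```python
import re

# Same pattern as above: four whitespace characters, a name that is not
# the "Introduction" heading, then two whitespace characters.
pattern = re.compile(r'\s{4}(?!Introduction)(\w+\s\w*?\.?\s?\w*?)\s{2}')

# Hypothetical stand-in for page.extract_text(); not taken from a real PDF.
sample_text = (
    "A Made-Up Title\n"
    "    Jane Doe  \n"
    "    John Q. Public  \n"
    "    Introduction  \n"
    "Body text.\n"
)

matches = pattern.findall(sample_text)
print(matches)
```

The negative lookahead `(?!Introduction)` is what keeps the section heading out of the results even though it is indented like the names. If your PDFs extract with different spacing, the `\s{4}` and `\s{2}` counts would need adjusting.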
This pattern works on both PDFs you linked to:
import re
import PyPDF2
pattern = re.compile(r'1\s+?(\w+\s\w*?\.?\s?\w*?)\s*?\n|\{\w+?.*?@.*?\}\s+?(\w+\s\w*?\.?\s?\w*?)\s*?\n')
for doc in ["document1.pdf", "document2.pdf"]:
    reader = PyPDF2.PdfReader(doc)
    page = reader.pages[0]
    text = page.extract_text()
    matches = pattern.findall(text)
    # Each match is a tuple with one empty group; keep only the non-empty names
    print([j for i in matches for j in i if j])
OUTPUT:
['Jiaheng Xie', 'Xiao Liu', 'Daniel Dajun Zeng', 'Xiao Fang']
['Honglin Deng', 'Weiquan W ang', 'Siyuan Li', 'Kai H. Lim']
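In case the list comprehension looks odd: because the pattern has two capture groups joined by `|`, `re.findall` returns one tuple per match, with an empty string for the alternative that did not take part. The sketch below shows this on a made-up string; `sample_text` and the names in it are invented for illustration and only assume that the first author follows a footnote marker `1` and each later author follows the previous author's `{...@...}` email.

```python
import re

# Same pattern as in the answer, split across two lines for readability.
pattern = re.compile(
    r'1\s+?(\w+\s\w*?\.?\s?\w*?)\s*?\n'
    r'|\{\w+?.*?@.*?\}\s+?(\w+\s\w*?\.?\s?\w*?)\s*?\n'
)

# Hypothetical stand-in for page.extract_text(); not taken from a real PDF.
sample_text = (
    "A Made-Up Paper Title1\n"
    "Jane Doe\n"
    "{jane@example.org}\n"
    "John Q. Public\n"
    "{john@example.org}\n"
    "Abstract: this line is not an author.\n"
)

# One tuple per match: (group 1, group 2), with '' for the unused group.
raw_matches = pattern.findall(sample_text)
print(raw_matches)

# Flatten the tuples and drop the empty strings.
authors = [name for groups in raw_matches for name in groups if name]
print(authors)
```

Note that the pattern accepts any short run of words after an email, so a one- to three-word line directly after the last email could be picked up as a name; it is worth checking the output against a few of your PDFs.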