python excel pdf pdf-scraping

How to scrape data from PDF into Excel


I am trying to scrape data from a PDF and save it to an Excel file. This is the PDF I need: https://www.medicaljournals.se/acta/content_files/files/pdf/98/219/Suppl219.pdf

However, I don't need all of the data, only the following, saved to Excel in different cells: from page 5, starting from P001, up to and including the Introduction. That is, the P number, the title, the people's names, and the Introduction.

For now, I can only convert the PDF file into text (my code below) and save it all in one cell, but I need it to be separated into different cells.

import PyPDF2 as p2

PDFfile = open('Abstract Book from the 5th World Psoriasis and Psoriatic Arthritis Conference 2018.pdf', 'rb')
pdfread = p2.PdfFileReader(PDFfile)

pdflist = []

i = 6
while i<pdfread.getNumPages():
  pageinfo = pdfread.getPage(i)
  #print(pageinfo.extractText())
  i = i + 1

  pdflist.append(pageinfo.extractText().replace('\n', ''))

print(pdflist)

Solution

  • You mainly need two regexes: an 'article' regex (the letter 'P' followed by 3 digits) and a 'header' regex (a run of at least 15 UPPERCASE characters). One more regex splits the remaining text at any of the section keywords.

    import re

    import PyPDF2  # legacy PyPDF2 (< 3.0) API: PdfFileReader, getPage, extractText

    article_re = re.compile(r'P\d{3}')  # P001: letter 'P' and 3 digits
    # Title: a run of 15+ uppercase letters, whitespace (incl. '\n') or '-'.
    # The '|$' alternative makes re.split always return at least two parts (Python 3.7+).
    header_re = re.compile(r'[A-Z\s\-]{15,}|$')
    key_word_delimeters = ['Peoples', 'Introduction', 'Objectives', 'Methods', 'Results', 'Conclusions', 'References']

    with open('data.pdf', 'rb') as file:
        pdf = PyPDF2.PdfFileReader(file)
        text = ''
        for i in range(6, 63):
            text += pdf.getPage(i).extractText()  # all text in one variable

    articles = []

    for article in re.split(article_re, text):
        header = re.match(header_re, article)  # title at the start of the article
        other_text = re.split(header_re, article)[1]  # text after the title
        if header and header.group():  # skip chunks that do not start with a title
            header = header.group()            # get text from match
            first_name_letter = header[-1]     # the match also swallows the first (uppercase) letter of the first name
            header = header[:-1]               # cut that letter off the title
            header = header.replace('\n', '')  # delete line breaks
            header = header.replace('-', '')   # delete hyphens left by line wrapping
            item = {'header': header}          # store the cleaned title
            other_text = first_name_letter + other_text  # give the letter back to the name
            data_array = re.split(
                'Introduction:|Objectives:|Methods:|Results:|Conclusions:|References:',
                other_text)

            # data_array[0] holds the names; the rest pair up with the keywords in order
            for key, data in zip(key_word_delimeters, data_array):
                item[key] = data.replace('\n', '')
            articles.append(item)
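
The code above stops once `articles` is filled, while the question also asks for each field in its own cell. Here is a minimal sketch of that last step, assuming the standard-library `csv` module is acceptable (Excel opens CSV files directly) and that only the title, the names, and the Introduction are wanted. `save_articles`, `COLUMNS`, and the sample records are made up for illustration and are not part of the answer above.

```python
import csv
import io

# Hypothetical column choice: the fields the question asks for that the
# loop above stores (the keys 'header', 'Peoples', 'Introduction').
COLUMNS = ['header', 'Peoples', 'Introduction']


def save_articles(articles, stream):
    """Write one row per article and one column per field to a text stream."""
    writer = csv.DictWriter(
        stream,
        fieldnames=COLUMNS,
        restval='',             # leave the cell empty when a field is missing
        extrasaction='ignore',  # skip keys such as 'Methods' that are not exported
    )
    writer.writeheader()
    writer.writerows(articles)


# Made-up placeholder records, shaped like the dicts appended to `articles`.
sample_articles = [
    {'header': 'SAMPLE TITLE ONE', 'Peoples': 'A. Author, B. Author',
     'Introduction': 'Placeholder text.', 'Methods': 'Not exported.'},
    {'header': 'SAMPLE TITLE TWO', 'Peoples': 'C. Author'},
]

buffer = io.StringIO(newline='')  # in-memory stand-in for a file
save_articles(sample_articles, buffer)
print(buffer.getvalue())
```

To create a real file, pass `open('articles.csv', 'w', newline='', encoding='utf-8-sig')` instead of the `StringIO` buffer; the `utf-8-sig` encoding writes a byte-order mark that helps Excel detect UTF-8. If a native .xlsx file is needed, `pandas.DataFrame(articles).to_excel(...)` is one alternative, assuming pandas and openpyxl are installed.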