python pdf-scraping

No text is returned when PyPDF2 is used to scrape a one-page PDF


I have downloaded a bunch of PDFs from this source: http://ec.europa.eu/growth/tools-databases/cosing/index.cfm?fuseaction=search.detailsPDF_v2&id=28157

Now I want to scrape the PDFs using PyPDF2, but no text is returned.

I tested the code with another PDF and it worked without a problem.

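import os
import PyPDF2  # the calls below use the legacy (pre-3.0) PyPDF2 API

folder = 'C:/Users/NAME.NAME/Downloads/Eu/T/'
all_files = os.listdir(folder)
count = 0
count2 = 0
for filename in all_files:
    count += 1
    file_path = folder + filename
    # Open in binary mode; the with block closes the file after reading
    with open(file_path, 'rb') as pdf_obj:
        pdf_reader = PyPDF2.PdfFileReader(pdf_obj)
        num_pages = pdf_reader.numPages
        current_page = 0
        text2 = ""
        page_obj = pdf_reader.getPage(current_page)
        text2 += page_obj.extractText()  # no text comes back for these PDFs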

Solution

  • This is because PyPDF2 is an inconsistent scraper. Not all PDFs are built the same way, so depending on how a PDF was produced, PyPDF2 may or may not be able to extract its text.

    Usually when I am scraping PDFs, I have to switch between PyPDF2, pdfminer, and slate3k depending on whether PyPDF2 returns any text. I start with PyPDF2 since it is the easiest, in my opinion.

    My order of robustness (how well the package can scrape PDFs), most robust first:

    1.) pdfminer

    2.) slate3k

    3.) PyPDF2
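
    The switching described above can be automated with a small fallback helper. The following is only a sketch under some assumptions: the helper names (first_nonempty_text, extract_with_pypdf2, extract_with_pdfminer) are made up for illustration, the PyPDF2 wrapper uses the legacy pre-3.0 calls from the question, and the pdfminer wrapper assumes the pdfminer.six package with its pdfminer.high_level.extract_text function is installed.

```python
from typing import Callable, Iterable, Optional


def extract_with_pypdf2(path: str) -> str:
    """First-page text via PyPDF2 (legacy pre-3.0 API, as in the question)."""
    import PyPDF2  # imported lazily so the helper below works without it

    with open(path, "rb") as f:
        reader = PyPDF2.PdfFileReader(f)
        return reader.getPage(0).extractText()


def extract_with_pdfminer(path: str) -> str:
    """Whole-document text via pdfminer.six (assumed to be installed)."""
    from pdfminer.high_level import extract_text

    return extract_text(path)


def first_nonempty_text(
    path: str, extractors: Iterable[Callable[[str], str]]
) -> Optional[str]:
    """Try each extractor in order; return the first non-blank result."""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # A missing library or an unparsable PDF: move on to the next one
            continue
        if text and text.strip():
            return text
    return None  # no extractor produced any text
```

    Calling first_nonempty_text(path, [extract_with_pypdf2, extract_with_pdfminer]) would then try the easy option first and only fall back to pdfminer when PyPDF2 returns nothing. If every extractor comes back empty, the PDF may not contain a text layer at all.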

    Using slate3k:

    import glob
    import slate3k as slate

    all_files = r'C:/Users/NAME.NAME/Downloads/Eu/T/*.pdf'
    for filename in glob.glob(all_files):
        with open(filename, 'rb') as f:
            pdf_text = slate.PDF(f)  # list-like object, one string per page
            print(pdf_text)
    

    Using pdfminer:

    import glob
    import io
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage


    def convert_pdf_to_txt(path):
        """Return the text of every page in the PDF at `path` as one string."""
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()  # collects the extracted text
        laparams = LAParams()   # default layout-analysis settings
        device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        with open(path, 'rb') as fp:
            # An empty page-number set with maxpages=0 means "process all pages"
            for page in PDFPage.get_pages(fp, set(), maxpages=0,
                                          password="",
                                          caching=True,
                                          check_extractable=True):
                interpreter.process_page(page)

        text = retstr.getvalue()
        device.close()
        retstr.close()
        return text


    all_files = r'C:/Users/NAME.NAME/Downloads/Eu/T/*.pdf'

    for file_path in glob.glob(all_files):
        # convert_pdf_to_txt returns the text, so print (or store) it
        print(convert_pdf_to_txt(file_path))
    
     
    

    You may need to adapt these functions to get the text in the format you want. As I said, since PDFs can be built in so many ways, the extracted text can come out in many different forms. But this should point you in the right direction.
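
    As one example of that kind of adjustment, extracted text often has uneven spacing and long runs of blank lines. The sketch below tidies that up with the standard library only. The function name normalize_pdf_text is hypothetical, and the cleanup rules are just one reasonable choice, not something any of the libraries above prescribe.

```python
import re


def normalize_pdf_text(raw: str) -> str:
    """Tidy whitespace in text extracted from a PDF."""
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", raw)
    # Trim leading and trailing whitespace on every line
    text = "\n".join(line.strip() for line in text.splitlines())
    # Keep at most one blank line between blocks of text
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

    The result of any of the extraction functions above can be passed through it, for example normalize_pdf_text(convert_pdf_to_txt(file_path)).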