I have downloaded a bunch of PDFs from this source: http://ec.europa.eu/growth/tools-databases/cosing/index.cfm?fuseaction=search.detailsPDF_v2&id=28157
Now I want to scrape the PDFs using PyPDF2, but no text is returned.
I tested the code with another pdf and it worked without a problem.
import os
import PyPDF2

all_files = os.listdir('C:/Users/NAME.NAME/Downloads/Eu/T/')
count = 0
count2 = 0
for filenames in all_files:
    count += 1
    file_path = 'C:/Users/NAME.NAME/Downloads/Eu/T/' + filenames
    pdf_obj = open(file_path, 'rb')
    pdf_reader = PyPDF2.PdfFileReader(pdf_obj)
    num_pages = pdf_reader.numPages
    current_page = 0
    text2 = ""
    # Only the first page is read here
    pageObj = pdf_reader.getPage(current_page)
    text2 += pageObj.extractText()
This is because PyPDF2 is an inconsistent scraper. Not all PDFs are built the same way, so depending on how a given PDF was structured, PyPDF2 may or may not be able to extract its text.
When I scrape PDFs, I usually have to switch between PyPDF2, pdfminer, and slate3k, depending on whether PyPDF2 returns any text. I start with PyPDF2 because it is the easiest to use, in my opinion.
My ranking by robustness (how well each package can scrape PDFs), most robust first:
1.) pdfminer
2.) slate3k
3.) PyPDF2
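The switching described above can be sketched as a small fallback loop: try each extractor in order and keep the first one that returns non-empty text. This is only a sketch; `extract_with_fallback` and the three stand-in extractors are hypothetical names made up for illustration, and in practice you would pass in your own wrappers around PyPDF2, slate3k, and pdfminer.

```python
from typing import Callable, Iterable, Optional, Tuple

# An extractor is any callable that takes a file path and returns text.
Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str, extractors: Iterable[Tuple[str, Extractor]]
) -> Tuple[Optional[str], str]:
    """Try each (name, extractor) pair in order; return the first non-empty result.

    Returns (name_of_extractor_that_worked, text), or (None, "") if none worked.
    """
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # A parser that raises on this file is treated like one that
            # returned nothing; move on to the next one.
            continue
        if text and text.strip():
            return name, text
    return None, ""


# Stand-in extractors for illustration only (hypothetical, not real libraries).
def empty_extractor(path: str) -> str:
    return ""  # mimics a library that finds no text


def failing_extractor(path: str) -> str:
    raise ValueError("cannot parse")  # mimics a library that raises


def working_extractor(path: str) -> str:
    return "text from " + path


used, text = extract_with_fallback(
    "example.pdf",
    [
        ("first", empty_extractor),
        ("second", failing_extractor),
        ("third", working_extractor),
    ],
)
print(used, text)
```

Returning the name alongside the text makes it easy to log which package ended up handling each file.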
Using slate3k:
import glob
import slate3k as slate

all_files = r'C:/Users/NAME.NAME/Downloads/Eu/T/*.pdf'
for filenames in glob.glob(all_files):
    with open(filenames, 'rb') as f:
        # slate.PDF returns a list-like object with one string per page
        pdf_text = slate.PDF(f)
        print(pdf_text)
Using pdfminer:
import glob
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()  # collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

all_files = r'C:/Users/NAME.NAME/Downloads/Eu/T/*.pdf'
for files in glob.glob(all_files):
    # Print the result so the extracted text is not silently discarded
    print(convert_pdf_to_txt(files))
You may need to adapt these functions to get the text in the format you want. As I said, PDFs can be built in many different ways, so the extracted text can come out in many different layouts. But this should point you in the right direction.
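As one example of that kind of adjustment, here is a minimal post-processing sketch. It assumes the extracted text separates pages with form feed characters (`\f`), which is what pdfminer's text output usually looks like, and that you want one tidy string per page. The function name `clean_extracted_text` and the sample string are made up for illustration.

```python
import re


def clean_extracted_text(raw: str) -> list:
    """Split raw text into pages on form feeds and tidy the whitespace on each page."""
    pages = []
    for page in raw.split("\f"):
        # Collapse runs of spaces/tabs, strip each line, and drop blank lines.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in page.splitlines()]
        lines = [line for line in lines if line]
        if lines:
            pages.append("\n".join(lines))
    return pages


# Made-up sample standing in for raw extractor output.
sample = "Field   one:  value \n\n second   line\n\fPage   two\n\f"
pages = clean_extracted_text(sample)
for number, page in enumerate(pages, start=1):
    print(number, repr(page))
```

If your extractor does not emit form feeds, the whole document simply comes back as a single page, so the function is still safe to call.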