python pdf scripting scrapy pypdf

Extract text from PDF File using Python with PyPDF2


I want to extract text from a given PDF (linked below).

The code used is:

from PyPDF2 import PdfFileReader


def extract_information(pdf_path):
    # Legacy PyPDF2 (pre-3.0) API: PdfFileReader, getNumPages, getPage, extractText
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        number_of_pages = pdf.getNumPages()
        # Print the text PyPDF2 extracts from each page, in order
        for page_number in range(number_of_pages):
            page = pdf.getPage(page_number)
            page_content = page.extractText()
            print(page_content)


if __name__ == '__main__':
    path = 'test.pdf'
    extract_information(path)

But when I run the code above, I get the following output:

PS E:\Omkar\Coding\Python\pdfSearch> python .\scrape.py
 !"#$%&!'()*+&,$ !")-!+)-. !"#$%$&'$%%()%*)(+(+$,-,.-+/ 0 1234#5$&3-6#3#1!4#5$78-$0#5"#3$9:;;#<$=-$%(+,(>(?/0&1(+$2(3)-4!+&)(@15#123"$ A8B-C9D;E:F0G$;@HFI%*,JJ>*%J/H F=-D2K#3B#=->.J*EKK4=- 1#L#342L#$M!152!K$M!1#$M&1NO?JP%%$D9QQ9;IR$SDTC$*E
;FM:0@HC$:FDDG$HU$%%/%?
V>%?W*%JPJ?++ A&3#=%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9DQ!Y=V?,,W>J/P/*,/H!Z#-X:ED@@G$0FM:E9DR#Y-$0C@S-$+*)%+)%..* A&3#-$*/>,,J(?*>F3$M!1#$@'-X:ED@@G$0FM:E9D$
E551#BB-(*?$M9CE;:[;RI$ET9$%S
 !42#34$FC-$,.>>J>?C2!"$M&5#B-M&N8$;#N&14\(+O?(?\>%O.
C!4#$M&]]#K4#5-I2Z#$M&]]#K4#5-
Q!B423"-$^_I2Z#5$[123#$M&]]#K42&3-$^$$$_
H&3$'!B423"-$^`$_T&]aZ#-
M!]]$;#Ba]4B-$^$$$_M&ZZ#34B- !42#34-F3Ba1!3K#-M]2#34-0#52K!25-0#52K!1#-;!2]1&!5$0M;-
F3Ba1#5$H!Z#-F3Ba1!3K#$ ]!3-9ZN]&8#1)61&aN$H!Z#- &]2K8=-61&aN) ]!3=-%()%+)(+%%-+>$!Z
`X:ED@@G$0FM:E9D$
;#]!42&3BA2N-R#]'bXJJ>(,H$$$5!+&1(+$2(3)-4!+&)(2(*6-!(,1$2(3)-4!+&)(%&!'()*&*$/)71*891,&41($2(3)-4!+&)(;VRRIW6US6;UDSMVS]&&5$Ma]4W
:-71-17$;1*+*
M9CE;:[;RI$HU$%%,%J
09I;@ D[R$0MC$^/>>%(_$ O@O$S@`$%.JJ$H9c$U@;X$HU$
%+%%J%.JJ
OM@TFC%.$RE;RPM@T($$`$$(+(/ A8O$H!Z#-C9D;E:F0G$;@HFI A8B2K2!3$$R2"3!4a1#-

I think this is related to the encoding used in the PDF, but I don't understand what is going wrong.

Google Drive link to the PDF used


Solution

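  • The output above suggests that the text embedded in this PDF cannot be decoded reliably, so one workaround is OCR: render each page to an image and recognize the text from the pixels. In my opinion the best free OCR engine is Tesseract (originally developed at Hewlett-Packard and later sponsored by Google). Install pytesseract and run it on images of the PDF pages. I highly recommend combining it with OpenCV so that OCR runs only on the regions that contain text.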

    https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
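
A minimal sketch of the OCR route, under a few assumptions that are not in the original answer: the pdf2image package (which needs Poppler installed) is used to render the pages, the Tesseract binary is installed so that pytesseract can find it, and the document is in English. The function names ocr_pdf and looks_garbled are made up for illustration, and looks_garbled is only a rough heuristic for deciding when the embedded text is unusable; the 0.5 threshold is a guess, not a tuned value.

```python
def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Render every page to an image and OCR it; returns one string per page."""
    # Imported lazily so this module still loads before the OCR stack is installed.
    from pdf2image import convert_from_path  # assumption: Poppler is on PATH
    import pytesseract  # assumption: the Tesseract binary is installed

    # convert_from_path returns a list of PIL images, one per page.
    images = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(image, lang=lang) for image in images]


def looks_garbled(text, threshold=0.5):
    """Hypothetical heuristic: True when fewer than `threshold` of the
    non-whitespace characters are letters or digits (or when there are none)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # No extractable text at all, e.g. a scanned page.
        return True
    alnum = sum(c.isalnum() for c in chars)
    return alnum / len(chars) < threshold
```

Calling ocr_pdf('test.pdf') should return a list with the recognized text of each page, which can be printed the same way as in the question. One possible use of looks_garbled is to keep the PyPDF2 text for pages where it is readable and call ocr_pdf only when it is not, since OCR is much slower than reading the embedded text.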