python parsing pdf cp1251

Parsing cp1251 pdf to text in python


Is there any way to extract text from a PDF file that contains Russian text (cp1251)?

I am using the pdfminer package to parse PDF files. I tried specifying the encoding in the argument to the pdfminer.converter.TextConverter class, but it didn't help.


Solution

  • If you want to parse the text further once you have extracted it from the PDF file, you would need python... So just extract the text first, without converting it, and save it in a txt file.

    You may use pdf2txt for this purpose (on Ubuntu: http://manpages.ubuntu.com/manpages/precise/man1/pdf2txt.1.html)

    Then open the file with Python and convert the text from cp1251 to utf-8; the accepted answer here shows how to do it:

    How to convert a string from CP-1251 to UTF-8?

    Then parse...
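
The conversion step described above could look roughly like the sketch below. It assumes the text file written by pdf2txt really is cp1251-encoded, which depends on the PDF and on how the text was extracted. The function name and the file names are hypothetical and are not part of pdfminer.

```python
import io


def convert_cp1251_to_utf8(src_path, dst_path):
    """Re-encode a cp1251 text file as UTF-8 and return the decoded text.

    Hypothetical helper: assumes src_path holds cp1251-encoded text.
    """
    # Decode the raw bytes as cp1251 while reading.
    with io.open(src_path, "r", encoding="cp1251") as src:
        text = src.read()

    # Write the same text back out, encoded as UTF-8.
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)

    return text
```

For example, `text = convert_cp1251_to_utf8("extracted.txt", "extracted_utf8.txt")` would give you a Unicode string to parse, plus a UTF-8 copy on disk. If the result still looks garbled, the extracted file was probably not cp1251 in the first place.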