python pdf encoding pdfminer

PDFminer gives strange letters


I am using Python 2.7 and PDFMiner to extract text from PDFs. I noticed that PDFMiner sometimes gives me words with strange letters where PDF viewers do not. For some PDF documents the result returned by PDFMiner and by PDF viewers is the same (strange), but there are documents where PDF viewers can recognize the text (copy-paste) and PDFMiner cannot. Here is an example of the returned values:

from pdf viewer: ‫فتــح بـــاب ا�ستيــراد البيــ�ض والدجــــاج المجمـــد‬

from PDFMiner: óªéªdG êÉ````LódGh ¢†``«ÑdG OGô``«à°SG ÜÉH í``àa

So my question is: can I get the same result as the PDF viewer, and what is wrong with PDFMiner? Is it missing encodings? I don't know.


Solution

  • Yes.

    This happens when custom font encodings have been used (e.g. Identity-H, Identity-V, etc.) but the fonts have not been embedded properly.

    PDFMiner gives garbage output in such cases because the encoding is required to interpret the text.
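
To illustrate the point about the encoding being required, here is a minimal sketch in plain Python. It does not use PDFMiner's API, and the mapping table `FONT_TO_UNICODE` and both helper functions are hypothetical names invented for this example; in a real PDF the equivalent mapping would come from the font (for instance its ToUnicode CMap). The Latin-1 fallback is likewise only an assumption about what an extractor might do when no usable mapping is available.

```python
# -*- coding: utf-8 -*-

# Hypothetical font-specific table: character code -> Unicode character.
# These three entries are invented for illustration only.
FONT_TO_UNICODE = {
    0xE1: u"\u0641",  # ARABIC LETTER FEH
    0xE2: u"\u062A",  # ARABIC LETTER TEH
    0xE3: u"\u062D",  # ARABIC LETTER HAH
}


def decode_with_font_map(raw, table):
    """Map each byte through the font's table; unknown codes become U+FFFD."""
    return u"".join(table.get(code, u"\ufffd") for code in bytearray(raw))


def decode_with_fallback(raw):
    """Assumed fallback when no mapping is usable: treat the codes as Latin-1."""
    return bytes(raw).decode("latin-1")


# The same character codes, as they might be stored in a page's content stream.
raw = b"\xe1\xe2\xe3"

with_map = decode_with_font_map(raw, FONT_TO_UNICODE)  # Arabic letters
without_map = decode_with_fallback(raw)                # accented Latin letters
```

The bytes are identical in both cases; only the mapping differs. With the font's table they come out as Arabic letters, and with the fallback they come out as accented Latin characters similar to the PDFMiner output in the question.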