python pdf fonts pdf-extraction

CID encoding of font


I am trying to extract text from a PDF with Python. None of the packages I tried (PyPDF2, pdfminer, fitz, etc.) could read it, but some of them returned the CID codes, e.g. (cid:3).

For now I read the file the "brute force" way: I worked out the CID decoding from a few known examples. (That notebook can be found here on Kaggle.)
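The brute-force approach can be sketched roughly as below. This is only an illustration, not the notebook's code: the table `CID_TO_CHAR` and its entries are hypothetical placeholders that would have to be worked out from the actual PDF, and `decode_cids` is a name made up for this sketch.

```python
import re

# Hypothetical table built by hand from known examples;
# the real numbers depend entirely on the PDF in question.
CID_TO_CHAR = {3: " ", 43: "H", 72: "e", 79: "l", 82: "o"}

# Matches the "(cid:N)" tokens that the extraction libraries emit.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def decode_cids(text, table, unknown="\ufffd"):
    """Replace every (cid:N) token using `table`; unmapped CIDs become `unknown`."""
    def replace(match):
        return table.get(int(match.group(1)), unknown)
    return CID_TOKEN.sub(replace, text)


sample = "(cid:43)(cid:72)(cid:79)(cid:79)(cid:82)(cid:3)(cid:999)"
decoded = decode_cids(sample, CID_TO_CHAR)
print(repr(decoded))
```

Marking unmapped CIDs with a replacement character, rather than dropping them, makes it easy to see which codes still need an entry in the table.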

I searched online for a more elegant way and found many mentions of Registry-Ordering-Supplement (ROS), and of how you should be able to find the encoding once you know the font.

Although fitz cannot interpret the text, it reports the font as CourierNewPSMT. Even with this information, I could not find the ROS info, CID encoding, CID mapping, or CID collection.

Can someone tell me how to interpret the CID-encoded text, knowing the font?


Solution

  • What is needed is a PDF editor that can re-encode the missing characters; otherwise you may as well discard the plain text. Use a tool suited to the task, which here means visually mapping each bad character to the expected one. In a GUI editor's remap dialog this took less time than is shown here. Many such editors are available, but as they are commercially licensed (I think I paid about $15) I will not promote any one.

    Once the characters are remapped, it is easier to use Python extraction, such as here, to the console or to a file, or to modify the PDF in many other ways.

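After remapping, one quick sanity check is to scan the extracted text for any remaining (cid:N) tokens. The following is a minimal sketch under the assumption that the extracted text is already available as a Python string; `leftover_cids` is a hypothetical helper name, not part of any PDF library.

```python
import re
from collections import Counter

# Same "(cid:N)" token shape that the extraction libraries emit.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def leftover_cids(text):
    """Count how often each CID number still appears undecoded in `text`."""
    return Counter(int(n) for n in CID_TOKEN.findall(text))


# Made-up sample text, standing in for real extraction output.
report = leftover_cids("Total(cid:3)due: 42 (cid:3)(cid:17)")
for cid, count in report.most_common():
    print(f"cid {cid}: {count} occurrence(s) still unmapped")
```

An empty result suggests the remap covered every character that the extractor previously could not decode.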