Tags: pdf, unicode, character-encoding, hebrew, windows-1255

Copy+pasting Hebrew text from PDF files results in final letters being incorrectly copied


I have a few PDF files in Hebrew that I want to translate to English. When I copy the text from the PDF files and paste it into a text editor, all of the Hebrew final letters are copied incorrectly.

I found this question, but it has no solution. It also only covers one specific final letter that was read incorrectly, and only in the context of a specific library.

I tried copying and pasting from both Acrobat Reader and the Chrome PDF viewer, but neither of them copied the contents correctly.

Another interesting thing I found: when I press Ctrl+F in the browser (I tried it in Chrome) and search for the final letter "Pe", for example, I get results for both the regular "Pe" and the final "Pe" (and vice versa when I search for the regular "Pe"), even though they have different code points (and different codes in the ANSI code page). The same holds for all of the final letters and their corresponding regular letters.

So the question is: does anyone know why this happens?
I get that there might be no actual code point mapped to the glyph, but in that case how are the characters rendered? I'm not very familiar with this subject, so I would appreciate any explanation. In addition, any good solution that lets me extract the text with the final letters intact would be greatly appreciated, since I would like to parse the text, and messed-up letters result in incomplete words.

EDIT:
As requested by weibeld, I'm adding a few copied words and the corresponding correct words, along with their hexdump.

E1 F7 F8 1B    בקר.  # Should be בקרן (final letter "Nun"). Every final Nun is copied as 1B instead of EF, its code in the Windows-1255 code page.

F2 F1 F7 E9 E9 17    עסקיי.  # Should be עסקיים (final letter "Mem"). Every final Mem is copied as 17 instead of ED.

Thanks!


Solution

  • So, based on your edit, the PDF file seems to use some strange (non-ASCII-compatible) Hebrew encoding for text extraction, which places the final forms of the letters in the 0x10-0x1F range, where ASCII has non-printable control characters.

    If all you want is to reconstruct the text in the PDF, the easiest solution might be not to change the PDF, but to replace the wrong codes with the correct ones after copying the text from the PDF.

    For example, paste the text copied from the PDF into a file named file, and then run:

    cat file | tr '\033' '\357' | tr '\027' '\355' >out_file
    

    That is, one tr for each wrong final letter. The numbers 033, 357, etc. are just the octal forms of the hexadecimal bytes 1B, EF, etc. that you found with hexdump. Find the remaining mappings and add them to the chain. Then out_file should contain the correctly encoded text, and you can open it in a text editor that supports Windows-1255.
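
As a rough sketch of how the whole workflow could look, the helpers below first list which bytes in the 0x10-0x1F range occur in the pasted file (to help find the remaining mappings), then apply the known replacements in a single tr call, and finally convert the result to UTF-8. The function names are made up for illustration, only the two mappings confirmed by the hexdump in the question (1B to EF and 17 to ED) are included, and the last step assumes an iconv that knows the CP1255 name is installed.

```shell
#!/bin/sh
# Hypothetical helper names; only the Nun and Mem mappings from the
# question's hexdump are known. The other final letters must be added.

# Print every byte value in 0x10-0x1f that occurs in file $1, with a count.
# These are the candidates for misplaced final letters.
list_control_bytes() {
    LC_ALL=C od -An -v -tx1 "$1" | tr -s ' ' '\n' | grep -E '^1[0-9a-f]$' | sort | uniq -c
}

# Replace the wrong bytes in file $1 and write the result to file $2.
# \033 (1B) -> \357 (EF, final Nun), \027 (17) -> \355 (ED, final Mem).
# LC_ALL=C asks tr to treat the input as raw bytes, not multibyte characters.
fix_finals() {
    LC_ALL=C tr '\033\027' '\357\355' < "$1" > "$2"
}

# Convert the repaired Windows-1255 file $1 to UTF-8 in file $2.
# Assumes iconv is available and accepts the encoding name CP1255.
to_utf8() {
    iconv -f CP1255 -t UTF-8 "$1" > "$2"
}
```

For example, `list_control_bytes file` shows which stray codes still need a mapping, and `fix_finals file out_file && to_utf8 out_file out_utf8.txt` produces a file that most editors can open without choosing an encoding by hand. Putting both mappings in one tr call is equivalent to the chained version above; each new mapping just extends both character lists by one entry.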