I have a pdf (created with latex with \usepackage[a-2b]{pdfx}) where I am able to correctly copy & paste ligatures, i.e., "fi" gets pasted into my text editor as "fi". The pdf is quite large, so I'm trying to reduce its size with this ghostscript command:
gs -dPDFA-2 -dBATCH -dNOPAUSE -sPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite \
   -dPDFSETTINGS=/printer -sProcessColorModel=DeviceRGB \
   -sColorConversionStrategy=UseDeviceIndependentColor \
   -dColorImageDownsampleType=/Bicubic -dAutoRotatePages=/None \
   -dCompatibilityLevel=1.5 -dEmbedAllFonts=true -dFastWebView=true \
   -sOutputFile=main_new.pdf main.pdf
While this produces a nice, small pdf, when I now copy a word containing "fi" I (often) get "ő" instead.
Since the correct characters are somehow encoded in the original pdf, is there some parameter I can give ghostscript so that it simply preserves this information in the converted pdf?
I'm using ghostscript 9.27 on macOS 10.14.
Without seeing your original file, so that I can see the way the text is encoded, it's not possible to be definitive. It certainly is not possible to have the pdfwrite device 'preserve the information'; for an explanation, see here.
If your original PDF file has a ToUnicode CMap then the pdfwrite device should use that to generate a new ToUnicode CMap in the output file, maintaining cut & paste and search. If it doesn't, then the conversion process will destroy the encoding. You might be able to get an improvement in results by setting SubsetFonts to false, but it's just a guess without seeing an example.
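As a sketch only (re-using your command from above with font subsetting disabled; whether it helps depends entirely on the fonts in your file):

# same command as before, with subsetting turned off
gs -dPDFA-2 -dBATCH -dNOPAUSE -sPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite \
   -dPDFSETTINGS=/printer -sProcessColorModel=DeviceRGB \
   -sColorConversionStrategy=UseDeviceIndependentColor \
   -dColorImageDownsampleType=/Bicubic -dAutoRotatePages=/None \
   -dCompatibilityLevel=1.5 -dEmbedAllFonts=true -dFastWebView=true \
   -dSubsetFonts=false \
   -sOutputFile=main_new.pdf main.pdf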
My guess is that your original file doesn't have a ToUnicode CMap, which means that it's essentially only working by luck.
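If you want to check, and assuming you have the Poppler utilities installed (they are not part of Ghostscript), pdffonts will tell you; the 'uni' column shows whether each font carries a ToUnicode CMap:

# compare the fonts in the original and the converted file
pdffonts main.pdf
pdffonts main_new.pdf

If that column shows 'no' for the fonts in the original file, that would confirm the guess above.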