Problem with /ToUnicode map only on macOS Preview

I have a weird issue with a PDF and its /ToUnicode CMap that only affects macOS Preview; all other viewers I tested work fine. I don't know whether the /ToUnicode CMap or Preview is at fault.

Here is the PDF in question: https://github.com/user-attachments/files/19538203/example.pdf and the GitHub issue where this problem popped up.

If that PDF is opened in macOS Preview and the text is selected and copied, everything after "Hello from HexaPD" is wrong. Other viewers copy the whole text just fine.

Current status (edited):

HexaPDF, the library generating the PDF, uses an optimization that avoids creating character codes containing the ASCII characters \r, (, ) and \. The reason is that those would need to be escaped when serializing as a PDF literal string.
If this optimization is turned off, the resulting file (see https://github.com/user-attachments/files/19575820/example.pdf) works perfectly in macOS Preview (i.e. copy and paste works).
Removing the /ToUnicode CMap entirely leads to uncopyable text. This means that macOS Preview is indeed using this CMap and that the CMap is the most likely culprit.
Adding a dummy entry <0000><0000> doesn't work.
Adding a dummy entry like <000D><0044> to the /ToUnicode CMap doesn't work.
Starting the character codes not at 1 but at 14 leads to the first 13 characters being invalid, i.e. it makes the situation worse.
After reading the respective parts of the PDF specification and the "5014 Adobe CMap and CIDFont Files Specification" I think that the /ToUnicode CMap in both linked files above is correct.
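
To make the first point concrete, here is a rough Python sketch of that kind of code assignment. It is not HexaPDF's actual implementation (HexaPDF is written in Ruby); the names `needs_escaping` and `assign_codes` are made up for illustration, and the sketch assumes two-byte codes handed out in ascending order.

```python
# Bytes that must be escaped inside a PDF literal string: \r ( ) \
ESCAPE_BYTES = {0x0D, 0x28, 0x29, 0x5C}

def needs_escaping(code):
    """True if either byte of a two-byte character code would need escaping."""
    return any(b in ESCAPE_BYTES for b in code.to_bytes(2, "big"))

def assign_codes(count, start=1):
    """Hand out `count` two-byte codes in ascending order, skipping unsafe ones."""
    codes = []
    code = start
    while len(codes) < count:
        if not needs_escaping(code):
            codes.append(code)
        code += 1
    return codes

codes = assign_codes(28)
print([f"{c:04X}" for c in codes])
```

With 28 glyphs this hands out 0001 to 000C and 000E to 001D, skipping 000D, which matches the gap in the character codes of the first linked file.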

Any insights into whether the generated /ToUnicode CMap is invalid or whether it is macOS Preview's fault are appreciated!

Solution

I believe I now understand the problem, and I am reasonably confident this is a mistake in Apple Preview.

Explaining this is, unfortunately, complicated.

The PDF file uses an embedded, subset font. As is common, the font only contains the glyphs (the actual descriptions of the character shapes) used by the PDF file. As is also common, the 'encoding' is such that the first character used gets character code 1, the second one gets character code 2 and so on.

Encodings in PDF are somewhat akin to code pages in Windows, or ASCII; they map numeric values to specific characters.

In the case of this file the font is actually a CIDFont, which complicates matters because these kinds of encodings can have variable sizes: rather like UTF-8, the number of bytes needed for a code varies. Fortunately for us, in this case all the codes are two bytes.

The CMap is the glue that joins all this together; it determines how many bytes of input are required to map to particular characters. The CMap takes character codes and returns CIDs; if you are using a CIDFont, then the CID is the 'index' into the font that finds a particular glyph program. If your font is a TrueType font (as is the case here) then the CIDToGIDMap translates CIDs into GIDs (because that's what TrueType fonts use). Again fortunately for us the CIDToGIDMap is /Identity. Nice and simple.

Now the important part of the CMap looks like this:

1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 begincidrange
<0001><000C> 1
<000E><001D> 13
endcidrange

So that says the code space (valid values) is from 0 to 0xFFFF, and within that the CMap defines two ranges of codes. The first range is 0x01 to 0x0C and maps to CIDs starting at 1 (so 0x01 = 1, 0x02 = 2 and so on). The second range is 0x0E to 0x1D and maps to CIDs starting at 13 (so 0x0E = 13, 0x0F = 14 and so on up to 0x1D = 28).
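
As a rough illustration of how those two cidrange entries are applied, here is a small Python sketch. The tuple layout and the name `code_to_cid` are my own choices, not part of any PDF library, and codes outside both ranges simply return None here.

```python
# cidrange entries from the font's CMap: (first code, last code, first CID)
CID_RANGES = [
    (0x0001, 0x000C, 1),
    (0x000E, 0x001D, 13),
]

def code_to_cid(code, ranges=CID_RANGES):
    """Return the CID for a character code, or None if no range covers it."""
    for first, last, first_cid in ranges:
        if first <= code <= last:
            return first_cid + (code - first)
    return None

for code in (0x0001, 0x000C, 0x000D, 0x000E, 0x001D):
    print(f"code {code:#06x} -> CID {code_to_cid(code)}")
```

Note that code 0x000D falls in the gap between the two ranges, while the CIDs themselves have no gap.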

So far so good. But how does that get us to copy and paste? Well, the answer is that it doesn't. There is an optional table called the ToUnicode CMap. This may or may not be present; if it is, then PDF consumers can reliably discover what Unicode code point a given character code maps to. If it isn't there then it's pure guesswork. In a file like this, with a custom mapping and subset font, it simply wouldn't be possible to determine the Unicode values.

Luckily for us, there is a ToUnicode CMap:

1 begincodespacerange
<0000> <FFFF>
endcodespacerange
24 beginbfchar
<0001><0048>
<0002><0065>
<0003><006c>
<0004><006f>
<0005><0020>
<0006><0066>
<0007><0072>
<0008><006d>
<0009><0078>
<000A><0061>
<000B><0050>
<000C><0044>
<000E><0046>
<000F><002e>
<0010><0054>
<0015><0076>
<0016><0079>
<0017><0063>
<0018><0070>
<0019><0026>
<001A><0075>
<001B><0077>
<001C><006e>
<001D><0021>
endbfchar
2 beginbfrange
<0011><0012><0068>
<0013><0014><0073>
endbfrange

This has, basically, the same structure as the earlier CMap, except that it maps character codes to Unicode code points instead of CIDs. You can see from this that character code 1 maps to Unicode code point U+0048. That's a capital 'H'. As I mentioned right at the start, the character codes are assigned as they are used; the first character is an 'H' and that's assigned to character code 1.

So the 'text' in the PDF file is actually stored (as K J said) as binary data. Because the file was produced in a way which avoids the use of escaped characters we avoid the use of 0x0D, which means the 'text' looks like:

0001 0002 0003 0003 0004 0005 0006 0007 0004 0008 0005 0001 0002 0009 000A 000B 000C 000E

1 = H, 2 = e, 3 = l, 4 = o, 5 = ' ', 6 = f etc.

So we take the character codes and look up the ToUnicode CMap. That yields

U+0048, U+0065, U+006C, U+006C and so on. The important point is that we use the character code to look up the ToUnicode CMap.
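
The lookup by character code can be sketched in Python as follows. This is only an illustration: `parse_to_unicode` and `decode` are hypothetical helpers, and the parser handles just the simple bfchar and bfrange forms that appear in this file (no array form of bfrange, no multi-character targets).

```python
import re

# The ToUnicode CMap entries quoted above, copied verbatim.
TO_UNICODE_CMAP = """\
24 beginbfchar
<0001><0048>
<0002><0065>
<0003><006c>
<0004><006f>
<0005><0020>
<0006><0066>
<0007><0072>
<0008><006d>
<0009><0078>
<000A><0061>
<000B><0050>
<000C><0044>
<000E><0046>
<000F><002e>
<0010><0054>
<0015><0076>
<0016><0079>
<0017><0063>
<0018><0070>
<0019><0026>
<001A><0075>
<001B><0077>
<001C><006e>
<001D><0021>
endbfchar
2 beginbfrange
<0011><0012><0068>
<0013><0014><0073>
endbfrange
"""

HEX = re.compile(r"<([0-9A-Fa-f]+)>")

def parse_to_unicode(cmap_text):
    """Build {character code: Unicode code point} from bfchar/bfrange sections."""
    mapping = {}
    section = None
    for line in cmap_text.splitlines():
        if line.endswith("beginbfchar"):
            section = "bfchar"
        elif line.endswith("beginbfrange"):
            section = "bfrange"
        elif line.startswith("end"):
            section = None
        else:
            values = [int(h, 16) for h in HEX.findall(line)]
            if section == "bfchar" and len(values) == 2:
                mapping[values[0]] = values[1]
            elif section == "bfrange" and len(values) == 3:
                first, last, target = values
                # Consecutive codes map to consecutive code points.
                for code in range(first, last + 1):
                    mapping[code] = target + (code - first)
    return mapping

def decode(raw, mapping):
    """Split the string bytes into two-byte codes and look each one up by code."""
    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    return "".join(chr(mapping[code]) for code in codes)

to_unicode = parse_to_unicode(TO_UNICODE_CMAP)
raw = bytes.fromhex(
    "0001 0002 0003 0003 0004 0005 0006 0007 0004 "
    "0008 0005 0001 0002 0009 000A 000B 000C 000E"
)
print(decode(raw, to_unicode))  # Hello from HexaPDF
```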

So why does Apple Preview get it wrong? Well, it 'seems to be' applying the CMap from the font to the character code to get a CID, then using the CID to look up the ToUnicode CMap. The problem with that is that the CIDs run contiguously from 1 to 28, but the character codes run from 1 to 12 and then from 14 to 29.

This causes two faults: firstly, CID 13 has no entry in the ToUnicode CMap, and secondly, all the CIDs beyond 13 are 'off by 1'. CID 14 gets the Unicode code point that belongs to CID 13 (the entry for character code 0x0E), and so on.
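
The difference between the two lookups can be simulated with a short Python sketch. The `lookup_by_cid` function models what Preview appears to do, which is an inference from its output and not something taken from Apple's code; all names here are illustrative, and U+FFFD stands in for a missing entry.

```python
# cidrange entries from the font's CMap: (first code, last code, first CID)
CID_RANGES = [(0x0001, 0x000C, 1), (0x000E, 0x001D, 13)]

# The ToUnicode CMap as a dict, with the two bfrange entries already expanded.
TO_UNICODE = {
    **dict(zip(range(0x01, 0x0D), "Helo frmxaPD")),
    **dict(zip(range(0x0E, 0x1E), "F.Thistvycp&uwn!")),
}

def code_to_cid(code):
    """Apply the font's CMap: character code -> CID (None if unmapped)."""
    for first, last, first_cid in CID_RANGES:
        if first <= code <= last:
            return first_cid + (code - first)
    return None

def lookup_by_code(code):
    """Conforming behaviour: the ToUnicode CMap is keyed by character code."""
    return TO_UNICODE.get(code, "\ufffd")

def lookup_by_cid(code):
    """Suspected Preview behaviour: the ToUnicode CMap is keyed by CID."""
    return TO_UNICODE.get(code_to_cid(code), "\ufffd")

# Four codes that spell "PDF." in this file's encoding.
codes = [0x000B, 0x000C, 0x000E, 0x000F]
print("".join(lookup_by_code(c) for c in codes))
print("".join(lookup_by_cid(c) for c in codes))
```

Up to code 0x000C both lookups agree. From 0x000E onwards the CID-keyed lookup first hits the missing entry at 13 and then returns the character belonging to the previous code.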

I've modified the original failing example so that the font CMap maps CIDs correctly for rendering, and modified the ToUnicode CMap so that it maps correctly when CIDs are used for the lookup instead of character codes. That file is here:

https://www.dropbox.com/scl/fi/ua6zzr8hr0hlazaf1f8yl/preview.pdf?rlkey=e6amwr722xjtrn2o9b44p6r7u&st=ifq46ngw&dl=0

That file copy/pastes correctly from Apple Preview (well, it does on my elderly macOS, Big Sur). It will not copy/paste correctly from any conforming PDF consumer, because the ToUnicode CMap is set up so that CIDs are used instead of character codes.

In short, Apple Preview is doing the lookup incorrectly. The only way to have Apple Preview and a conforming PDF consumer both get the correct result is for the character codes and CIDs to be the same, so that the ToUnicode CMap works no matter whether a character code or CID is used to do the lookup.
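
A producer could check that condition with something like the following Python sketch. The function name and the tuple layout are hypothetical; it only tests whether every cidrange maps each code to the CID with the same number.

```python
def codes_equal_cids(cid_ranges):
    """True if every (first code, last code, first CID) range is an identity mapping."""
    return all(first == first_cid for first, _last, first_cid in cid_ranges)

# The failing file: the second range maps code 0x0E (14) to CID 13.
failing = [(0x0001, 0x000C, 1), (0x000E, 0x001D, 13)]
# A layout where codes and CIDs coincide, so either lookup key gives the same result.
identity = [(0x0001, 0x000C, 1), (0x000E, 0x001D, 14)]

print(codes_equal_cids(failing))
print(codes_equal_cids(identity))
```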