pdf pdfbox apache-tika

What causes "Could not read ToUnicode CMap in font GoogleSans-Regular"


I'm not sure whether the problem is in the file, in PDFBox, or in something I'm doing. I suspect the file.

I'm getting:

"Could not read ToUnicode CMap in font GoogleSans-Regular"

java.io.IOException: java.lang.IllegalArgumentException: The start and the end values must not have different lengths.
    at org.apache.fontbox.cmap.CMapParser.parseBegincodespacerange(CMapParser.java:289)
    at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:147)
    at org.apache.pdfbox.pdmodel.font.CMapManager.parseCMap(CMapManager.java:73)
    at org.apache.pdfbox.pdmodel.font.PDFont.readCMap(PDFont.java:218)
    at org.apache.pdfbox.pdmodel.font.PDFont.loadUnicodeCmap(PDFont.java:147)
    at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:115)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:182)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:97)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:171)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541)
    at

ToUnicode looks like this:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<</Registry (Adobe)
/Ordering (Identity)
/Supplement 0
>> def
/CMapName /Adobe-Identity-H def
CMapType 2 def
1 begincodespacerange
<0000> <FFFFF>
endcodespacerange
0 beginbfchar
endbfchar
1 beginbfrange
<0003> <0037> [<0020> <0041> <0042> <0043> <0044> <0045> <0046> <0047> <0048> <0049> <004A> <004B> <004C> <004D> <004E> <004F> <0050> <0051> <0052> <0053> <0054> <0055> <0056> <0057> <0058> <0059> <005A> <0061> <0062> <0063> <0064> <0065> <0066> <0067> <0068> <0069> <006A> <006B> <006C> <006D> <006E> <006F> <0070> <0071> <0072> <0073> <0074> <0075> <0076> <0077> <0078> <0079> <007A>]
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end

We process a lot of third-party PDFs, so I'm more interested in a general way to handle this than in a way to repair this one file. Can I tell PDFBox to assume Unicode?


Solution

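  • The problem is in the codespace range, where the end value has five hex digits but the start value has only four, so the two have different lengths (which is what the exception reports):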

    <0000> <FFFFF>
    

    It should be

    <0000> <FFFF>
    

    You likely won't be able to get text extraction for that font. PDFBox tries some fallback strategies, but they do not always work.
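
    Since the question is about handling many third-party files, here is a minimal sketch of how one might flag this kind of defect in a CMap that has already been extracted as text. The class name `CodespaceRangeCheck` and the method `findMismatchedRanges` are hypothetical helpers, not part of PDFBox, and the check compares hex-digit counts, which is a simplification rather than PDFBox's actual parsing logic. Getting the ToUnicode stream out of the PDF is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): scans CMap text for codespace
// ranges whose start and end hex strings have different numbers of digits.
public class CodespaceRangeCheck {

    // Captures everything between "begincodespacerange" and "endcodespacerange".
    private static final Pattern BLOCK =
            Pattern.compile("begincodespacerange(.*?)endcodespacerange", Pattern.DOTALL);

    // Captures one "<start> <end>" pair of hex strings.
    private static final Pattern PAIR =
            Pattern.compile("<([0-9A-Fa-f]*)>\\s*<([0-9A-Fa-f]*)>");

    /** Returns each "<start> <end>" pair whose two hex strings differ in length. */
    public static List<String> findMismatchedRanges(String cmapText) {
        List<String> bad = new ArrayList<>();
        Matcher block = BLOCK.matcher(cmapText);
        while (block.find()) {
            Matcher pair = PAIR.matcher(block.group(1));
            while (pair.find()) {
                // Assumption: digit-count comparison is enough to spot the defect.
                if (pair.group(1).length() != pair.group(2).length()) {
                    bad.add(pair.group());
                }
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String cmap = "1 begincodespacerange\n<0000> <FFFFF>\nendcodespacerange\n";
        System.out.println(findMismatchedRanges(cmap)); // prints [<0000> <FFFFF>]
    }
}
```

    A check like this only reports the defect; it does not make PDFBox able to extract text for the affected font.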