java pdfclown

PDF Clown: fonts in some PDF files are not recognized and exceptions are thrown


I frequently run into problems with PDF Clown when PDF files contain non-English text: their fonts are not recognized, and I get the exceptions below. Please find the PDF path and code path. The loadEncoding method fails in both CompositeFont.java and SimpleFont.java. Is there a specific version of the jar I need to use to resolve this issue? Please advise on how to support such PDF files.

java.lang.NullPointerException
    at org.pdfclown.documents.contents.fonts.CompositeFont.loadEncoding(CompositeFont.java:178)
    at org.pdfclown.documents.contents.fonts.CompositeFont.onLoad(CompositeFont.java:202)
    at org.pdfclown.documents.contents.fonts.Font.load(Font.java:878)
    at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:368)
    at org.pdfclown.documents.contents.fonts.CompositeFont.<init>(CompositeFont.java:114)
    at org.pdfclown.documents.contents.fonts.Type0Font.<init>(Type0Font.java:62)
    at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:268)
    at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
    at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
    at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
    at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
    at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
    at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
    at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1360)
    at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:819)
    at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:771)
    at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:764)
    at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:684)
    at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:676)
    at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1184)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:636)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:299)
    at pdfclown2.highlight(pdfclown2.java:89)
    at pdfclown2.main(pdfclown2.java:48)

*****************************other pdf issue*********************************************

java.lang.NullPointerException
    at org.pdfclown.documents.contents.fonts.SimpleFont.loadEncoding(SimpleFont.java:150)
    at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:170)
    at org.pdfclown.documents.contents.fonts.Font.load(Font.java:878)
    at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:368)
    at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:65)
    at org.pdfclown.documents.contents.fonts.TrueTypeFont.<init>(TrueTypeFont.java:47)
    at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:262)
    at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
    at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
    at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
    at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
    at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
    at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
    at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1360)
    at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:819)
    at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:771)
    at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:764)
    at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:684)
    at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:676)
    at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1184)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:636)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:645)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:299)
    at pdfclown2.highlight(pdfclown2.java:89)
    at pdfclown2.main(pdfclown2.java:48)

*********************************another issue**************************************

java.lang.RuntimeException: Odd number of characters.
    at org.pdfclown.util.ConvertUtils.hexToByteArray(ConvertUtils.java:106)
    at org.pdfclown.objects.PdfString.setValue(PdfString.java:287)
    at org.pdfclown.objects.PdfString.<init>(PdfString.java:126)
    at org.pdfclown.objects.PdfByteString.<init>(PdfByteString.java:58)
    at org.pdfclown.documents.contents.tokens.ContentParser.parsePdfObject(ContentParser.java:182)
    at org.pdfclown.documents.contents.tokens.ContentParser.parseOperation(ContentParser.java:164)
    at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObject(ContentParser.java:98)
    at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObjects(ContentParser.java:134)
    at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObject(ContentParser.java:112)
    at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObjects(ContentParser.java:134)
    at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObject(ContentParser.java:112)
    at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObjects(ContentParser.java:134)
    at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObject(ContentParser.java:112)
    at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObjects(ContentParser.java:134)
    at org.pdfclown.documents.contents.Contents.load(Contents.java:598)
    at org.pdfclown.documents.contents.Contents.<init>(Contents.java:372)
    at org.pdfclown.documents.contents.Contents.wrap(Contents.java:351)
    at org.pdfclown.documents.Page.getContents(Page.java:585)
    at org.pdfclown.documents.contents.ContentScanner.<init>(ContentScanner.java:1056)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:300)
    at pdfclown2.highlight(pdfclown2.java:3124)
    at pdfclown2.main(pdfclown2.java:50)

Solution

  • NullPointerException in SimpleFont.loadEncoding

    I can reproduce the NullPointerException in SimpleFont.loadEncoding using your example file "Sample_Report.pdf". It is caused by an error in the PDF: some font dictionaries in it are missing required entries.

    I cannot reproduce the other two exceptions using "Sample_Report.pdf", though. Thus, I'll focus on the reproducible issue.

    The cause

    In your example PDF there are some simple fonts which lack the required FirstChar entry, e.g.:

    32 0 obj
    <<
      /Name /EvoPdf_meeaimambhggkplibeicinfbamefiocn
      /Subtype /TrueType
      /FontDescriptor 37 0 R
      /Widths [0 507 507 507 507 507 507 507 507 507 507 507 507 0 507 507 507 507 507 507 507 507 507 507 507 507 507 507 507 507 507 507 226 326 401 498 507 715 682 221 303 303 498 498 250 306 252 386 507 507 507 507 507 507 507 507 507 507 268 268 498 498 498 463 894 579 544 533 615 488 459 631 623 252 319 520 420 855 646 662 517 673 543 459 487 642 567 890 519 487 468 307 386 307 498 498 291 479 525 423 525 498 305 471 525 230 239 455 230 799 525 527 525 525 349 391 335 525 452 715 433 453 395 314 460 314 498 507 507 507 250 305 418 690 498 498 395 1038 459 339 867 507 468 507 507 250 250 418 418 498 498 905 450 705 391 339 850 507 395 487 226 326 498 507 498 507 498 498 393 834 402 512 498 306 507 394 339 498 336 334 292 550 586 252 307 246 422 512 636 671 675 463 579 579 579 579 579 579 763 533 488 488 488 488 252 252 252 252 625 646 662 662 662 662 662 498 664 642 642 642 642 487 517 527 479 479 479 479 479 479 773 423 498 498 498 498 230 230 230 230 525 525 527 527 527 527 527 498 529 525 525 525 525 453 525 453 ]
      /Encoding /WinAnsiEncoding
      /Type /Font
      /BaseFont /Calibri
      /LastChar 255 >>
    endobj 
    

    According to the PDF specification ISO 32000-1 (and similarly ISO 32000-2), TrueType font dictionaries contain the same entries as Type1 font dictionaries (with certain differences irrelevant to the case at hand), and the section on Type1 fonts specifies:

    FirstChar integer (Required except for the standard 14 fonts) The first character code defined in the font’s Widths array.

    The font above is not a standard 14 font. Thus, it is required to have a FirstChar entry. It does not. Thus, this font definition is broken.

    PDF Clown, on the other hand, expects PDFs to follow the specification. So it simply retrieves the FirstChar value from the font dictionary and uses it immediately, without a null check, which results in the NullPointerException.

    A work-around

    One can make PDF Clown a bit more lax by making it default to 0 in its SimpleFont FirstChar lookups. There are two such lookups.

    In SimpleFont.loadEncoding() replace

    ByteArray charCode = new ByteArray(new byte[]{(byte)((PdfInteger)getBaseDataObject().get(PdfName.FirstChar)).getIntValue()});
    

    by

    PdfInteger firstCharObject = (PdfInteger)getBaseDataObject().get(PdfName.FirstChar);
    ByteArray charCode = new ByteArray(new byte[]{(byte)(firstCharObject != null ? firstCharObject.getIntValue() : 0)});
    

    and in SimpleFont.onLoad() similarly replace

    ByteArray charCode = new ByteArray(
        new byte[]
            {(byte)((PdfInteger)getBaseDataObject().get(PdfName.FirstChar)).getIntValue()}
    );
    

    by

    PdfInteger firstCharObject = (PdfInteger)getBaseDataObject().get(PdfName.FirstChar);
    ByteArray charCode = new ByteArray(
        new byte[]
            {(byte)(firstCharObject != null ? firstCharObject.getIntValue() : 0)}
    );
    

    as has already been done here.
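    Instead of hard-coding 0, the missing value can often be inferred, because ISO 32000-1 defines Widths as an array of (LastChar - FirstChar + 1) entries. The following is only a sketch of that idea in plain Java: the font dictionary is modelled as a Map, and the class and method names are made up for illustration; none of this is PDF Clown API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PDF Clown: the font dictionary is
// modelled as a plain Map with the PDF key names as strings.
final class FirstCharFallback {

    /**
     * Returns FirstChar if present; otherwise infers it from LastChar and
     * the length of Widths (Widths has LastChar - FirstChar + 1 entries);
     * otherwise falls back to 0, like the work-around above.
     */
    public static int resolveFirstChar(Map<String, Object> fontDict) {
        Object firstChar = fontDict.get("FirstChar");
        if (firstChar instanceof Integer) {
            return (Integer) firstChar;
        }
        Object lastChar = fontDict.get("LastChar");
        Object widths = fontDict.get("Widths");
        if (lastChar instanceof Integer && widths instanceof int[]) {
            int inferred = (Integer) lastChar - ((int[]) widths).length + 1;
            if (inferred >= 0) {
                return inferred;
            }
        }
        return 0; // nothing usable in the dictionary: lax default
    }

    public static void main(String[] args) {
        // A dictionary without FirstChar, with LastChar 255 and 256 widths.
        Map<String, Object> broken = new HashMap<>();
        broken.put("LastChar", 255);
        broken.put("Widths", new int[256]);
        System.out.println(resolveFirstChar(broken)); // prints 0
    }
}
```

    For a dictionary with LastChar 255 and 256 widths the inferred value is 0, which matches the default used in the work-around.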

    NullPointerException in CompositeFont.loadEncoding

    I can reproduce the NullPointerException in CompositeFont.loadEncoding using your example file "UnicodeTest.pdf". These exceptions are caused by missing encoding CMaps in PDF Clown.

    There are a number of encodings, primarily for CJK languages, which a conforming PDF processor is expected to support but which PDF libraries (in particular those developed in Europe or the Americas) often don't support out of the box.

    PDF Clown expects such encoding CMaps as resources in /fonts/cmap/ in the pdfclown.jar; by default, though, only the generic CMaps Identity-H and Identity-V are there, and none of the specific Chinese/Japanese/Korean CMaps.

    You can add the required CMaps to the pdfclown.jar by adding them to the main\res\pkg\fonts\cmap\ folder of the PDF Clown project and building the jar file.

    You can retrieve all CMaps from the adobe-type-tools/cmap-resources project on GitHub: simply traverse the folder structure of that project and collect the files from the CMap subfolders.
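    Collecting the files can be scripted. The sketch below assumes you have a local checkout of that project and copies every regular file whose parent folder is named CMap into one flat target folder; the class name and the command-line arguments are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for gathering CMap files from a local checkout.
final class CMapCollector {

    /**
     * Copies every regular file located directly in a folder named "CMap"
     * below checkoutRoot into targetDir and returns the number of files found.
     * Files with the same name overwrite each other in targetDir.
     */
    public static int collect(Path checkoutRoot, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        List<Path> cmapFiles;
        try (Stream<Path> paths = Files.walk(checkoutRoot)) {
            cmapFiles = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getParent() != null
                            && p.getParent().getFileName() != null
                            && "CMap".equals(p.getParent().getFileName().toString()))
                    .collect(Collectors.toList());
        }
        for (Path file : cmapFiles) {
            Files.copy(file, targetDir.resolve(file.getFileName().toString()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        return cmapFiles.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CMapCollector <checkout-root> <target-dir>");
            return;
        }
        int copied = collect(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(copied + " CMap files copied");
    }
}
```

    Use a target folder outside the checkout, for example the cmap resource folder of the PDF Clown project mentioned above, and rebuild the jar afterwards.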

    In the case of your example file, the CMaps UniCNS-UTF16-H, UniGB-UTF16-H, UniJIS-UTF16-H, and UniKS-UTF16-H sufficed, but for an application working with arbitrary PDF files you should probably add all encoding CMaps.
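    To see which CMaps your rebuilt jar actually provides, you can probe the classpath. This is only a sketch: it assumes pdfclown.jar is on the classpath of the program and that each CMap is stored under /fonts/cmap/ with the plain CMap name as the resource name; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic, not part of PDF Clown.
final class CMapResourceCheck {

    /**
     * Returns the names for which no classpath resource
     * "/fonts/cmap/<name>" can be opened via the given anchor class.
     */
    public static List<String> missing(Class<?> anchor, List<String> names) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            // A null resource is permitted in try-with-resources and is simply not closed.
            try (InputStream in = anchor.getResourceAsStream("/fonts/cmap/" + name)) {
                if (in == null) {
                    missing.add(name);
                }
            } catch (IOException e) {
                missing.add(name); // could not close the stream: treat as unusable
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> needed = Arrays.asList(
                "Identity-H", "UniCNS-UTF16-H", "UniGB-UTF16-H",
                "UniJIS-UTF16-H", "UniKS-UTF16-H");
        System.out.println("Missing CMaps: " + missing(CMapResourceCheck.class, needed));
    }
}
```

    Run it with the same classpath as your application; any name it reports still has to be added to the jar before fonts using that encoding can be loaded.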