java pdfbox pdfa

How to create the simplest possible PDF/A-2b with Apache PDFBox that is compliant according to VeraPDF?


I am using Apache PDFBox to create a very simple PDF with one line of text that should conform to PDF/A-2b, and I want to check its conformance with VeraPDF. VeraPDF tells me that the PDF is not compliant and shows me two failed assertions.

My code looks something like this:

try (ByteArrayOutputStream baos = new ByteArrayOutputStream(); PDDocument document = new PDDocument(); COSStream cosStream = new COSStream()) {
    PDPage page = new PDPage();
    document.addPage(page);

    PDDocumentInformation documentInformation = new PDDocumentInformation();
    documentInformation.setTitle("Name");
    documentInformation.setCreator("Creator");
    documentInformation.setSubject("Subject");
    document.setDocumentInformation(documentInformation);

    try (ByteArrayOutputStream xmpOutputStream = new ByteArrayOutputStream(); OutputStream cosXMPStream = cosStream.createOutputStream()) {
        XMPMetadata xmp = XMPMetadata.createXMPMetadata();
        PDFAIdentificationSchema pdfaSchema = xmp.createAndAddPFAIdentificationSchema();
        pdfaSchema.setPart(2);
        pdfaSchema.setConformance("B");
        DublinCoreSchema dublinCoreSchema = xmp.createAndAddDublinCoreSchema();
        dublinCoreSchema.setTitle("Name");
        dublinCoreSchema.addCreator("Creator");
        dublinCoreSchema.setDescription("Subject");
        XMPBasicSchema basicSchema = xmp.createAndAddXMPBasicSchema();
        Calendar creationDate = Calendar.getInstance();
        basicSchema.setCreateDate(creationDate);
        basicSchema.setModifyDate(creationDate);
        basicSchema.setMetadataDate(creationDate);
        basicSchema.setCreatorTool("Creator Tool");
        new XmpSerializer().serialize(xmp, xmpOutputStream, true);
        cosXMPStream.write(xmpOutputStream.toByteArray());
        document.getDocumentCatalog().setMetadata(new PDMetadata(cosStream));
    }

    PDViewerPreferences prefs = new PDViewerPreferences(page.getCOSObject());
    prefs.setDisplayDocTitle(true);
    document.getDocumentCatalog().setViewerPreferences(prefs);

    File fontFile = new File("C:\\Windows\\Fonts\\arial.ttf");
    PDType0Font font = PDType0Font.load(document, fontFile);

    PDPageContentStream contentStream = new PDPageContentStream(document, page);
    contentStream.beginText();
    contentStream.setFont(font, 12);
    contentStream.newLineAtOffset(100, 700);
    contentStream.showText("Hello PDF/A-2b World!");
    contentStream.endText();
    contentStream.close();

    document.save(baos);
    try (PDFAParser parser = Foundries.defaultInstance().createParser(new ByteArrayInputStream(baos.toByteArray()), PDFAFlavour.PDFA_2_B)) {
        PDFAValidator validator = Foundries.defaultInstance().createValidator(PDFAFlavour.PDFA_2_B, false);
        ValidationResult result = validator.validate(parser);
        System.out.println(result.isCompliant());
    }
}

When I inspect the generated PDF with debugger-app-2.0.31.jar, I can find the metadata. When I compare the metadata with a PDF file from the VeraPDF regression tests (e.g. this one), the only difference that seems relevant to me is the begin attribute of the xpacket header. It is empty in the VeraPDF test file (<?xpacket begin=''), whereas in the file created by PDFBox it seems to contain the BOM start sequence (<?xpacket begin="" with an invisible character between the quotes).
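
Because the BOM is invisible in most viewers, it can help to print the code points of the begin attribute instead of comparing it by eye. The following is a minimal sketch that uses only the JDK; the class and method names (XPacketBegin, beginValue, describe) are hypothetical helpers, not part of PDFBox or VeraPDF. As far as I understand the XMP specification, both an empty value and U+FEFF are permitted for this attribute, so this difference alone may well not be what VeraPDF complains about.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: inspects the begin attribute of an XMP packet header.
final class XPacketBegin {
    private static final String HEADER = "<?xpacket begin=";

    /** Returns the raw begin attribute value of the first xpacket header, or null if none is found. */
    static String beginValue(byte[] xmpBytes) {
        String text = new String(xmpBytes, StandardCharsets.UTF_8);
        int start = text.indexOf(HEADER);
        if (start < 0) {
            return null;
        }
        int quotePos = start + HEADER.length();
        if (quotePos >= text.length()) {
            return null;
        }
        char quote = text.charAt(quotePos);
        if (quote != '"' && quote != '\'') {
            return null;
        }
        int end = text.indexOf(quote, quotePos + 1);
        return end < 0 ? null : text.substring(quotePos + 1, end);
    }

    /** Renders the value as a list of code units so that invisible characters become visible. */
    static String describe(String value) {
        if (value == null) {
            return "no xpacket header found";
        }
        if (value.isEmpty()) {
            return "empty";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) value.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] withBom = "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] empty = "<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(describe(beginValue(withBom))); // prints U+FEFF
        System.out.println(describe(beginValue(empty)));   // prints empty
    }
}
```

To check a real file, the bytes of the metadata stream (for example saved from the debugger app) would be passed to beginValue in place of the sample strings.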

Can someone tell me whether this is an error in VeraPDF or in PDFBox? Is there a solution for this problem? Can someone explain the second error to me and offer a solution?


Solution

  • The CreatePDFA example from the PDFBox source code handles the metadata part slightly differently from your code (yours looks OK at first glance but is not, see the note about PDMetadata at the end), and I was able to validate its output with VeraPDF:

    XmpSerializer serializer = new XmpSerializer();
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    serializer.serialize(xmp, baos, true);
    
    PDMetadata metadata = new PDMetadata(doc);
    metadata.importXMPMetadata(baos.toByteArray());
    doc.getDocumentCatalog().setMetadata(metadata);
    

    The second problem is the missing output intent. Add this code:

    // sRGB output intent
    InputStream colorProfile = CreatePDFA.class.getResourceAsStream(
            "/org/apache/pdfbox/resources/pdfa/sRGB.icc");
    PDOutputIntent intent = new PDOutputIntent(doc, colorProfile);
    intent.setInfo("sRGB IEC61966-2.1");
    intent.setOutputCondition("sRGB IEC61966-2.1");
    intent.setOutputConditionIdentifier("sRGB IEC61966-2.1");
    intent.setRegistryName("http://www.color.org");
    doc.getDocumentCatalog().addOutputIntent(intent);
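
The resource path in the snippet above belongs to the PDFBox examples and may not be on your classpath. As one possible alternative, the JDK bundles an sRGB profile that can be read through java.awt.color.ICC_Profile. The sketch below only shows how to obtain those bytes and sanity-check the ICC header; the class name JdkSrgbProfile and its methods are hypothetical, and whether this particular profile satisfies VeraPDF in your setup is an assumption you should verify by validating the resulting file.

```java
import java.awt.color.ColorSpace;
import java.awt.color.ICC_Profile;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: exposes the JDK's built-in sRGB ICC profile as bytes or as a stream.
final class JdkSrgbProfile {

    /** Returns the raw ICC data of the sRGB profile shipped with the JDK. */
    static byte[] srgbBytes() {
        return ICC_Profile.getInstance(ColorSpace.CS_sRGB).getData();
    }

    /** Wraps the profile in a stream, e.g. to pass as the colorProfile argument above. */
    static InputStream srgbStream() {
        return new ByteArrayInputStream(srgbBytes());
    }

    /** Reads a four-byte ASCII tag from the ICC header at the given offset. */
    static String tag(byte[] profile, int offset) {
        return new String(profile, offset, 4, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] profile = srgbBytes();
        // Bytes 36..39 of every ICC profile hold the file signature "acsp".
        System.out.println(tag(profile, 36)); // prints acsp
        // Bytes 16..19 hold the data colour space, which should be RGB here.
        System.out.println(tag(profile, 16).trim()); // prints RGB
    }
}
```

With this helper, JdkSrgbProfile.srgbStream() would take the place of the getResourceAsStream call, and the rest of the output intent code stays the same.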
    

    About the PDFMergerExample and your original code:

    Both that example and your code use new PDMetadata(cosStream). This constructor does not add the two mandatory entries of a metadata stream dictionary, Type (Metadata) and Subtype (XML). Add this to your code:

    cosStream.setName(COSName.TYPE, "Metadata");
    cosStream.setName(COSName.SUBTYPE, "XML");