java pdf itext itext7

Replace given strings in a pdf keeping the original styling


Consider the following situation: I have a PDF containing the following lines:

line 1: I love to **write** code
line 2: I love to write *java* code
line 3: I love to write java code that replaces some texts(underline) in pdf.

In line 1, write is bold and set in Arial.
In line 2, java is italic and set in Nunito.
In line 3, texts is underlined and set in Times New Roman.

What I am trying to do is replace
write with check
java with perl
texts with words
while keeping the exact font and styling each one has. I have been trying to achieve this with the iText 7 Java library and have gone through a lot of resources on SO, blogs and books, but none fulfils my exact requirements.

So far, I can replace given words in a PDF with the same font (if the PDF contains only one font). However, the extracted font size differs from the original, so I had to set it manually.

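import com.itextpdf.kernel.colors.Color;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.ITextExtractionStrategy;

import java.io.IOException;
import java.util.Set;

// Both members below belong to the same enclosing class.
// SOURCE and DESTINATION are file paths defined elsewhere.
public static void main(String[] args) throws IOException {

    PdfReader reader = new PdfReader(SOURCE);
    PdfWriter writer = new PdfWriter(DESTINATION);
    PdfDocument pdfDocument = new PdfDocument(reader, writer);

    // Run the extraction strategy over the content of page 1
    TextPropertiesExtractionStrategy extractionStrategy = new TextPropertiesExtractionStrategy();
    new PdfCanvasProcessor(extractionStrategy).processPageContent(pdfDocument.getPage(1));
    System.out.println("Font Name: " + extractionStrategy.getFontName());
    System.out.println("Font Size: " + extractionStrategy.getFontSize());
    System.out.println("Text Color: " + extractionStrategy.getTextColor().getColorSpace().toString());

    // Close the document so that DESTINATION is written completely
    pdfDocument.close();
}

private static class TextPropertiesExtractionStrategy implements ITextExtractionStrategy {
    private String fontName;
    private float fontSize;
    private Color textColor;
    private PdfFont font;

    @Override
    public void eventOccurred(IEventData data, EventType type) {
        if (data instanceof TextRenderInfo) {
            TextRenderInfo textRenderInfo = (TextRenderInfo) data;

            // Get font information. Every text chunk overwrites the previous
            // values, so only the last chunk's properties remain at the end.
            font = textRenderInfo.getFont();
            fontName = font.getFontProgram().getFontNames().getFontName();
            // This is the size set in the content stream; the text matrix or
            // CTM may scale it further, which may explain a size mismatch.
            fontSize = textRenderInfo.getFontSize();

            // Get text color information
            textColor = textRenderInfo.getFillColor();
        }
    }

    @Override
    public Set<EventType> getSupportedEvents() {
        // null means all event types are passed to eventOccurred
        return null;
    }

    @Override
    public String getResultantText() {
        // Text itself is not collected by this strategy
        return null;
    }

    public String getFontName() {
        return fontName;
    }

    public float getFontSize() {
        return fontSize;
    }

    public PdfFont getFont() {
        return font;
    }

    public Color getTextColor() {
        return textColor;
    }
}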

I am open to any other open-source libraries or languages as well, such as Python (I've also tried MuPDF), as long as they solve this particular problem.


Solution

  • Regardless of language, the way a PDF is constructed generally does not lend itself to native editing of its components. An editor basically has to replace existing entries and write fresh content. Take the first change, where w r it e needs altering to check. That "should be easy": it is not an embedded font and the replacement has the same number of bytes, so what could possibly go wrong?

    So the glyph placement is completely upset by the different character widths of a proportional font, and an edit therefore has to treat the whole text block as entirely new.
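
    To put rough numbers on that width problem, here is a minimal Java sketch. It assumes the standard Helvetica advance widths (in 1/1000 of the font size) as a stand-in for Arial, a regular rather than bold face, and a 12 pt size; the class and method names are made up for illustration, and the real PDF may well use different metrics.

    ```java
    import java.util.Map;

    public class GlyphWidthDemo {
        // ASSUMPTION: advance widths from the standard Helvetica metrics,
        // in 1/1000 of the font size. The font in the actual PDF may differ.
        private static final Map<Character, Integer> WIDTHS = Map.of(
                'w', 722, 'r', 333, 'i', 222, 't', 278, 'e', 556,
                'c', 500, 'h', 556, 'k', 500);

        /** Sums the advance widths of all characters and scales to points. */
        public static double textWidth(String text, double fontSize) {
            int units = 0;
            for (char ch : text.toCharArray()) {
                Integer w = WIDTHS.get(ch);
                if (w == null) {
                    throw new IllegalArgumentException("No width for '" + ch + "'");
                }
                units += w;
            }
            return units * fontSize / 1000.0;
        }

        public static void main(String[] args) {
            double fontSize = 12.0; // hypothetical size
            double before = textWidth("write", fontSize);
            double after = textWidth("check", fontSize);
            System.out.println("write: " + before + " pt");
            System.out.println("check: " + after + " pt");
            System.out.println("shift for following text: " + (after - before) + " pt");
        }
    }
    ```

    With these assumed metrics, check comes out about 6 pt wider than write at 12 pt, so everything after it on the line would have to move.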

    Alright, let's change the block styling. It is immediately clear why you cannot simply replace letters in styled fonts: they will not be placed at the correct spacing.

    java is more straightforward, but it is set in an embedded font, so if a letter is not included in that font you will see either a box or a blank.
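
    The missing-letter problem can be checked before any edit is attempted. The sketch below is plain Java with no PDF library; it assumes, purely for illustration, that the italic font was embedded as a subset holding only the letters of java. The class and method names are hypothetical.

    ```java
    import java.util.LinkedHashSet;
    import java.util.Set;

    public class SubsetGlyphCheck {
        /** Returns the characters of the replacement that the subset cannot draw. */
        public static Set<Character> missingGlyphs(Set<Character> subset, String replacement) {
            Set<Character> missing = new LinkedHashSet<>();
            for (char ch : replacement.toCharArray()) {
                if (!subset.contains(ch)) {
                    missing.add(ch);
                }
            }
            return missing;
        }

        public static void main(String[] args) {
            // ASSUMPTION: the embedded subset holds only the letters of "java".
            Set<Character> subset = Set.of('j', 'a', 'v');
            // Prints [p, e, r, l]: none of the new letters is available.
            System.out.println(missingGlyphs(subset, "perl"));
        }
    }
    ```

    If the returned set is not empty, the replacement needs the full font (or another font) to be embedded, and the result is no longer the exact original styling.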

    And since it is a PDF, texts does not occupy the same space as words, and note that the line underneath is a totally different object (there are no underline fonts) that needs a totally separate edit.
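
    As a rough model of why two edits are needed, the sketch below treats the text run and the underline as unrelated objects, which is how the paragraph above describes them. All names, coordinates and widths are hypothetical; a real content stream stores these as separate drawing operators rather than Java objects.

    ```java
    public class UnderlineAsSeparateObject {
        /** A run of text: a start position plus the width its glyphs take up. */
        public static final class TextRun {
            public String text;
            public double x;
            public double width;

            public TextRun(String text, double x, double width) {
                this.text = text;
                this.x = x;
                this.width = width;
            }
        }

        /** A stroked line: two x coordinates with no link to any text. */
        public static final class Rule {
            public double x1;
            public double x2;

            public Rule(double x1, double x2) {
                this.x1 = x1;
                this.x2 = x2;
            }
        }

        /** First edit: swap the text. The rule is not touched. */
        public static void replaceText(TextRun run, String newText, double newWidth) {
            run.text = newText;
            run.width = newWidth;
        }

        /** Second edit: stretch the rule to the run's current extent. */
        public static void fitRule(Rule rule, TextRun run) {
            rule.x1 = run.x;
            rule.x2 = run.x + run.width;
        }

        public static void main(String[] args) {
            // ASSUMPTION: made-up position and widths, in points.
            TextRun run = new TextRun("texts", 100.0, 22.7);
            Rule rule = new Rule(100.0, 122.7);

            replaceText(run, "words", 29.3);
            // true: the old line now stops short of the new word
            System.out.println(rule.x2 < run.x + run.width);

            fitRule(rule, run);
            // true: the line matches the word again
            System.out.println(rule.x2 == run.x + run.width);
        }
    }
    ```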

    Thus the most efficient way to alter PDF text is with a visual (WYSIWYG) GUI word processor, where these disruptions, and others, can be compensated for by human judgment.