java pdfbox

How can I copy extracted text page by page in a document to a new PDF document with PDFBox?


I want to copy the text of an original PDF document into a new PDF document preserving the formatting of the source text.

I have already done some tests, but the result of copying the text into the new document is not what I hoped for. Below is the code that writes to the content stream.

for (PDPage page : newDoc.getPages()) {
    PDPageContentStream contentStream = new PDPageContentStream(newDoc, page);
    contentStream.beginText();
    for (List<TextLine> row : rowList) {
        for (TextLine characters : line) {
            contentStream.setFont(characters.getFont(), characters.getFontSize());
            contentStream.newLineAtOffset(characters.getxPos(), characters.getyPos());
            contentStream.setLeading(10.5f);
            contentStream.showText(characters.getText());
        }
    }
    contentStream.endText();
    contentStream.close();
}


Solution

    Copying the text

    We already discussed your approach in the comments to your question and you eventually asked for a practical example.

    Unfortunately your code is not compilable, let alone runnable, so I had to create somewhat different code:

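    // Imports needed by the copyText methods (PDFBox 3.x package names)
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDPageContentStream;
    import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
    import org.apache.pdfbox.pdmodel.common.PDRectangle;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    void copyText(PDDocument source, int sourcePageNumber, PDDocument target, PDPage targetPage) throws IOException {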
        List<TextPosition> allTextPositions = new ArrayList<>();
        PDFTextStripper pdfTextStripper = new PDFTextStripper() {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
                allTextPositions.addAll(textPositions);
                super.writeString(text, textPositions);
            }
        };
        pdfTextStripper.setStartPage(sourcePageNumber + 1);
        pdfTextStripper.setEndPage(sourcePageNumber + 1);
        pdfTextStripper.getText(source);
    
        PDRectangle targetPageCropBox = targetPage.getCropBox();
        float yOffset = targetPageCropBox.getUpperRightY() + targetPageCropBox.getLowerLeftY();
    
        try (PDPageContentStream contentStream = new PDPageContentStream(target, targetPage, AppendMode.APPEND, true, true)) {
            contentStream.beginText();
            float x = 0;
            float y = yOffset;
            for (TextPosition position: allTextPositions) {
                contentStream.setFont(position.getFont(), position.getFontSizeInPt());
                contentStream.newLineAtOffset(position.getX() - x, - (position.getY() - y));
                contentStream.showText(position.getUnicode());
                x = position.getX();
                y = position.getY();
            }
            contentStream.endText();
        }
    }
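
The drawing loop above works because `newLineAtOffset` moves relative to the start of the previous line, so each call receives a difference rather than a page coordinate. The following is a minimal sketch in plain Java (no PDFBox involved) that traces this arithmetic. `OffsetSketch` and its methods are hypothetical names, and the input coordinates are assumed to have their y axis pointing down from the top of the page, as the sign flip in the loop suggests.

```java
import java.util.ArrayList;
import java.util.List;

/** Traces the relative-offset arithmetic of the drawing loop without PDFBox. */
final class OffsetSketch {

    /**
     * positions: {x, y} pairs, y measured downwards from the page top (assumption).
     * yOffset: top plus bottom coordinate of the target crop box.
     * Returns the {tx, ty} pairs the loop would pass to newLineAtOffset.
     */
    static List<float[]> relativeOffsets(List<float[]> positions, float yOffset) {
        List<float[]> offsets = new ArrayList<>();
        float x = 0;
        float y = yOffset;
        for (float[] p : positions) {
            // Same expressions as in the loop: difference to the previous glyph, y negated.
            offsets.add(new float[] { p[0] - x, -(p[1] - y) });
            x = p[0];
            y = p[1];
        }
        return offsets;
    }

    /** Sums the relative offsets, which is what the accumulated line starts amount to. */
    static List<float[]> absolutePositions(List<float[]> offsets) {
        List<float[]> result = new ArrayList<>();
        float x = 0;
        float y = 0;
        for (float[] o : offsets) {
            x += o[0];
            y += o[1];
            result.add(new float[] { x, y });
        }
        return result;
    }
}
```

For glyphs at (72, 100), (80, 100) and (72, 114) on a page with `yOffset` 842, the offsets come out as (72, 742), (8, 0) and (-8, -14), and summing them places the glyphs at (72, 742), (80, 742) and (72, 728) in bottom-up page coordinates.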
    

    You can apply it to a full document like this:

    void copyText(PDDocument source, PDDocument target) throws IOException {
        for (int i = 0; i < source.getNumberOfPages(); i++) {
            PDPage sourcePage = source.getPage(i);
            PDPage targetPage = null;
            if (i < target.getNumberOfPages())
                targetPage = target.getPage(i);
            else
                target.addPage(targetPage = new PDPage(sourcePage.getMediaBox()));
            copyText(source, i, target, targetPage);
        }
    }
    

    Applied to some example documents one gets:

    Original                    Copy
    input.pdf                   input-TextCopy.pdf
    TemplateTank.pdf            TemplateTank-TextCopy.pdf
    test_document_signed.pdf    test_document_signed-TextCopy.pdf

    As is to be expected, "text" that is actually drawn as a bitmap image is not copied.

    Also beware: this is just a proof of concept, not a complete implementation. In particular, page rotation and non-upright text in general are not supported. The only supported style attributes are font and font size; other details (e.g. text color) are ignored. Different page geometries in source and target will also result in weird appearances.

    Manipulating the text

    In a comment you asked

    If I wanted to replace some words in the source document with others in the target document and then format it, how could I modify the code?

    Replacing some glyphs while keeping everything else in place is fairly easy. The TextPosition instances in allTextPositions are sorted the same way as the normal text output of the PDFTextStripper. To find certain words, therefore, you can simply search allTextPositions for sequences of TextPosition instances whose texts concatenate to the search word.
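
The search described here can be tried out on plain strings first. The sketch below is only an illustration under assumptions: each string stands in for the `getUnicode()` value of one `TextPosition`, and `GlyphSearchSketch.findRuns` is a hypothetical helper, not part of PDFBox. It tries every start index, so it is simple rather than fast.

```java
import java.util.ArrayList;
import java.util.List;

/** Finds runs of glyph texts that concatenate to a search word. */
final class GlyphSearchSketch {

    /** Returns {start, endExclusive} index pairs; overlapping runs are all reported. */
    static List<int[]> findRuns(List<String> glyphTexts, String word) {
        List<int[]> runs = new ArrayList<>();
        if (word == null || word.isEmpty())
            return runs;
        for (int start = 0; start < glyphTexts.size(); start++) {
            if (glyphTexts.get(start).isEmpty())
                continue; // a run should not begin with a glyph that carries no text
            StringBuilder candidate = new StringBuilder();
            for (int end = start; end < glyphTexts.size(); end++) {
                candidate.append(glyphTexts.get(end));
                if (!word.startsWith(candidate.toString()))
                    break; // this start index cannot lead to a match
                if (candidate.length() == word.length()) {
                    runs.add(new int[] { start, end + 1 });
                    break;
                }
            }
        }
        return runs;
    }
}
```

Because a glyph may carry more than one character (a ligature, for example), the comparison is done on the concatenated text rather than glyph by glyph. Restarting at every index also finds a word that begins in the middle of a failed partial match, as with the glyphs a, a, b and the search word "ab".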

    To allow for this, I extended the above methods to additionally accept a Consumer that is called between retrieval and drawing:

    void copyText(PDDocument source, PDDocument target, Consumer<List<TextPosition>> updater) throws IOException {
        for (int i = 0; i < source.getNumberOfPages(); i++) {
            PDPage sourcePage = source.getPage(i);
            PDPage targetPage = null;
            if (i < target.getNumberOfPages())
                targetPage = target.getPage(i);
            else
                target.addPage(targetPage = new PDPage(sourcePage.getMediaBox()));
            copyText(source, i, target, targetPage, updater);
        }
    }
    
    void copyText(PDDocument source, int sourcePageNumber, PDDocument target, PDPage targetPage, Consumer<List<TextPosition>> updater) throws IOException {
        List<TextPosition> allTextPositions = new ArrayList<>();
        PDFTextStripper pdfTextStripper = new PDFTextStripper() {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
                allTextPositions.addAll(textPositions);
                super.writeString(text, textPositions);
            }
        };
        pdfTextStripper.setStartPage(sourcePageNumber + 1);
        pdfTextStripper.setEndPage(sourcePageNumber + 1);
        pdfTextStripper.getText(source);
    
        if (updater != null)
            updater.accept(allTextPositions);
    
        PDRectangle targetPageCropBox = targetPage.getCropBox();
        float yOffset = targetPageCropBox.getUpperRightY() + targetPageCropBox.getLowerLeftY();
    
        try (PDPageContentStream contentStream = new PDPageContentStream(target, targetPage, AppendMode.APPEND, true, true)) {
            contentStream.beginText();
            float x = 0;
            float y = yOffset;
            for (TextPosition position: allTextPositions) {
                contentStream.setFont(position.getFont(), position.getFontSizeInPt());
                contentStream.newLineAtOffset(position.getX() - x, - (position.getY() - y));
                contentStream.showText(position.getUnicode());
                x = position.getX();
                y = position.getY();
            }
            contentStream.endText();
        }
    }
    

    (CopyFormattedPageText methods)
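
Since the hook is a plain `java.util.function.Consumer` that edits the list in place, several updates can be combined with `andThen` and passed as one updater. A minimal sketch of that idea on strings follows; `UpdaterSketch`, `process` and `swap` are hypothetical stand-ins for the retrieve, update and draw steps and are not taken from the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Shows how an in-place updater hook sits between collecting and drawing. */
final class UpdaterSketch {

    /** Stand-in for copyText: collect, let the updater edit the list, then "draw". */
    static String process(List<String> collected, Consumer<List<String>> updater) {
        List<String> glyphs = new ArrayList<>(collected); // mutable working copy
        if (updater != null)
            updater.accept(glyphs);
        return String.join("", glyphs);
    }

    /** Builds an updater that replaces every element equal to from with to. */
    static Consumer<List<String>> swap(String from, String to) {
        return list -> list.replaceAll(s -> s.equals(from) ? to : s);
    }
}
```

Calling `UpdaterSketch.process(List.of("a", "b", "c"), UpdaterSketch.swap("a", "x").andThen(UpdaterSketch.swap("c", "z")))` applies both updates in order and yields "xbz"; passing `null` leaves the text unchanged, mirroring the null check in the method above.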

    Now there are different strategies for replacing the glyphs. I've implemented two simple ones.

    The first strategy replaces a search word in the list of TextPosition objects by replacing the letters in each instance with the same number of letters from the replacement word (as long as letters are available). This is appropriate if the word is specially formatted (e.g. spaced out) and this special formatting is to be kept.

    void searchAndReplace(List<TextPosition> textPositions, String searchWord, String replacement) {
        if (searchWord == null || searchWord.length() == 0)
            return;
    
        int candidatePosition = 0;
        String candidate = "";
        for (int i = 0; i < textPositions.size(); i++) {
            candidate += textPositions.get(i).getUnicode();
            if (!searchWord.startsWith(candidate)) {
                candidate = "";
                candidatePosition = i+1;
            } else if (searchWord.length() == candidate.length()) {
                for (int j = 0; j < searchWord.length();) {
                    TextPosition textPosition = textPositions.get(candidatePosition);
                    int length = textPosition.getUnicode().length();
                    String replacementHere = "";
                    if (length > 0 && j < replacement.length()) {
                        int end = j + length;
                        if (end > replacement.length())
                            end = replacement.length();
                        replacementHere = replacement.substring(j, end);
                    }
                    TextPosition newTextPosition = new TextPosition(textPosition.getRotation(),
                            textPosition.getPageWidth(), textPosition.getPageHeight(), textPosition.getTextMatrix(),
                            textPosition.getEndX(), textPosition.getEndY(), textPosition.getHeight(),
                            textPosition.getIndividualWidths()[0], textPosition.getWidthOfSpace(),
                            replacementHere,
                            textPosition.getCharacterCodes(), textPosition.getFont(),
                            textPosition.getFontSize(), (int) textPosition.getFontSizeInPt());
                    textPositions.set(candidatePosition, newTextPosition);
                    candidatePosition++;
                    j += length;
                }
                candidate = ""; // reset so the search restarts cleanly after the replaced word
            }
        }
    }
    

    (CopyFormattedPageText method)
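
The way the first strategy hands out replacement letters can be seen in isolation on plain strings. This is a minimal sketch, assuming each string is the text of one matched glyph; `DistributeSketch.distribute` is a hypothetical helper that mirrors the inner loop above without any PDFBox types.

```java
import java.util.ArrayList;
import java.util.List;

/** Distributes replacement letters over the glyphs of a matched word. */
final class DistributeSketch {

    /** Gives each glyph as many replacement letters as it carried before, while letters last. */
    static List<String> distribute(List<String> matchedGlyphs, String replacement) {
        List<String> result = new ArrayList<>();
        int used = 0; // number of letters of the search word already passed
        for (String glyph : matchedGlyphs) {
            int end = Math.min(used + glyph.length(), replacement.length());
            // Glyphs beyond the end of the replacement get empty text but keep their slot.
            result.add(used < end ? replacement.substring(used, end) : "");
            used += glyph.length();
        }
        return result;
    }
}
```

For the glyphs T, e, s, t and the replacement "Art" this yields A, r, t and an empty string, so the last glyph position stays blank. A replacement longer than the search word is cut off: "Costume" over the same four glyphs becomes C, o, s, t.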

    The second strategy replaces a search word in the list of TextPosition objects by replacing the letters in the first instance with the whole replacement word and removing the other instances. This is appropriate if the word is not specially formatted (e.g. spaced out) and is to be printed naturally.

    void searchAndReplaceAlternative(List<TextPosition> textPositions, String searchWord, String replacement) {
        if (searchWord == null || searchWord.length() == 0)
            return;
    
        int candidatePosition = 0;
        String candidate = "";
        for (int i = 0; i < textPositions.size(); i++) {
            candidate += textPositions.get(i).getUnicode();
            if (!searchWord.startsWith(candidate)) {
                candidate = "";
                candidatePosition = i+1;
            } else if (searchWord.length() == candidate.length()) {
                TextPosition textPosition = textPositions.get(candidatePosition);
                TextPosition newTextPosition = new TextPosition(textPosition.getRotation(),
                        textPosition.getPageWidth(), textPosition.getPageHeight(), textPosition.getTextMatrix(),
                        textPosition.getEndX(), textPosition.getEndY(), textPosition.getHeight(),
                        textPosition.getIndividualWidths()[0], textPosition.getWidthOfSpace(),
                        replacement,
                        textPosition.getCharacterCodes(), textPosition.getFont(),
                        textPosition.getFontSize(), (int) textPosition.getFontSizeInPt());
                textPositions.set(candidatePosition, newTextPosition);
    
                while (i > candidatePosition) {
                    textPositions.remove(i--);
                }
                candidatePosition++;
                candidate = ""; // reset so the search restarts cleanly after the replaced word
            }
        }
    }
    

    (CopyFormattedPageText method)
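
The list surgery of the second strategy, reduced to plain strings, looks like this. It is a sketch under assumptions: the run boundaries are taken as already known, and `CollapseSketch.collapse` is a hypothetical helper that works on a copy instead of editing the list in place as the method above does.

```java
import java.util.ArrayList;
import java.util.List;

/** Collapses a matched run of glyphs into one element carrying the whole replacement. */
final class CollapseSketch {

    /** Puts the replacement into the first element of the run and removes the rest of the run. */
    static List<String> collapse(List<String> glyphs, int start, int endExclusive, String replacement) {
        List<String> result = new ArrayList<>(glyphs);
        result.set(start, replacement);
        // Clearing the sub-list view removes the remaining run elements from result.
        result.subList(start + 1, endExclusive).clear();
        return result;
    }
}
```

Collapsing indices 2 to 5 of A, space, D, O, C, ! with "COSTUME" leaves A, space, COSTUME, !. In the PDF case the elements after the run keep their original coordinates, so a replacement that is wider than the original word would be expected to run into the following text.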

    You use these strategies like this in your copyText calls:

    copyText(source, target, list -> searchAndReplace(list, "Test", "Art"));
    ...
    copyText(source, target, list -> searchAndReplaceAlternative(list, "DOCUMENT", "COSTUME"));
    

    (CopyFormattedPageText test methods)

    Beware, though: if the fonts used are subset-embedded, the glyphs for the replacement text may not exist in that font. In that case, create and use another font that does include the replacement glyphs. Also, the replacement should be about as long as the search word, because the surrounding text stays where it is.


    As you did not mention a specific PDFBox version, I used the current 3.0.1.