pdfbox watermark

Remove a sentence from a PDF with PDFBox


I am working on removing a watermark and ran into a problem: how do I remove a sentence from a PDF file? My idea is to record the order of the text-showing operators (Tj, TJ, ', ...) while processing them, giving each one an index (showIdx). Once the sentence to be removed is found, I look up the indices of the operators that draw it, process the content stream again, and delete those operators. The answer at https://stackoverflow.com/questions/58475104/filter-out-all-text-above-a-certain-font-size-from-pdf [1] introduces PdfContentStreamEditor, but I could not work out how to use it for this.

BT
Tj   showIdx1
TJ   showIdx2
...
ET

BT
Tj   showIdx3
TJ   showIdx4
...
ET
...
The sample PDF file: https://github.com/zhongguogu/PDFBOX/blob/master/pdf/watermark.pdf
The text to remove is in the page header: "本报告仅供-中庚基金管理有限公司-中庚报告邮箱使用 p2"

Solution

  • According to Google Translate, that line says "This report is only for-Zhong Geng Fund Management Co., Ltd.-Zhong Geng Report Mailbox". This quite likely means that the report was indeed meant for Zhong Geng's eyes only. But let's assume they decided to publish those reports more widely and you have been given the task of removing that soft restriction.

    You mentioned the PdfContentStreamEditor from this answer.

    Indeed you can use it similar to how it has been used in this answer where a string "[QR]" was to be removed from underneath some QR codes:

    PDDocument document = ...
    for (PDPage page : document.getDocumentCatalog().getPages()) {
        PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
            // Text of the glyphs shown since the previous instruction was written.
            final StringBuilder recentChars = new StringBuilder();
    
            @Override
            protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, Vector displacement)
                    throws IOException {
                String string = font.toUnicode(code);
                if (string != null)
                    recentChars.append(string);
    
                super.showGlyph(textRenderingMatrix, font, code, displacement);
            }
    
            @Override
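            protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
                // Everything collected in showGlyph since the last write belongs to the current instruction.
                String recentText = recentChars.toString();
                recentChars.setLength(0);
                String operatorString = operator.getName();
    
                // Drop a text-showing instruction that draws exactly the text to remove.
                if (TEXT_SHOWING_OPERATORS.contains(operatorString) && "本报告仅供-中庚基金管理有限公司-中庚报告邮箱使用 p2".equals(recentText))
                {
                    return;
                }
    
                // Forward every other instruction unchanged.
                super.write(contentStreamWriter, operator, operands);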
            }
    
            final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
        };
        editor.processPage(page);
    }
    document.save("watermark-RemoveByText.pdf");
    

    (RemoveText test testRemoveByText)

    Beware, though: this only works if the text to remove is drawn by a single text-showing instruction and that instruction draws nothing but the text to remove.

    If instead the text to remove is drawn by multiple instructions following each other, you have to collect instructions for as long as they form a potential match instead of dropping them immediately. As soon as the potential match turns out not to be a match after all, you have to pass the collected instructions to super.write.
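
    The collect-then-decide logic for this multi-instruction case can be sketched independently of PDFBox. The following is a minimal model, not the PdfContentStreamEditor API: Instr and BufferedTextRemover are hypothetical names, each instruction is reduced to its operator name plus the text it draws, and any non-text instruction is assumed to end a potential match.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Simplified stand-in for one content stream instruction (hypothetical, not a PDFBox type). */
final class Instr {
    final String operator; // e.g. "Tj", "TJ", "Td"
    final String text;     // text drawn by the instruction, or null if it draws none

    Instr(String operator, String text) {
        this.operator = operator;
        this.text = text;
    }
}

final class BufferedTextRemover {
    private final String target;
    private final Deque<Instr> pending = new ArrayDeque<>(); // held-back potential match
    private final List<Instr> output = new ArrayList<>();    // instructions actually written

    BufferedTextRemover(String target) {
        if (target.isEmpty()) throw new IllegalArgumentException("target must not be empty");
        this.target = target;
    }

    /** Plays the role of write(...): hold back, drop, or forward the instruction. */
    void write(Instr instr) {
        if (instr.text == null) {
            // Assumption of this sketch: a non-text instruction ends any potential match.
            flush();
            output.add(instr);
            return;
        }
        pending.addLast(instr);
        // Forward the oldest held-back instructions until the rest is a prefix of the target again.
        while (!target.startsWith(pendingText())) {
            output.add(pending.removeFirst());
        }
        if (target.equals(pendingText())) {
            pending.clear(); // full match: drop the collected instructions
        }
    }

    /** Forwards whatever is still held back and returns everything written. */
    List<Instr> finish() {
        flush();
        return output;
    }

    private void flush() {
        output.addAll(pending);
        pending.clear();
    }

    private String pendingText() {
        StringBuilder sb = new StringBuilder();
        for (Instr i : pending) sb.append(i.text);
        return sb.toString();
    }
}
```

    In an actual PdfContentStreamEditor subclass the same decision would sit in the write override, with the held-back operators and operands passed on to super.write once a potential match fails.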

    And if instead the text to remove is only part of what a single instruction draws, you'll have to modify that instruction itself. Depending on the script of the text, this may be very difficult, in particular if it makes heavy use of ligatures and similar features.
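
    For the simplest variant of this case, a Tj whose string contains the text somewhere in the middle, one option is to replace the instruction with a TJ array that shows the text before and after the gap and skips the gap with a position adjustment. The sketch below models only that operand arithmetic. TjSplitter and its glyphWidth parameter are hypothetical, and it assumes one character per glyph (no ligatures), glyph widths given in thousandths of a text space unit, horizontal writing, and no character or word spacing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

final class TjSplitter {
    /**
     * Builds the elements of a TJ array that shows `shown` without the first
     * occurrence of `target`. Strings are shown as text; a negative number
     * moves the following text to the right by that many thousandths of a unit.
     * Returns null if `target` does not occur in `shown`.
     */
    static List<Object> cut(String shown, String target, IntUnaryOperator glyphWidth) {
        int start = shown.indexOf(target);
        if (start < 0) return null;

        // Total advance of the removed glyphs (assumption: one char is one glyph).
        int removedWidth = 0;
        for (int i = 0; i < target.length(); i++) {
            removedWidth += glyphWidth.applyAsInt(target.charAt(i));
        }

        String before = shown.substring(0, start);
        String after = shown.substring(start + target.length());

        List<Object> array = new ArrayList<>();
        if (!before.isEmpty()) array.add(before);
        if (!after.isEmpty()) {
            array.add(-removedWidth); // keeps the remaining text where it was
            array.add(after);
        }
        return array;
    }
}
```

    With real fonts the widths would have to come from the font in use, and composite fonts, character spacing, or ligatures make the mapping from text back to string bytes considerably harder than this model suggests.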

    And the most complex situations may require you to collect all instructions while they're coming in, analyzing the whole of them, adapting identified instructions, and then forwarding the manipulated collected instructions to super.write.
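
    A collect-everything variant can be sketched as two passes over a simplified instruction list. The first pass records which instruction drew each character, the analysis locates the text in the concatenated string, and the second pass forwards everything except the identified instructions. TwoPassRemover and its nested Instr are hypothetical stand-ins rather than PDFBox types, and the sketch only drops whole instructions: a match that starts or ends inside an instruction is left untouched.

```java
import java.util.ArrayList;
import java.util.List;

final class TwoPassRemover {
    /** Simplified instruction (not a PDFBox type): operator name plus drawn text, null if none. */
    static final class Instr {
        final String operator;
        final String text;

        Instr(String operator, String text) {
            this.operator = operator;
            this.text = text;
        }
    }

    static List<Instr> remove(List<Instr> all, String target) {
        if (target.isEmpty()) throw new IllegalArgumentException("target must not be empty");

        // Pass 1: concatenate the shown text and remember which instruction drew each character.
        StringBuilder text = new StringBuilder();
        List<Integer> owner = new ArrayList<>();
        for (int i = 0; i < all.size(); i++) {
            String shown = all.get(i).text;
            if (shown == null) continue;
            text.append(shown);
            for (int k = 0; k < shown.length(); k++) owner.add(i);
        }

        // Analysis: mark the text-showing instructions that together draw exactly the target.
        boolean[] drop = new boolean[all.size()];
        for (int at = text.indexOf(target); at >= 0; at = text.indexOf(target, at + 1)) {
            int end = at + target.length();
            int first = owner.get(at);
            int last = owner.get(end - 1);
            boolean startsAtBoundary = at == 0 || owner.get(at - 1) != first;
            boolean endsAtBoundary = end == owner.size() || owner.get(end) != last;
            if (startsAtBoundary && endsAtBoundary) {
                for (int i = first; i <= last; i++) {
                    if (all.get(i).text != null) drop[i] = true; // keep non-text instructions
                }
            }
        }

        // Pass 2: forward everything that was not marked.
        List<Instr> out = new ArrayList<>();
        for (int i = 0; i < all.size(); i++) {
            if (!drop[i]) out.add(all.get(i));
        }
        return out;
    }
}
```

    Unlike a streaming filter, this version also finds text whose instructions are separated by positioning operators, at the price of keeping the whole instruction list in memory.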