java pdfbox

Filter out all text above a certain font size from PDF


As the title says, I want to filter out all text from a PDF that is above a certain font size. Currently, I am using the PDFBox library but I am open to using any other free library for Java.

My approach was to use a PDFStreamParser to iterate through the tokens. When I pass a Tf operator that has a size greater than my threshold, don't add the next Tj/TJ that is seen. However, it has become clear to me that this relatively simple approach will not work because the text may be scaled by the current transformation matrix.

Is there a better approach I could be taking, or a way to make my approach work without it getting too complicated?


Solution

  • Your approach

    When I pass a Tf operator that has a size greater than my threshold, don't add the next Tj/TJ that is seen.

    is too simple.

    On one hand, as you remark yourself,

    the text may be scaled by the current transformation matrix.

    (Actually not only by the transformation matrix but also by the text matrix!)

    Thus, you have to keep track of these matrices.

    On the other hand, Tf doesn't only set the base font size for the next text drawing instruction seen; it sets it until the size is explicitly changed by some other instruction.

    Furthermore, the text font size and the current transformation matrix are part of the graphics state; thus, they are subject to save state and restore state instructions.
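
    To make this bookkeeping concrete, here is a minimal sketch in plain Java (no PDFBox) that tracks only the Tf size across q/Q. The class FontSizeTrackerSketch and its whitespace-based tokenizing are made up for illustration; a real content stream needs a proper parser, and the matrices are ignored entirely here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model (hypothetical): tracks only the Tf font size across q/Q. */
final class FontSizeTrackerSketch {

    /**
     * Returns the base font size in effect at each Tj operator.
     * The input is a simplified, whitespace-separated content stream.
     */
    static List<Double> sizesAtTextShowing(String contentStream) {
        Deque<Double> savedSizes = new ArrayDeque<>(); // graphics state stack for q/Q
        List<String> operands = new ArrayList<>();
        List<Double> result = new ArrayList<>();
        double fontSize = 0;

        for (String token : contentStream.trim().split("\\s+")) {
            switch (token) {
                case "q":  savedSizes.push(fontSize); break;
                case "Q":  if (!savedSizes.isEmpty()) fontSize = savedSizes.pop(); break;
                case "Tf": fontSize = Double.parseDouble(operands.get(operands.size() - 1)); break;
                case "Tj": result.add(fontSize); break;
                default:   operands.add(token); continue; // operand: keep for the next operator
            }
            operands.clear(); // an operator consumes its operands
        }
        return result;
    }

    public static void main(String[] args) {
        String stream = "/F1 12 Tf (a) Tj (b) Tj q /F1 200 Tf (c) Tj Q (d) Tj";
        System.out.println(sizesAtTextShowing(stream)); // [12.0, 12.0, 200.0, 12.0]
    }
}
```

    In the sample stream the last Tj sees size 12 again because Q restores the state saved by q, and both (a) and (b) are drawn at size 12 although only one Tf precedes them.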

    To edit a content stream with respect to the current state, therefore, you have to keep track of a lot of information. Fortunately, PDFBox contains classes to do the heavy lifting here, the class hierarchy based on the PDFStreamEngine, allowing you to concentrate on your task. To have as much information as possible available for editing, the PDFGraphicsStreamEngine class appears to be a good choice to build upon.

    A generic content stream editor class

    Thus, let's derive PdfContentStreamEditor from PDFGraphicsStreamEngine and add some code for generating a replacement content stream.

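    // imports for PDFBox 2.0.x
    import java.awt.geom.Point2D;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.List;

    import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
    import org.apache.pdfbox.contentstream.operator.Operator;
    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDStream;
    import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

    public class PdfContentStreamEditor extends PDFGraphicsStreamEngine {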
        public PdfContentStreamEditor(PDDocument document, PDPage page) {
            super(page);
            this.document = document;
        }
    
        /**
         * <p>
         * This method receives the next operation before its registered
         * listener is called. The default does nothing.
         * </p>
         * <p>
         * Override this method to retrieve state information from before the
         * operation execution.
         * </p> 
         */
        protected void nextOperation(Operator operator, List<COSBase> operands) {
            
        }
    
        /**
         * <p>
         * This method writes content stream operations to the target canvas. The default
         * implementation writes them as they come, so it essentially generates identical
         * copies of the original instructions that {@link #processOperator(Operator, List)}
         * forwards to it.
         * </p>
         * <p>
         * Override this method to achieve some fancy editing effect.
         * </p> 
         */
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
            contentStreamWriter.writeTokens(operands);
            contentStreamWriter.writeToken(operator);
        }
    
        // stub implementation of PDFGraphicsStreamEngine abstract methods
        @Override
        public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException { }
    
        @Override
        public void drawImage(PDImage pdImage) throws IOException { }
    
        @Override
        public void clip(int windingRule) throws IOException { }
    
        @Override
        public void moveTo(float x, float y) throws IOException { }
    
        @Override
        public void lineTo(float x, float y) throws IOException { }
    
        @Override
        public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException { }
    
        @Override
        public Point2D getCurrentPoint() throws IOException { return null; }
    
        @Override
        public void closePath() throws IOException { }
    
        @Override
        public void endPath() throws IOException { }
    
        @Override
        public void strokePath() throws IOException { }
    
        @Override
        public void fillPath(int windingRule) throws IOException { }
    
        @Override
        public void fillAndStrokePath(int windingRule) throws IOException { }
    
        @Override
        public void shadingFill(COSName shadingName) throws IOException { }
    
        // PDFStreamEngine overrides to allow editing
        @Override
        public void processPage(PDPage page) throws IOException {
            PDStream stream = new PDStream(document);
            replacement = new ContentStreamWriter(replacementStream = stream.createOutputStream(COSName.FLATE_DECODE));
            super.processPage(page);
            replacementStream.close();
            page.setContents(stream);
            replacement = null;
            replacementStream = null;
        }
    
        @Override
        public void showForm(PDFormXObject form) throws IOException {
            // DON'T descend into XObjects
            // super.showForm(form);
        }
    
        @Override
        protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
            nextOperation(operator, operands);
            super.processOperator(operator, operands);
            write(replacement, operator, operands);
        }
    
        final PDDocument document;
        OutputStream replacementStream = null;
        ContentStreamWriter replacement = null;
    }
    

    (PdfContentStreamEditor class)

    This code overrides processPage to create a new page content stream and eventually replace the old one with it. And it overrides processOperator to provide the processed instruction for editing.

    To edit the content, one simply overrides write. The default implementation writes the instructions unchanged, while an override may alter, replace, or drop them. Overriding nextOperation allows you to peek at the graphics state before the current instruction is applied to it.

    Applying the editor as is,

    PDDocument document = PDDocument.load(SOURCE);
    for (PDPage page : document.getDocumentCatalog().getPages()) {
        PdfContentStreamEditor identity = new PdfContentStreamEditor(document, page);
        identity.processPage(page);
    }
    document.save(RESULT);
    

    (EditPageContent test testIdentityInput)

    therefore, will create a result PDF with equivalent content streams.

    Customizing the content stream editor for your use case

    You want to

    filter out all text from a PDF that is above a certain font size.

    Thus, we have to check in write whether the current instruction is a text drawing instruction, and if it is, we have to check the current effective font size, i.e. the base font size transformed by the text matrix and the current transformation matrix. If the effective font size is too large, we have to drop the instruction.

    This can be done as follows:

    PDDocument document = PDDocument.load(SOURCE);
    for (PDPage page : document.getDocumentCatalog().getPages()) {
        PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
            @Override
            protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
                String operatorString = operator.getName();
    
                if (TEXT_SHOWING_OPERATORS.contains(operatorString))
                {
                    // base font size as set in the text state
                    float fs = getGraphicsState().getTextState().getFontSize();
                    // text matrix first, then current transformation matrix
                    Matrix matrix = getTextMatrix().multiply(getGraphicsState().getCurrentTransformationMatrix());
                    // effective size: length of the transformed vector (0, fs)
                    Point2D.Float transformedFsVector = matrix.transformPoint(0, fs);
                    Point2D.Float transformedOrigin = matrix.transformPoint(0, 0);
                    double transformedFs = transformedFsVector.distance(transformedOrigin);
                    // drop text drawn larger than the threshold
                    if (transformedFs > MAX_FONT_SIZE)
                        return;
                }
    
                super.write(contentStreamWriter, operator, operands);
            }
    
            final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
            final float MAX_FONT_SIZE = 100; // threshold, adjust as needed
        };
        editor.processPage(page);
    }
    document.save(RESULT);
    

    (EditPageContent test testRemoveBigTextDocument)
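
    The effective font size computation used in write can also be tried in isolation. The following sketch uses java.awt.geom.AffineTransform instead of the PDFBox Matrix class; the class EffectiveFontSizeSketch and its method are hypothetical names for illustration only, and the sample matrices are made up.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical helper: effective font size under text matrix and CTM. */
final class EffectiveFontSizeSketch {

    /**
     * Length of the vector (0, fontSize) after applying the text matrix
     * first and the current transformation matrix second.
     */
    static double effectiveFontSize(double fontSize, AffineTransform textMatrix, AffineTransform ctm) {
        AffineTransform combined = new AffineTransform(ctm);
        combined.concatenate(textMatrix); // result applies textMatrix first, then ctm
        Point2D origin = combined.transform(new Point2D.Double(0, 0), null);
        Point2D top = combined.transform(new Point2D.Double(0, fontSize), null);
        return top.distance(origin);
    }

    public static void main(String[] args) {
        // Tf sets size 12, the text matrix scales by 2, the CTM scales by 5
        AffineTransform textMatrix = AffineTransform.getScaleInstance(2, 2);
        AffineTransform ctm = AffineTransform.getScaleInstance(5, 5);
        System.out.println(effectiveFontSize(12, textMatrix, ctm));
    }
}
```

    With these sample values a base size of 12 ends up at an effective size of about 120, which is why looking at the Tf operand alone is not enough. Measuring the distance between two transformed points, rather than reading a single matrix entry, keeps the result meaningful for rotated text too.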

    Strictly speaking, completely dropping the instruction in question may not suffice; instead, one would have to replace it with an instruction that changes the text matrix just as the dropped text drawing instruction would have done. Otherwise, the text that follows and is not dropped may be moved. Often, though, this works as is because the text matrix is set anew for the following text. So let's keep it simple here.
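
    For those who do need such a replacement, here is a rough sketch of the arithmetic in plain Java. It assumes horizontal writing, glyph widths given in thousandths of a text space unit (as for most fonts other than Type 3), and that you obtain the widths from the font yourself. The class DroppedTextAdvanceSketch and its methods are hypothetical; the formula follows the text-showing rules of the PDF specification, where a number n in a TJ array moves the text position by -n / 1000 times font size times horizontal scaling.

```java
/** Hypothetical helper: how far a dropped text string would have moved the text position. */
final class DroppedTextAdvanceSketch {

    /**
     * Horizontal displacement in text space caused by showing the glyphs:
     * sum over glyphs of (w0 * Tfs + Tc + Tw) * Th, where Tw only applies
     * to glyphs flagged as the single-byte space character.
     */
    static double advance(double[] glyphWidths, boolean[] isSpace, double fontSize,
                          double charSpacing, double wordSpacing, double horizontalScaling) {
        double tx = 0;
        for (int i = 0; i < glyphWidths.length; i++) {
            double w0 = glyphWidths[i] / 1000.0; // glyph space to text space
            double tw = isSpace[i] ? wordSpacing : 0;
            tx += (w0 * fontSize + charSpacing + tw) * horizontalScaling;
        }
        return tx;
    }

    /** Number n such that "[ n ] TJ" moves the text position by tx without drawing anything. */
    static double tjAdjustment(double tx, double fontSize, double horizontalScaling) {
        return -1000.0 * tx / (fontSize * horizontalScaling);
    }

    public static void main(String[] args) {
        double[] widths = {500, 250, 500};          // made-up glyph widths
        boolean[] spaces = {false, true, false};
        double tx = advance(widths, spaces, 10, 0.5, 2, 1);
        System.out.println("[ " + tjAdjustment(tx, 10, 1) + " ] TJ");
    }
}
```

    Writing such a TJ instruction in place of the dropped one keeps the following text where it was, at the cost of having to look up glyph widths; horizontal scaling is passed as a factor here (1 for the default of 100 percent).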

    Constraints and remarks

    This PdfContentStreamEditor only edits the page content stream. XObjects and Patterns used from there are currently not edited by the editor. It should be easy, though, to recursively iterate over the XObjects and Patterns after editing the page content stream and edit them in a similar fashion.

    This PdfContentStreamEditor essentially is a port of the PdfContentStreamEditor for iText 5 (.Net/Java) from this answer and the PdfCanvasEditor for iText 7 from this answer. The examples for using those editor classes may give some hints on how to use this PdfContentStreamEditor for PDFBox.

    A similar (but less generic) approach has been used previously in the HelloSignManipulator class in this answer.

    Fixing a bug

    In the context of this question, a bug in the PdfContentStreamEditor was found which caused some text lines in the example PDF in focus there to be moved.

    The background: Some PDF instructions are defined via other ones, e.g. tx ty TD is specified to have the same effect as -ty TL tx ty Td. The corresponding PDFBox OperatorProcessor implementations for simplicity work by feeding the equivalent instructions back into the stream engine.

    In such a case, the PdfContentStreamEditor as implemented above receives signals for both the replacement instructions and the original instruction and writes them all back into the result stream. Thus, the effect of those instructions is doubled. E.g. in case of the TD instruction the text insertion point is advanced two lines instead of one.

    Thus, we have to ignore the replacement instructions when writing. To do so, replace the method processOperator above with

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        if (inOperator) {
            super.processOperator(operator, operands);
        } else {
            inOperator = true;
            nextOperation(operator, operands);
            super.processOperator(operator, operands);
            write(replacement, operator, operands);
            inOperator = false;
        }
    }
    
    boolean inOperator = false;
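
    The effect of this guard can be reproduced without PDFBox. The following toy model is only a sketch: the class ReentrancyGuardSketch is hypothetical, operators are plain strings, and execute stands in for the PDFBox operator processors by feeding TL and Td back in when it sees TD.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical) of an engine whose TD handling feeds TL and Td back in. */
final class ReentrancyGuardSketch {
    final List<String> written = new ArrayList<>();
    final boolean guarded;
    boolean inOperator = false;

    ReentrancyGuardSketch(boolean guarded) {
        this.guarded = guarded;
    }

    /** Stand-in for super.processOperator: TD is executed via equivalent instructions. */
    void execute(String operator) {
        if (operator.equals("TD")) {
            processOperator("TL");
            processOperator("Td");
        }
    }

    void processOperator(String operator) {
        if (guarded && inOperator) {
            execute(operator); // nested call: update state only, write nothing
            return;
        }
        inOperator = true;
        try {
            execute(operator);
            written.add(operator); // stand-in for write(replacement, operator, operands)
        } finally {
            inOperator = false;
        }
    }

    static List<String> run(boolean guarded) {
        ReentrancyGuardSketch engine = new ReentrancyGuardSketch(guarded);
        engine.processOperator("TD");
        return engine.written;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // [TL, Td, TD]
        System.out.println(run(true));  // [TD]
    }
}
```

    Without the guard the output stream contains TL and Td in addition to TD, so the line move happens twice; with the guard only the original TD is written. The try/finally in the sketch is an addition for robustness and is not part of the fix above.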