As the title says, I want to filter out all text from a PDF that is above a certain font size. Currently, I am using the PDFBox library but I am open to using any other free library for Java.
My approach was to use a PDFStreamParser to iterate through the tokens. When I pass a Tf operator that has a size greater than my threshold, don't add the next Tj/TJ that is seen. However, it has become clear to me that this relatively simple approach will not work because the text may be scaled by the current transformation matrix.
Is there a better approach I could be taking, or a way to make my approach work without it getting too complicated?
Your approach, "When I pass a Tf operator that has a size greater than my threshold, don't add the next Tj/TJ that is seen", is too simple.
On one hand, as you remark yourself, "the text may be scaled by the current transformation matrix." (Actually not only by the transformation matrix but also by the text matrix!) Thus, you have to keep track of these matrices.
On the other hand, Tf doesn't only set the base font size for the next text drawing instruction seen; it sets it until the size is explicitly changed by some other instruction.
Furthermore, the text font size and the current transformation matrix are part of the graphics state; thus, they are subject to save state and restore state instructions.
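To illustrate these two points, here is a minimal sketch that does not use PDFBox at all. The class FontSizeTracker and its simplified, whitespace-separated token input are hypothetical and exist only for this illustration. It follows nothing but the Tf font size through q/Q and ignores the matrices entirely, so it is not a usable filter, just a picture of the bookkeeping.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical toy tracker: follows only the Tf font size through q / Q. */
class FontSizeTracker {

    /**
     * Returns the font size in effect at each text-showing operator (Tj, TJ, ', ").
     * The input is a simplified content stream split on whitespace, so string
     * operands must not contain spaces, e.g. "(Hello)".
     */
    static List<Float> sizesAtTextShowing(String content) {
        Deque<Float> saved = new ArrayDeque<>();  // stack used by q (save) and Q (restore)
        List<Float> result = new ArrayList<>();
        float fontSize = 0f;                       // size set by the most recent Tf
        String[] tokens = content.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            switch (tokens[i]) {
                case "q":
                    saved.push(fontSize);
                    break;
                case "Q":
                    if (!saved.isEmpty()) fontSize = saved.pop();
                    break;
                case "Tf":
                    // the size is the operand directly before the operator
                    if (i > 0) fontSize = Float.parseFloat(tokens[i - 1]);
                    break;
                case "Tj": case "TJ": case "'": case "\"":
                    result.add(fontSize);
                    break;
                default:
                    break;                         // operands and all other operators
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "/F1 12 Tf (a) Tj (b) Tj q /F1 200 Tf (BIG) Tj Q (c) Tj";
        // prints [12.0, 12.0, 200.0, 12.0]
        System.out.println(sizesAtTextShowing(content));
    }
}
```

The second Tj still sees size 12 although no Tf precedes it directly, and the last Tj is back at 12 because Q restored the state saved by q.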
To edit a content stream with respect to the current state, therefore, you have to keep track of a lot of information. Fortunately, PDFBox contains classes to do the heavy lifting here, the class hierarchy based on the PDFStreamEngine, allowing you to concentrate on your task. To have as much information as possible available for editing, the PDFGraphicsStreamEngine class appears to be a good choice to build upon.
Thus, let's derive PdfContentStreamEditor from PDFGraphicsStreamEngine and add some code for generating a replacement content stream.
    // imports assume PDFBox 2.0.x
    import java.awt.geom.Point2D;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.List;

    import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
    import org.apache.pdfbox.contentstream.operator.Operator;
    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDStream;
    import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

    public class PdfContentStreamEditor extends PDFGraphicsStreamEngine {
        public PdfContentStreamEditor(PDDocument document, PDPage page) {
            super(page);
            this.document = document;
        }

        /**
         * <p>
         * This method retrieves the next operation before its registered
         * listener is called. The default does nothing.
         * </p>
         * <p>
         * Override this method to retrieve state information from before the
         * operation execution.
         * </p>
         */
        protected void nextOperation(Operator operator, List<COSBase> operands) {
        }

        /**
         * <p>
         * This method writes content stream operations to the target canvas. The default
         * implementation writes them as they come, so it essentially generates identical
         * copies of the original instructions {@link #processOperator(Operator, List)}
         * forwards to it.
         * </p>
         * <p>
         * Override this method to achieve some fancy editing effect.
         * </p>
         */
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
            contentStreamWriter.writeTokens(operands);
            contentStreamWriter.writeToken(operator);
        }

        // stub implementation of PDFGraphicsStreamEngine abstract methods
        @Override
        public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException { }

        @Override
        public void drawImage(PDImage pdImage) throws IOException { }

        @Override
        public void clip(int windingRule) throws IOException { }

        @Override
        public void moveTo(float x, float y) throws IOException { }

        @Override
        public void lineTo(float x, float y) throws IOException { }

        @Override
        public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException { }

        @Override
        public Point2D getCurrentPoint() throws IOException { return null; }

        @Override
        public void closePath() throws IOException { }

        @Override
        public void endPath() throws IOException { }

        @Override
        public void strokePath() throws IOException { }

        @Override
        public void fillPath(int windingRule) throws IOException { }

        @Override
        public void fillAndStrokePath(int windingRule) throws IOException { }

        @Override
        public void shadingFill(COSName shadingName) throws IOException { }

        // PDFStreamEngine overrides to allow editing
        @Override
        public void processPage(PDPage page) throws IOException {
            PDStream stream = new PDStream(document);
            replacement = new ContentStreamWriter(replacementStream = stream.createOutputStream(COSName.FLATE_DECODE));
            super.processPage(page);
            replacementStream.close();
            page.setContents(stream);
            replacement = null;
            replacementStream = null;
        }

        @Override
        public void showForm(PDFormXObject form) throws IOException {
            // DON'T descend into XObjects
            // super.showForm(form);
        }

        @Override
        protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
            nextOperation(operator, operands);
            super.processOperator(operator, operands);
            write(replacement, operator, operands);
        }

        final PDDocument document;
        OutputStream replacementStream = null;
        ContentStreamWriter replacement = null;
    }
(PdfContentStreamEditor class)
This code overrides processPage to create a new page content stream and eventually replace the old one with it. And it overrides processOperator to provide the processed instruction for editing.
For editing, one simply overrides write here. The existing implementation simply writes the instructions as they come, while you may change the instructions to write. Overriding nextOperation allows you to peek at the graphics state before the current instruction is applied to it.
Applying the editor as is,
    PDDocument document = PDDocument.load(SOURCE);
    for (PDPage page : document.getDocumentCatalog().getPages()) {
        PdfContentStreamEditor identity = new PdfContentStreamEditor(document, page);
        identity.processPage(page);
    }
    document.save(RESULT);
(EditPageContent test testIdentityInput)
therefore, will create a result PDF with equivalent content streams.
You want to "filter out all text from a PDF that is above a certain font size."
Thus, we have to check in write whether the current instruction is a text drawing instruction, and if it is, we have to check the current effective font size, i.e. the base font size transformed by the text matrix and the current transformation matrix. If the effective font size is too large, we have to drop the instruction.
This can be done as follows:
    PDDocument document = PDDocument.load(SOURCE);
    for (PDPage page : document.getDocumentCatalog().getPages()) {
        PdfContentStreamEditor identity = new PdfContentStreamEditor(document, page) {
            @Override
            protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
                String operatorString = operator.getName();
                if (TEXT_SHOWING_OPERATORS.contains(operatorString))
                {
                    float fs = getGraphicsState().getTextState().getFontSize();
                    Matrix matrix = getTextMatrix().multiply(getGraphicsState().getCurrentTransformationMatrix());
                    Point2D.Float transformedFsVector = matrix.transformPoint(0, fs);
                    Point2D.Float transformedOrigin = matrix.transformPoint(0, 0);
                    double transformedFs = transformedFsVector.distance(transformedOrigin);
                    if (transformedFs > 100)
                        return;
                }
                super.write(contentStreamWriter, operator, operands);
            }

            final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
        };
        identity.processPage(page);
    }
    document.save(RESULT);
(EditPageContent test testRemoveBigTextDocument)
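The effective font size calculation used above can also be tried out in isolation. The following is a sketch only, using java.awt.geom.AffineTransform in place of the PDFBox Matrix class. The class and method names are made up for this illustration, and the example matrices are arbitrary. It assumes the six PDF matrix operands a b c d e f can be passed to the AffineTransform constructor in the same order.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical helper: effective font size without PDFBox types. */
class EffectiveFontSize {

    /**
     * Length of the vector (0, fontSize) after applying first the text matrix
     * and then the current transformation matrix.
     */
    static double effectiveFontSize(double fontSize, AffineTransform textMatrix, AffineTransform ctm) {
        AffineTransform combined = new AffineTransform(ctm);
        combined.concatenate(textMatrix);  // points pass through textMatrix first, then ctm
        Point2D top = combined.transform(new Point2D.Double(0, fontSize), null);
        Point2D origin = combined.transform(new Point2D.Double(0, 0), null);
        return top.distance(origin);
    }

    public static void main(String[] args) {
        // "/F1 12 Tf" with "10 0 0 10 50 700 Tm" inside "0.5 0 0 0.5 0 0 cm"
        AffineTransform textMatrix = new AffineTransform(10, 0, 0, 10, 50, 700);
        AffineTransform ctm = new AffineTransform(0.5, 0, 0, 0.5, 0, 0);
        // 12 * 10 * 0.5 = 60, although the Tf operand is only 12
        System.out.println(effectiveFontSize(12, textMatrix, ctm));
    }
}
```

Measuring the distance between the two transformed points, rather than reading a single matrix entry, keeps the result meaningful for rotated text as well.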
Strictly speaking, completely dropping the instruction in question may not suffice; instead, one would have to replace it with an instruction that changes the text matrix just as the dropped text drawing instruction would have done. Otherwise, the following text that is not dropped may be moved. Often, though, this works as is because the text matrix is set anew for the following, different text. So let's keep it simple here.
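As a partial illustration of such a replacement, consider only the ' and " operators. The PDF specification defines string ' as T* followed by string Tj, and aw ac string " as aw Tw, ac Tc, then string '. So the line move and the spacing changes can be kept while the text itself is dropped. The sketch below works on plain string tokens rather than PDFBox objects, and its class and method names are hypothetical. It deliberately does not reproduce the horizontal advance caused by the glyph widths, which would require font metrics.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: what to write instead of a dropped text-showing operator. */
class DroppedTextReplacement {

    /**
     * Returns the tokens that keep the non-drawing side effects of the operator.
     * The glyph advance of the dropped text is NOT compensated here.
     */
    static List<String> replacementFor(String operator, List<String> operands) {
        switch (operator) {
            case "'":
                // string '  ==  T* string Tj   -> keep the move to the next line
                return Collections.singletonList("T*");
            case "\"":
                // aw ac string "  ==  aw Tw ac Tc string '   -> keep spacing and line move
                return Arrays.asList(operands.get(0), "Tw", operands.get(1), "Tc", "T*");
            default:
                // Tj and TJ have no side effect besides drawing and advancing
                return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        // prints [2, Tw, 1, Tc, T*]
        System.out.println(replacementFor("\"", Arrays.asList("2", "1", "(text)")));
    }
}
```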
This PdfContentStreamEditor only edits the page content stream. From there, XObjects and Patterns may be used, which are currently not edited by the editor. It should be easy, though, after editing the page content stream, to recursively iterate over the XObjects and Patterns and edit them in a similar fashion.
This PdfContentStreamEditor essentially is a port of the PdfContentStreamEditor for iText 5 (.Net/Java) from this answer and the PdfCanvasEditor for iText 7 from this answer. The examples for using those editor classes may give some hints on how to use this PdfContentStreamEditor for PDFBox.
A similar (but less generic) approach has been used previously in the HelloSignManipulator class in this answer.
In the context of this question a bug in the PdfContentStreamEditor was found which caused some text lines in the example PDF in focus there to be moved.
The background: Some PDF instructions are defined via other ones, e.g. tx ty TD is specified to have the same effect as -ty TL tx ty Td. The corresponding PDFBox OperatorProcessor implementations for simplicity work by feeding the equivalent instructions back into the stream engine.
The PdfContentStreamEditor as implemented above in such a case receives signals for both the replacement instructions and the original instructions and writes them all back into the result stream. Thus, the effect of those instructions is doubled. E.g. in case of the TD instruction the text insertion point is forwarded two lines instead of one...
Thus, we have to ignore the replacement instructions. For this, replace the method processOperator above by
    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        if (inOperator) {
            super.processOperator(operator, operands);
        } else {
            inOperator = true;
            nextOperation(operator, operands);
            super.processOperator(operator, operands);
            write(replacement, operator, operands);
            inOperator = false;
        }
    }

    boolean inOperator = false;
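The effect of this guard can be reproduced without PDFBox. The following toy model is an illustration only. Its class ReentrancyGuardDemo and its string operators are made up, and its execute method merely imitates an operator processor that handles TD by feeding TL and Td back into processOperator.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of an engine whose TD handler re-enters processOperator. */
class ReentrancyGuardDemo {
    final List<String> written = new ArrayList<>();  // stands in for the result stream
    final boolean guarded;
    boolean inOperator = false;

    ReentrancyGuardDemo(boolean guarded) {
        this.guarded = guarded;
    }

    /** Imitates the operator processors: TD is executed as TL followed by Td. */
    void execute(String operator) {
        if (operator.equals("TD")) {
            processOperator("TL");
            processOperator("Td");
        }
    }

    void processOperator(String operator) {
        if (guarded && inOperator) {
            execute(operator);       // nested call: update state, write nothing
            return;
        }
        inOperator = true;
        execute(operator);
        written.add(operator);       // only top-level instructions reach the output
        inOperator = false;
    }

    static List<String> run(boolean guarded, String... operators) {
        ReentrancyGuardDemo demo = new ReentrancyGuardDemo(guarded);
        for (String operator : operators) {
            demo.processOperator(operator);
        }
        return demo.written;
    }

    public static void main(String[] args) {
        // prints [BT, TL, Td, TD, Tj, ET]: the line move is written twice
        System.out.println(run(false, "BT", "TD", "Tj", "ET"));
        // prints [BT, TD, Tj, ET]
        System.out.println(run(true, "BT", "TD", "Tj", "ET"));
    }
}
```

In both variants the nested instructions are still executed, so the tracked state stays correct. The guard only keeps them out of the written stream.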