I want to extract the whole image on each page of a PDF document using PDFBox in Java, but all extracted images come out inverted and split. Note that this is not a bug in PDFBox or poppler; it is caused by the way the PDF document itself is built. So how can I piece together the whole images and get the right orientation for each of them? Could anybody give me some advice? A snippet of Java code is preferred. My PDF link: download
At first glance it looked like each of the figures in question was drawn in a separate block of content stream instructions enveloped by, but not containing, text objects. Thus, one approach to isolating them is to export each such block of instructions to a new page of its own. You can then post-process these new pages, e.g. by rendering them as bitmap images using a PDFRenderer.
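To make the idea concrete, here is a minimal, library-free sketch of the segmentation step. It models a content stream as a plain list of operator names (a simplification, since real streams also carry operands) and cuts it into the blocks that lie outside `BT` ... `ET` text objects. The names `BlockSplitter` and `splitOutsideText` are made up for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockSplitter {
    /** Returns the runs of operators that lie outside BT ... ET text objects. */
    public static List<List<String>> splitOutsideText(List<String> operators) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean inText = false;
        for (String op : operators) {
            if (op.equals("BT")) {
                // A text object starts: close the block collected so far.
                if (!current.isEmpty()) {
                    blocks.add(current);
                    current = new ArrayList<>();
                }
                inText = true;
            } else if (op.equals("ET")) {
                // The text object ends: the next block may start.
                inText = false;
            } else if (!inText) {
                current.add(op);
            }
        }
        if (!current.isEmpty()) {
            blocks.add(current);
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> ops = Arrays.asList(
                "BT", "Tj", "ET", "q", "cm", "Do", "Q", "BT", "Tj", "ET", "re", "f");
        // prints [[q, cm, Do, Q], [re, f]]
        System.out.println(splitOutsideText(ops));
    }
}
```

Each resulting block is a candidate figure. In a real document every block would be written to its own page together with the graphics state set up before it.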
I based the PDFBox code for this approach on the PdfContentStreamEditor (originally from this answer), like this:
PDDocument document = PDDocument.load(...);
for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
        // Top-level graphics state instructions seen so far; every figure page starts with them.
        ByteArrayOutputStream commonRaw = null;
        ContentStreamWriter commonWriter = null;
        // Current nesting depth of q ... Q (save/restore graphics state) pairs.
        int depth = 0;

        @Override
        public void processPage(PDPage page) throws IOException {
            commonRaw = new ByteArrayOutputStream();
            try {
                commonWriter = new ContentStreamWriter(commonRaw);
                startFigurePage(page);
                super.processPage(page);
            } finally {
                endFigurePage();
                commonRaw.close();
            }
        }

        @Override
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator,
                List<COSBase> operands) throws IOException {
            String operatorString = operator.getName();
            // A text object begins: the current figure block ends here.
            if (operatorString.equals("BT")) {
                endFigurePage();
            }
            if (operatorString.equals("q")) {
                depth++;
            }
            writeFigure(operator, operands);
            if (operatorString.equals("Q")) {
                depth--;
            }
            // A text object ends: start collecting the next figure block.
            if (operatorString.equals("ET")) {
                startFigurePage(getCurrentPage());
            }
            // Hand the instruction on to the default handling unchanged.
            super.write(contentStreamWriter, operator, operands);
        }

        OutputStream figureRaw = null;
        ContentStreamWriter figureWriter = null;
        PDPage figurePage = null;
        int xobjectsDrawn = 0;
        int pathsPainted = 0;

        // Creates a new candidate page that shares the resources of the current page
        // and begins with the common instructions collected so far.
        void startFigurePage(PDPage currentPage) throws IOException {
            figurePage = new PDPage(currentPage.getMediaBox());
            figurePage.setResources(currentPage.getResources());
            PDStream stream = new PDStream(document);
            figurePage.setContents(stream);
            figureWriter = new ContentStreamWriter(figureRaw = stream.createOutputStream(COSName.FLATE_DECODE));
            figureRaw.write(commonRaw.toByteArray());
            xobjectsDrawn = 0;
            pathsPainted = 0;
        }

        // Finishes the candidate page and keeps it only if it looks like a figure.
        void endFigurePage() throws IOException {
            if (figureWriter != null) {
                figureWriter = null;
                figureRaw.close();
                figureRaw = null;
                if (xobjectsDrawn > 0 || pathsPainted > 3)
                    document.addPage(figurePage);
                figurePage = null;
            }
        }

        final List<String> PATH_PAINTING_OPERATORS = Arrays.asList("S", "s", "F", "f", "f*",
                "B", "B*", "b", "b*");

        void writeFigure(Operator operator, List<COSBase> operands) throws IOException {
            if (figureWriter != null) {
                String operatorString = operator.getName();
                boolean isXObjectDo = operatorString.equals("Do");
                boolean isPathPainting = PATH_PAINTING_OPERATORS.contains(operatorString);
                if (isXObjectDo)
                    xobjectsDrawn++;
                if (isPathPainting)
                    pathsPainted++;
                figureWriter.writeTokens(operands);
                figureWriter.writeToken(operator);
                // Top-level instructions also go to the common stream, but without
                // drawing anything: XObjects are skipped and paths are ended with "n".
                if (depth == 0) {
                    if (!isXObjectDo) {
                        if (isPathPainting)
                            operator = Operator.getOperator("n");
                        commonWriter.writeTokens(operands);
                        commonWriter.writeToken(operator);
                    }
                }
            }
        }
    };
    editor.processPage(page);
}
document.save(new File(RESULT_FOLDER, "my-isolatedFigures.pdf"));
(IsolateFigures test testIsolateInMy)
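Two details of the code above are easy to miss: which instructions end up in the common prefix replayed on every figure page, and which candidate pages are kept. The following library-free sketch models both on plain lists of operator names. It assumes the input contains no text objects, and the names `FigureHeuristics`, `commonInstructions` and `looksLikeFigure` are hypothetical, chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FigureHeuristics {
    private static final List<String> PATH_PAINTING_OPERATORS =
            Arrays.asList("S", "s", "F", "f", "f*", "B", "B*", "b", "b*");

    /** Top-level instructions to replay at the start of every figure page. */
    public static List<String> commonInstructions(List<String> operators) {
        List<String> common = new ArrayList<>();
        int depth = 0;
        for (String op : operators) {
            if (op.equals("q")) {
                depth++; // counted before the check, so "q" itself is skipped
            }
            if (depth == 0 && !op.equals("Do")) {
                // Paths are ended with "n" so that nothing is painted again.
                common.add(PATH_PAINTING_OPERATORS.contains(op) ? "n" : op);
            }
            if (op.equals("Q")) {
                depth--; // counted after the check, so "Q" itself is skipped
            }
        }
        return common;
    }

    /** Keep a block if it draws an XObject or paints more than three paths. */
    public static boolean looksLikeFigure(List<String> block) {
        int xobjectsDrawn = 0;
        int pathsPainted = 0;
        for (String op : block) {
            if (op.equals("Do")) {
                xobjectsDrawn++;
            }
            if (PATH_PAINTING_OPERATORS.contains(op)) {
                pathsPainted++;
            }
        }
        return xobjectsDrawn > 0 || pathsPainted > 3;
    }

    public static void main(String[] args) {
        List<String> ops = Arrays.asList("cm", "q", "rg", "re", "f", "Q", "re", "f", "Do", "w");
        // prints [cm, re, n, w]
        System.out.println(commonInstructions(ops));
        // prints true, because the block draws an XObject
        System.out.println(looksLikeFigure(Arrays.asList("q", "cm", "Do", "Q")));
    }
}
```

Everything between `q` and `Q` is restored again anyway, so only changes made at depth 0 need to be carried over to later figure pages. The threshold of three painted paths is a heuristic, presumably there to discard blocks that only draw a few rules or boxes.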
The first figures are extracted quite fine:
Certain figures, though, turn out to contain text objects themselves and are therefore split into partial images and lose their text content: