Black shapes are text that need to be extracted:
So far, i've extracted the text from columns, but manually, because there are only 5 (using the Rectangle class for the regions). My question is: is there a way to do so for rows since the size (height) of the Rectangles are different and manually doing it to 50+ rows would be an atrocity? More specific, can i change the rectangle according to every row's height using a function? Or any suggestion that may help?
As proposed in comments, you can automatically recognize the table cell regions of your example PDF by parsing the vector graphics instructions of the page.
For such a task you can extend the PDFBox PDFGraphicsStreamEngine
which provides abstract methods called for path building and drawing instructions.
Beware: The stream engine class I show here is specialized on recognizing table cell frame lines drawn as long, small rectangles filled with black as used in your example document. For a general solution you should at least also recognize frame lines drawn as vector graphics line segments or as stroked rectangles.
PdfBoxFinder
This stream engine class collects the y coordinate ranges of horizontal lines and the x coordinate ranges of vertical lines and afterwards provides the boxes of the grid defined by these coordinate ranges. In particular this means that row spans or column spans are not supported; in the case at hand this is ok as there are no such spans.
public class PdfBoxFinder extends PDFGraphicsStreamEngine {
/**
* Supply the page to analyze here; to analyze multiple pages
* create multiple {@link PdfBoxFinder} instances.
*/
public PdfBoxFinder(PDPage page) {
super(page);
}
/**
* The boxes ({@link Rectangle2D} instances with coordinates according to
* the PDF coordinate system, e.g. for decorating the table cells) the
* {@link PdfBoxFinder} has recognized on the current page.
*/
public Map<String, Rectangle2D> getBoxes() {
consolidateLists();
Map<String, Rectangle2D> result = new HashMap<>();
if (!horizontalLines.isEmpty() && !verticalLines.isEmpty())
{
Interval top = horizontalLines.get(horizontalLines.size() - 1);
char rowLetter = 'A';
for (int i = horizontalLines.size() - 2; i >= 0; i--, rowLetter++) {
Interval bottom = horizontalLines.get(i);
Interval left = verticalLines.get(0);
int column = 1;
for (int j = 1; j < verticalLines.size(); j++, column++) {
Interval right = verticalLines.get(j);
String name = String.format("%s%s", rowLetter, column);
Rectangle2D rectangle = new Rectangle2D.Float(left.from, bottom.from, right.to - left.from, top.to - bottom.from);
result.put(name, rectangle);
left = right;
}
top = bottom;
}
}
return result;
}
/**
* The regions ({@link Rectangle2D} instances with coordinates according
* to the PDFBox text extraction API, e.g. for initializing the regions of
* a {@link PDFTextStripperByArea}) the {@link PdfBoxFinder} has recognized
* on the current page.
*/
public Map<String, Rectangle2D> getRegions() {
PDRectangle cropBox = getPage().getCropBox();
float xOffset = cropBox.getLowerLeftX();
float yOffset = cropBox.getUpperRightY();
Map<String, Rectangle2D> result = getBoxes();
for (Map.Entry<String, Rectangle2D> entry : result.entrySet()) {
Rectangle2D box = entry.getValue();
Rectangle2D region = new Rectangle2D.Float(xOffset + (float)box.getX(), yOffset - (float)(box.getY() + box.getHeight()), (float)box.getWidth(), (float)box.getHeight());
entry.setValue(region);
}
return result;
}
/**
* <p>
* Processes the path elements currently in the {@link #path} list and
* eventually clears the list.
* </p>
* <p>
* Currently only elements are considered which
* </p>
* <ul>
* <li>are {@link Rectangle} instances;
* <li>are filled fairly black;
* <li>have a thin and long form; and
* <li>have sides fairly parallel to the coordinate axis.
* </ul>
*/
void processPath() throws IOException {
PDColor color = getGraphicsState().getNonStrokingColor();
if (!isBlack(color)) {
logger.debug("Dropped path due to non-black fill-color.");
return;
}
for (PathElement pathElement : path) {
if (pathElement instanceof Rectangle) {
Rectangle rectangle = (Rectangle) pathElement;
double p0p1 = rectangle.p0.distance(rectangle.p1);
double p1p2 = rectangle.p1.distance(rectangle.p2);
boolean p0p1small = p0p1 < 3;
boolean p1p2small = p1p2 < 3;
if (p0p1small) {
if (p1p2small) {
logger.debug("Dropped rectangle too small on both sides.");
} else {
processThinRectangle(rectangle.p0, rectangle.p1, rectangle.p2, rectangle.p3);
}
} else if (p1p2small) {
processThinRectangle(rectangle.p1, rectangle.p2, rectangle.p3, rectangle.p0);
} else {
logger.debug("Dropped rectangle too large on both sides.");
}
}
}
path.clear();
}
/**
* The argument points shall be sorted to have (p0, p1) and (p2, p3) be the small
* edges and (p1, p2) and (p3, p0) the long ones.
*/
void processThinRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
float longXDiff = (float)Math.abs(p2.getX() - p1.getX());
float longYDiff = (float)Math.abs(p2.getY() - p1.getY());
boolean longXDiffSmall = longXDiff * 10 < longYDiff;
boolean longYDiffSmall = longYDiff * 10 < longXDiff;
if (longXDiffSmall) {
verticalLines.add(new Interval(p0.getX(), p1.getX(), p2.getX(), p3.getX()));
} else if (longYDiffSmall) {
horizontalLines.add(new Interval(p0.getY(), p1.getY(), p2.getY(), p3.getY()));
} else {
logger.debug("Dropped rectangle too askew.");
}
}
/**
* Sorts the {@link #horizontalLines} and {@link #verticalLines} lists and
* merges fairly identical entries.
*/
void consolidateLists() {
for (List<Interval> intervals : Arrays.asList(horizontalLines, verticalLines)) {
intervals.sort(null);
for (int i = 1; i < intervals.size();) {
if (intervals.get(i-1).combinableWith(intervals.get(i))) {
Interval interval = intervals.get(i-1).combineWith(intervals.get(i));
intervals.set(i-1, interval);
intervals.remove(i);
} else {
i++;
}
}
}
}
/**
* Checks whether the given color is black'ish.
*/
boolean isBlack(PDColor color) throws IOException {
int value = color.toRGB();
for (int i = 0; i < 2; i++) {
int component = value & 0xff;
if (component > 5)
return false;
value /= 256;
}
return true;
}
//
// PDFGraphicsStreamEngine overrides
//
@Override
public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
path.add(new Rectangle(p0, p1, p2, p3));
}
@Override
public void endPath() throws IOException {
path.clear();
}
@Override
public void strokePath() throws IOException {
path.clear();
}
@Override
public void fillPath(int windingRule) throws IOException {
processPath();
}
@Override
public void fillAndStrokePath(int windingRule) throws IOException {
processPath();
}
@Override public void drawImage(PDImage pdImage) throws IOException { }
@Override public void clip(int windingRule) throws IOException { }
@Override public void moveTo(float x, float y) throws IOException { }
@Override public void lineTo(float x, float y) throws IOException { }
@Override public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException { }
@Override public Point2D getCurrentPoint() throws IOException { return null; }
@Override public void closePath() throws IOException { }
@Override public void shadingFill(COSName shadingName) throws IOException { }
//
// inner classes
//
class Interval implements Comparable<Interval> {
final float from;
final float to;
Interval(float... values) {
Arrays.sort(values);
this.from = values[0];
this.to = values[values.length - 1];
}
Interval(double... values) {
Arrays.sort(values);
this.from = (float) values[0];
this.to = (float) values[values.length - 1];
}
boolean combinableWith(Interval other) {
if (this.from > other.from)
return other.combinableWith(this);
if (this.to < other.from)
return false;
float intersectionLength = Math.min(this.to, other.to) - other.from;
float thisLength = this.to - this.from;
float otherLength = other.to - other.from;
return (intersectionLength >= thisLength * .9f) || (intersectionLength >= otherLength * .9f);
}
Interval combineWith(Interval other) {
return new Interval(this.from, this.to, other.from, other.to);
}
@Override
public int compareTo(Interval o) {
return this.from == o.from ? Float.compare(this.to, o.to) : Float.compare(this.from, o.from);
}
@Override
public String toString() {
return String.format("[%3.2f, %3.2f]", from, to);
}
}
interface PathElement {
}
class Rectangle implements PathElement {
final Point2D p0, p1, p2, p3;
Rectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
this.p0 = p0;
this.p1 = p1;
this.p2 = p2;
this.p3 = p3;
}
}
//
// members
//
final List<PathElement> path = new ArrayList<>();
final List<Interval> horizontalLines = new ArrayList<>();
final List<Interval> verticalLines = new ArrayList<>();
final Logger logger = LoggerFactory.getLogger(PdfBoxFinder.class);
}
You can use the PdfBoxFinder
like this to extract text from the table cells of the sample document located at FILE_PATH
:
try ( PDDocument document = PDDocument.load(FILE_PATH) ) {
for (PDPage page : document.getDocumentCatalog().getPages()) {
PdfBoxFinder boxFinder = new PdfBoxFinder(page);
boxFinder.processPage(page);
PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();
for (Map.Entry<String, Rectangle2D> entry : boxFinder.getRegions().entrySet()) {
stripperByArea.addRegion(entry.getKey(), entry.getValue());
}
stripperByArea.extractRegions(page);
List<String> names = stripperByArea.getRegions();
names.sort(null);
for (String name : names) {
System.out.printf("[%s] %s\n", name, stripperByArea.getTextForRegion(name));
}
}
}
(ExtractBoxedText test testExtractBoxedTexts
)
The start of the output:
[A1] Nr.
crt.
[A2] Nume şi prenume
[A3] Titlul lucrării
[A4] Coordonator ştiinţific
[A5] Ora
[B1] 1.
[B2] SFETCU I. JESSICA-
LARISA
[B3] Analiza fluxurilor de date twitter
[B4] Conf. univ. dr. Frîncu Marc
Eduard
[B5] 8:00
[C1] 2.
[C2] TARBA V. IONUȚ-
ADRIAN
[C3] Test me - rest api folosind java şi
play framework
[C4] Conf.univ.dr. Fortiş Teodor
Florin
[C5] 8:12
The first page of the document: