I'm trying to parse http://laws-lois.justice.gc.ca/PDF/A-8.8.pdf, which has a 2-column layout, using PDFBox. I want the text of each column to be extracted separately, but when I run the file through PDFBox it does not separate the 2 columns; instead it concatenates lines from both columns together.
I've read https://issues.apache.org/jira/browse/PDFBOX-448, which says that some PDFs don't have articles/beads that can be used, and so the parsing will always be wrong. I have tried using stripper.setShouldSeparateByBeads(true).
How can I check whether this PDF has beads or not? I haven't found any reading material on this concept apart from questions about PDFBox's column parsing.
You can check whether a page has beads with PDPage.getThreadBeads(). This will return an empty list if there are no thread beads.
Spoiler alert: your document doesn't have any.
An example of how to use them can be found in the DrawPrintTextLocations.java example in the source code download. Examples of PDF files with beads can be found in the files PDFBOX-3110-003422-p1-beads.pdf and PDFBOX-3110-poems-beads.pdf, also in the source code download.
Bonus tip: have a look at the ExtractTextByArea.java example; it should help you extract the text from your PDF file.
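As a rough illustration of the area-based approach, here is a minimal sketch that only computes one rectangle per column using the JDK, with no PDFBox dependency. The class ColumnRegions and its methods are hypothetical names, and splitting the page exactly down the middle is an assumption; for a real document you would probably need to adjust the boundaries for margins, headers and the actual gutter position.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper (not part of PDFBox): splits a page into two column areas. */
final class ColumnRegions {
    private ColumnRegions() {}

    // Left half of the page. Assumes the gutter sits exactly in the middle.
    static Rectangle2D leftColumn(double pageWidth, double pageHeight) {
        return new Rectangle2D.Double(0, 0, pageWidth / 2, pageHeight);
    }

    // Right half of the page, starting at the assumed middle gutter.
    static Rectangle2D rightColumn(double pageWidth, double pageHeight) {
        return new Rectangle2D.Double(pageWidth / 2, 0, pageWidth / 2, pageHeight);
    }
}
```

With PDFBox, the page size could be taken from page.getMediaBox(), each rectangle registered with PDFTextStripperByArea.addRegion("left", ...) and addRegion("right", ...), and the text read back with getTextForRegion(...) after calling extractRegions(page).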