swift, parsing, pdf, quartz-core

How can I get all text from a PDF in Swift?


I have a PDF document and would like to extract all its text. I tried the following:

import Quartz

// Load test.pdf from the app bundle and print all of its text.
if let url = Bundle.main.url(forResource: "test", withExtension: "pdf"),
   let pdf = PDFDocument(url: url) {
    print(pdf.string ?? "")
}

It does get the text; however, the order of the extracted lines is completely mixed up compared to opening the PDF in Adobe and using Edit > Select All, Copy, Paste!

How can I get the same outcome in Swift as opening the PDF and doing Select All, Copy/Paste?


Solution

  • Unfortunately, that is not possible.
    At least not without some major work on your part, and certainly not in a general manner for all PDFs.

    PDFs are (generally) a one-way street.
    They were created to display text identically on every system, and so that printers can print a document without having to know about its fonts and the like.

    Extracting text is non-trivial and only possible for PDFs in which the basic page image is accompanied by text (which it does not have to be). All text information present in the PDF is coupled with location information that determines where it is to be shown.

    If the PDF shows a table where the left column contains the names of the entries and the right column contains their contents, the two columns can be represented as completely separate blocks of text that only appear to be linked because they are placed next to each other.

    What the framework or your code would have to do is determine which parts of the text that are visually linked are also logically linked and belong together. That is not (yet) possible in general. The reason you and I can read, understand and group the contents of a PDF is that in some areas our brains are still far better than computers.

    Final note, because it might cause confusion: it is certainly possible that Adobe, and Apple as well, already do some of this grouping and achieve a good result, but it is still not perfect. The PDF I just tested was pretty mangled after extracting the text via Preview on the Mac.
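
To make the table example concrete, here is a minimal sketch of the kind of position-based grouping a tool would have to attempt. It assumes you already have each piece of text together with its page coordinates (on Apple platforms, PDFKit's PDFSelection.selectionsByLine() and bounds(for:) might be one way to obtain them). TextFragment, linesInReadingOrder and lineTolerance are hypothetical names made up for this illustration, not part of any framework, and the heuristic only handles simple single-column layouts.

```swift
import Foundation

/// A piece of text plus the position where it is drawn on the page.
/// Hypothetical type for illustration; PDF coordinates start at the bottom-left.
struct TextFragment {
    let text: String
    let x: Double
    let y: Double
}

/// Groups fragments into visual lines (top to bottom) and joins each line
/// left to right. `lineTolerance` is an assumed vertical slack, in points,
/// within which two fragments count as being on the same line.
func linesInReadingOrder(_ fragments: [TextFragment],
                         lineTolerance: Double = 2.0) -> [String] {
    // Highest fragments first, because y grows upwards in PDF space.
    let topToBottom = fragments.sorted { $0.y > $1.y }

    var lines: [[TextFragment]] = []
    for fragment in topToBottom {
        if let anchor = lines.last?.first,
           abs(anchor.y - fragment.y) <= lineTolerance {
            // Close enough vertically: same visual line as the previous one.
            lines[lines.count - 1].append(fragment)
        } else {
            lines.append([fragment])
        }
    }

    // Within a line, order the fragments from left to right.
    return lines.map { line in
        line.sorted { $0.x < $1.x }
            .map { $0.text }
            .joined(separator: " ")
    }
}

// The two table columns are stored as separate blocks, as described above.
let fragments = [
    TextFragment(text: "Name", x: 50, y: 700),
    TextFragment(text: "Size", x: 50, y: 680),
    TextFragment(text: "report.pdf", x: 200, y: 700),
    TextFragment(text: "2 MB", x: 200, y: 680),
]
let lines = linesInReadingOrder(fragments)
lines.forEach { print($0) }
// prints "Name report.pdf" and then "Size 2 MB"
```

Sorting by position recovers the rows here only because the layout is trivial. With multiple columns of running text, rotated text or footnotes, the same heuristic would interleave unrelated content, which is exactly the visual-versus-logical grouping problem described in the answer.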