Extracting PDF text using PDFKit and Vision OCR


I have been trying to get this right for a long time now. I have single-page PDFs that are passed in through a .fileImporter as a URL. These PDFs have a very simple structure, with typed text in ordered tables:

(image of a sample PDF page)

I need to extract all the text, but mainly what is in the tables, in a structured order. Many sites suggest doing something like this to get PDF text:

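    import PDFKit

    func extractText(from url: URL, appSettings: SettingsStorage) {
        // Load the document and take its first (only) page.
        guard let document = PDFDocument(url: url),
              let page = document.page(at: 0) else {
            print("Fail")
            return
        }
        // page.string returns the page's text as one flat string.
        if let structuredText = page.string {
            print(structuredText)
        }
    }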

Yes, this extracts the text, but it is by no means structured. And yes I know PDFs don't have a "structure".

I have tried many variations of the above, such as using .attributedString.

The only way I can think to actually extract the text with structure is to use Vision OCR. This will of course require converting the PDF to an image. This is what I am doing in regard to that:

    import PDFKit
    import UIKit

    func convertPDFToImage(url: URL) -> UIImage? {
        guard let pdfDocument = PDFDocument(url: url),
              let pdfPage = pdfDocument.page(at: 0) else {
            return nil
        }
        let pdfPageSize = pdfPage.bounds(for: .mediaBox)
        let renderer = UIGraphicsImageRenderer(size: pdfPageSize.size)

        return renderer.image { ctx in
            // Fill the background before drawing the page.
            UIColor.lightText.set()
            ctx.fill(pdfPageSize)
            // Flip the context: PDF page space has its origin at the bottom left.
            ctx.cgContext.translateBy(x: 0.0, y: pdfPageSize.size.height)
            ctx.cgContext.scaleBy(x: 1.0, y: -1.0)

            pdfPage.draw(with: .mediaBox, to: ctx.cgContext)
        }
    }

This gives me an imperfect image. I have tried scaling to improve the quality, but that doesn't fix it. Maybe this approach is outdated?

I then try extracting the text like so:

    import Vision

    func getConvertPDFAndGetText(url: URL) {
        guard let image = convertPDFToImage(url: url),
              let cgImage = image.cgImage else { return }
        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
        // The completion handler receives the recognized text observations.
        let request = VNRecognizeTextRequest { request, error in
            guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
            // Keep only the best candidate string for each observation.
            let topCandidates: [String] = observations.map { observation in
                observation.topCandidates(1).first?.string ?? ""
            }
            print(topCandidates.joined(separator: " "))
        }
        do {
            try handler.perform([request])
        } catch {
            print("Text recognition failed: \(error)")
        }
    }

I don't know if my conversion to an image is wrong or my text recognition is wrong. Can someone please help me here? I have spent countless hours trying to get this to work well enough to put into production.


Solution

  • This task can be accomplished, but it will take some serious work to create algorithms that reconstruct the original document structure. Unlike someone trying to process a random PDF, you have complete knowledge of the layout:

    - the header containing the fields connected by "|" and the right-hand text containing the date

    - the headers for Inbound and Outbound (these fields have titles that do not change)

    - the final header for "Crew on My Pairing"

    Digging into PDFPage, you can find APIs that will be of assistance:

    - characterBoundsAtIndex (characterBounds(at:) in Swift) - returns the page coordinates of a character

    - characterIndexAtPoint (characterIndex(at:) in Swift) - may be needed

    - selectionForRect (selection(for:) in Swift) - see "Annotations" below

    - selectionForWordAtPoint (selectionForWord(at:) in Swift) - may be needed

    - attributedString - seems worthless until you deconstruct it - see "AttributedText" below
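
As a rough sketch of how these calls fit together (not a tested solution for your PDFs), the following collects each line of text on a page together with its bounding box, then puts the lines in reading order. TextFragment, lineFragments(on:), characterBoxes(on:) and readingOrder(_:) are names made up for this example; only the PDFPage and PDFSelection calls come from PDFKit. The PDFKit part is wrapped in #if canImport(PDFKit) so that the sorting helper can be compiled and tried on its own.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit
#endif

/// A piece of text together with its bounding box in page space.
/// (Hypothetical helper type for this sketch, not part of PDFKit.)
struct TextFragment {
    let text: String
    let bounds: CGRect
}

#if canImport(PDFKit)
/// Splits the page text into lines and records where each line sits on the page.
func lineFragments(on page: PDFPage) -> [TextFragment] {
    let pageRect = page.bounds(for: .mediaBox)
    guard let everything = page.selection(for: pageRect) else { return [] }
    return everything.selectionsByLine().compactMap { line -> TextFragment? in
        guard let text = line.string, !text.isEmpty else { return nil }
        return TextFragment(text: text, bounds: line.bounds(for: page))
    }
}

/// Bounding box of every character on the page, in page coordinates.
func characterBoxes(on page: PDFPage) -> [CGRect] {
    (0..<page.numberOfCharacters).map { page.characterBounds(at: $0) }
}
#endif

/// Sorts fragments top to bottom, then left to right.
/// Page space has its origin at the bottom left, so a larger y is higher on the page.
func readingOrder(_ fragments: [TextFragment]) -> [TextFragment] {
    fragments.sorted { a, b in
        if a.bounds.midY != b.bounds.midY {
            return a.bounds.midY > b.bounds.midY
        }
        return a.bounds.minX < b.bounds.minX
    }
}
```

Whether selectionsByLine() splits a table row into one selection or several depends on how the PDF was produced, so it is worth printing the fragments for one of your sample files before building on this.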

    I can suggest two approaches to getting the tabular data:

    1. AttributedText

    See Getting Attributes for a Range of Text; there are probably other APIs that are also of use. With this method, you would iterate through all the parts of the string, which are most likely one per logical string (that is, one cell in the table). Using the known range of the text, you can get its bounding box on the page.

    First, find the various headers whose names you know (from looking at the PDF). The tabular data you want to retrieve is located under each header title. You would then iterate until there is no more text where the next row should be. Once you have a second row, you will know the cadence (the vertical distance from the first line to the second).

    2. Annotations

    Apple supports annotations, i.e. fillable text fields that are added to a PDF (aka widgets). I am familiar with a paid program, UPDF, that supports adding these, and I have read of a free website that does the same thing.

    For this approach you can add an Annotation that fits over each header field. You would do this on a "reference" PDF - that is, any sample that you have. Since PDFKit lets you get a list of all these annotations on a page, along with their bounding box, you can easily find the first header row. For the second and third headers, you will at least know their horizontal bounding box, and will have to use some PDFPage API to find their vertical location (since you know their text, it should be fairly easy to do this).
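
A minimal sketch of the annotation idea, assuming the reference PDF carries one named widget per header field and that data cells are the same size as the header above them. HeaderRegion, headerRegions(onReferencePage:), text(in:on:) and cellRect(below:row:rowHeight:) are hypothetical names for this example, and the row height is something you would measure from a sample. The PDFKit part is wrapped in #if canImport(PDFKit) so the geometry helper compiles on its own.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit
#endif

/// A named region taken from an annotation on the reference PDF.
/// (Hypothetical helper type for this sketch, not part of PDFKit.)
struct HeaderRegion {
    let name: String
    let bounds: CGRect
}

#if canImport(PDFKit)
/// Reads the named annotations on the reference page as regions.
/// Annotations without a field name are skipped.
func headerRegions(onReferencePage page: PDFPage) -> [HeaderRegion] {
    page.annotations.compactMap { annotation -> HeaderRegion? in
        guard let name = annotation.fieldName else { return nil }
        return HeaderRegion(name: name, bounds: annotation.bounds)
    }
}

/// Text that a page draws inside the given rectangle, if any.
func text(in rect: CGRect, on page: PDFPage) -> String? {
    page.selection(for: rect)?.string
}
#endif

/// Rectangle of the cell `row` rows below a header (row 1 is directly below).
/// Page space has its origin at the bottom left, so moving down the page lowers y.
/// Assumes every row has the same height and each cell matches the header's size.
func cellRect(below header: CGRect, row: Int, rowHeight: CGFloat) -> CGRect {
    CGRect(x: header.origin.x,
           y: header.origin.y - CGFloat(row) * rowHeight,
           width: header.size.width,
           height: header.size.height)
}
```

You would call headerRegions(onReferencePage:) once on the annotated sample, then call text(in:on:) on each incoming page with cellRect(below:row:rowHeight:) for row 1, 2, 3 and so on until it returns nil or an empty string.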

    ----

    The bottom line is that knowing the headers, you have sufficient information to detect the various content locations.
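
To make that concrete, here is a small sketch in plain Swift of turning positioned text into table rows once the header positions are known. It assumes you have already collected text fragments with their page-space bounding boxes and the horizontal extent of each column. Fragment, Column, rows(from:tolerance:) and table(from:columns:tolerance:) are invented for this example, and the tolerance of 3 points is a guess you would tune against your PDFs.

```swift
import Foundation

/// A piece of text with its bounding box in page space (hypothetical type).
struct Fragment {
    let text: String
    let bounds: CGRect
}

/// Horizontal extent of one column, taken from its header (hypothetical type).
struct Column {
    let title: String
    let minX: CGFloat
    let maxX: CGFloat
}

/// Groups fragments into rows, top to bottom. A fragment joins the current row
/// when its vertical centre is within `tolerance` of the row's first fragment.
func rows(from fragments: [Fragment], tolerance: CGFloat = 3) -> [[Fragment]] {
    // Page space has its origin at the bottom left, so sort by descending y.
    let sorted = fragments.sorted { $0.bounds.midY > $1.bounds.midY }
    var result: [[Fragment]] = []
    for fragment in sorted {
        if let anchor = result.last?.first,
           abs(anchor.bounds.midY - fragment.bounds.midY) <= tolerance {
            result[result.count - 1].append(fragment)
        } else {
            result.append([fragment])
        }
    }
    // Order each row left to right.
    return result.map { row in row.sorted { $0.bounds.minX < $1.bounds.minX } }
}

/// Builds one dictionary per row, keyed by column title. A fragment belongs to
/// the column whose horizontal extent contains the fragment's centre.
func table(from fragments: [Fragment], columns: [Column],
           tolerance: CGFloat = 3) -> [[String: String]] {
    rows(from: fragments, tolerance: tolerance).map { row -> [String: String] in
        var record: [String: String] = [:]
        for fragment in row {
            let centre = fragment.bounds.midX
            guard let column = columns.first(where: { $0.minX <= centre && centre < $0.maxX }) else {
                continue // Outside every known column: ignore.
            }
            if let existing = record[column.title] {
                record[column.title] = existing + " " + fragment.text
            } else {
                record[column.title] = fragment.text
            }
        }
        return record
    }
}
```

The column extents would come from the header bounding boxes, and the fragments from whichever PDFPage call works best for your files.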

    One caveat comes to mind:

    - if the program producing this PDF changes, the algorithms will almost surely need updating