gopdfpdftotext

How to extract table data from PDF?


My goal is to process a .pdf file to the memory. The problem is the output ignore the table, which results of concanated string.

The library used: https://github.com/ledongthuc/pdf

The code

package main

import (
    "bytes"
    "fmt"

    "github.com/ledongthuc/pdf"
)

func main() {
    pdf.DebugOn = true
    content, err := readPdf("accountnumberJul2022.pdf") // Read local pdf file
    if err != nil {
        panic(err)
    }
    fmt.Println(content)
    return
}

func readPdf(path string) (string, error) {
    f, r, err := pdf.Open(path)
    // remember close file
    defer f.Close()
    if err != nil {
        return "", err
    }
    var buf bytes.Buffer
    b, err := r.GetPlainText()
    if err != nil {
        return "", err
    }
    buf.ReadFrom(b)
    return buf.String(), nil
}

The PDF file: https://drive.google.com/file/d/14RFll7pZ8_J8ua-NDrw31QHe-4N16IJL/view?usp=sharing

The output

DATEDESCRIPTIONBRENTRYBALANCE01/07Beginning Balance1,000.0002/07TRSFE-BANKINGDB0207/DBXOA/SB24313/Q0321XXXXX56LAWSON1 DB999.0003/07TRSFE-BANKINGDB0307/DBXOA/SB24313/Q0321XXXXX56LAWSON2 DB997.0004/07TRSFE-BANKINGDB0407/DBXOA/SB24313/Q0321XXXXX56LAWSON3 DB994.0005/07TRSFE-BANKINGDB0507/DBXOA/SB24313/Q0321XXXXX56LAWSON4 DB990.0006/07TRSFE-BANKINGDB0607/DBXOA/SB24313/Q0321XXXXX56LAWSON5 DB985.0007/07TRSFE-BANKINGDB0707/DBXOA/SB24313/Q0321XXXXX56LAWSON6 DB979.0008/07TRSFE-BANKINGDB0807/DBXOA/SB24313/Q0321XXXXX56LAWSON7 DB972.0009/07TRSFE-BANKINGDB0907/DBXOA/SB24313/Q0321XXXXX56LAWSON8 DB964.00Continued on next pageDATEDESCRIPTIONBRENTRYBALANCE10/07TRSFE-BANKINGDB1007/DBXOA/SB24313/Q0321XXXXX56LAWSON9 DB955.0011/07TRSFE-BANKINGDB1107/DBXOA/SB24313/Q0321XXXXX56LAWSON10 DB945.0012/07TRSFE-BANKINGDB1207/DBXOA/SB24313/Q0321XXXXX56LAWSON11 DB934.0013/07TRSFE-BANKINGDB1307/DBXOA/SB24313/Q0321XXXXX56LAWSON12 DB922.0014/07TRSFE-BANKINGDB1407/DBXOA/SB24313/Q0321XXXXX56LAWSON13 DB909.0015/07TRSFE-BANKINGDB1507/DBXOA/SB24313/Q0321XXXXX56LAWSON14 DB895.0016/07INTEREST1517/07INTERESTTAX1909.00

What I've tried:

  1. I also tried the example for Read text grouped by rows and change the fmt.Println(word.S) to fmt.Print(word.S)

However, the output is more unreadable

>>>> row:  0
ATEDESCRIPTIONBRENTRYBALANCEBe00.ginning Balance1,000.00469NOSWTRSFE-BANKINGDB0207/DBXOA/SB24313/Q0321XXXXX56LAWSONAL659X99.00XXXXXTRSFE-BANKINGDB0307/DBXOA/SB24313/Q0321XXXXX56LAWSON1230997.00Q/313TRSFE-BANKINGDB0407/DBXOA/SB24313/Q0321XXXXX56LAWSON42BS994.00/AOXBTRSFE-BANKINGDB0507/DBXOA/SB24313/Q0321XXXXD56LAWSOND/70990.0090BDGTRSFE-BANKINGDB0607/DBXOA/SB24313/Q0321XXXXX56LAWSONNIKN985.00AB-EFTRSFE-BANKINGDB0707/DBXOA/SB24313/Q0321XXXXX56LAWSONSRT0979.000.279TRSFE-BANKINGDB0807/DBXOA/SB24313/Q0321XXXXX56LAWSON701/70080009/0770270/70/0770/6003/070/507/4/03 DBBD 2B4 DB5D 1 DB68 DB DBBD 7oontinued Cn next page>>>> row:  0
TATEDESCRIPTIONBRENTRYBALANCE00.90TRSFE-BANKINGDB1007/DBXOA/SB24313/Q0321XXXXX56LAWSON9XAT955.00TSERETRSFE-BANKINGDB1107/DBXOA/SB24313/Q0321XXXXX56LAWSONTNITS945.00ERETNTRSFE-BANKINGDB1207/DBXOA/SB24313/Q0321XXXXX56LAWSONI00.5934.0098NO/DSRSFE-BANKINGDB1307/DBXOA/SB24313/Q0321XXXXX56LAWSONWAL65922.00XXXXXTRSFE-BANKINGDB1407DBXOA/SB24313/Q0321XXXXX56LAWSON1230Q909.00/3134TRSFE-BANKINGDB1507/DBXOA/SB20/5110/0770/410707/60170//7137712/107/1141B3 BDBDDBD 1 0112 BD 11D9B 511

Solution

  • PDF files isn't meant to be machine-readable by design. And PDFs and how they are structured can vary quite a bit from one to another. So I doubt there will be any "solid solution" for parsing arbitrary PDF files. PDF files aren't necessary "structured" like your original spreadsheet file where the PDF might come from. It's more like vector graphics in the sense that it only contains positions and commands to plot the characters in the right place instead of containing the texts themselves.

    In your case, your particular PDF file seems somewhat well-structured. Extracting the content using qpdf reveals that:

    # part of the pdf content extracted, comments (#) added by me.
    BT
    /F4 14.666667 Tf
    1 0 0 -1 0 .47981739 Tm
    0 -13.2773438 Td <0027> Tj  # D
    10.5842743 0 Td <0024> Tj   # A
    8.6870575 0 Td <0037> Tj    # T
    8.9526215 0 Td <0028> Tj    # E
    ET
    Q
    Q
    q
    147.75 87.296265 149.25 23.148926 re
    W* n
    q
    .75 0 0 .75 152.25 92.546265 cm
    /G3 gs
    BT
    /F4 14.666667 Tf
    1 0 0 -1 0 .47981739 Tm
    0 -13.2773438 Td <0027> Tj    # D
    10.5842743 0 Td <0028> Tj     # E
    9.7756042 0 Td <0036> Tj      # S
    9.7756042 0 Td <0026> Tj      # C
    10.5842743 0 Td <0035> Tj     # R
    10.5842743 0 Td <002C> Tj     # I
    4.0719757 0 Td <0033> Tj      # P
    9.7756042 0 Td <0037> Tj      # T
    8.9526215 0 Td <002C> Tj      # I
    4.0719757 0 Td <0032> Tj      # O
    11.4001007 0 Td <0031> Tj     # N
    ET
    
    # some part skipped......
    
    BT
    /F4 14.666667 Tf
    1 0 0 -1 0 .47981739 Tm
    0 -13.2773438 Td <0037> Tj    # T 
    8.9526215 0 Td <0035> Tj      # R
    10.5842743 0 Td <0036> Tj     # S
    9.7756042 0 Td <0029> Tj      # F
    ET
    Q
    q
    .75 0 0 .75 152.25 152.993042 cm
    /G3 gs
    BT
    /F4 14.666667 Tf
    1 0 0 -1 0 .47981739 Tm
    0 -13.2773438 Td <0028> Tj    # E
    9.7756042 0 Td <0010> Tj      # -
    4.8806458 0 Td <0025> Tj      # B
    9.7756042 0 Td <0024> Tj      # A
    9.7756042 0 Td <0031> Tj      # N
    10.5842743 0 Td <002E> Tj     # K
    9.7756042 0 Td <002C> Tj      # I
    4.0719757 0 Td <0031> Tj      # N
    10.5842743 0 Td <002A> Tj     # G
    ET
    Q
    q
    .75 0 0 .75 152.25 165.641968 cm
    /G3 gs
    BT
    /F4 14.666667 Tf
    1 0 0 -1 0 .47981739 Tm
    0 -13.2773438 Td <0027> Tj    # D
    10.5842743 0 Td <0025> Tj     # B
    ET
    

    BT = Begin Text
    ET = End Text

    Writing a program similar to your library used https://github.com/ledongthuc/pdf or directly modifying the library to parse anything between one pair of BT and ET as a single text should be trivial. The hard part is recovering the column and row information of the spreadsheet (ie. which text belongs to which field). Since in the eye of the PDF reader, the spreadsheet lines are just a bunch of arbitrary lines and maybe sometimes arbitrary rectangles as well.

    Here's a demo program I wrote that firstly finds all rectangles, then put all texts into their corresponding rectangles, then sort & concat the texts within each fields to form the final result.

    func readPdf(path string) {
        r, err := pdf.Open(path)
        panic(err)
    
        // extract all rectangles
        var fieldRects []FieldRect
        p := r.Page(1)
        c := p.Content()
        // font := p.Font(p.Fonts()[0])
        // fmt.Printf("font.Widths(): %v\n", font.Widths())
        for _, r := range c.Rect {
            fieldRects = append(fieldRects, FieldRect{
                rect:  r,
                texts: nil,
            })
        }
    
        // put text(glyph) into their corresponding rectangles
        for _, t := range c.Text {
            for i := range fieldRects {
                fr := &fieldRects[i]
                if fr.rect.Min.X < t.X && fr.rect.Min.Y < t.Y &&
                    fr.rect.Max.X > t.X && fr.rect.Max.Y > t.Y {
                    fr.texts = append(fr.texts, t)
                }
            }
        }
    
        // these values can also be derived from font size to gain
        // even more robustness
        const NEWLINE_TOLERATION = 2
        // unfortunately the pdf you sent does not have proper font
        // width information, so this is the best we can get without
        // inferring width information from the glyph shape itself.
        const SPACE_TOLERATION = 11
    
        // sort text(glyph) by position within rectangles, then concat
        for i := range fieldRects {
            fr := &fieldRects[i]
            sort.Slice(fr.texts, func(i, j int) bool {
                deltaY := fr.texts[i].Y - fr.texts[j].Y
                if math.Abs(deltaY) < NEWLINE_TOLERATION { // tolerate some vertical deviation
                    return fr.texts[i].X < fr.texts[j].X // on the same line
                }
                return deltaY > 0 // not on the same line
            })
            for _, f := range fr.texts {
                if fr.lastPos != nil {
                    if fr.lastPos.Y-f.Y > NEWLINE_TOLERATION { // new line
                        fr.resultText += "\n"
                    }
                    if f.X-fr.lastPos.X > SPACE_TOLERATION { // space
                        fr.resultText += " "
                    }
                }
                fr.resultText += f.S
                fr.lastPos = &pdf.Point{X: f.X, Y: f.Y}
            }
            if fr.resultText == "" {
                continue
            }
            fmt.Printf("====== pos: %v, %v; text: \n%s\n", fr.rect.Min, fr.rect.Max, fr.resultText)
        }
    }
    

    Due to the lack of font width information in the PDF file you sent, there's no easy way to implement reliable space detection. This program yielded a somewhat readable but not great result:

    ====== pos: {0 0}, {794 1123}; text: 
    DATE DESCRIPTION BR ENTRY BALANCE
    01/07 Beginning Balance 1,000.00
    02/07 TRSF 0207/DBXO 1 DB 999.00
    E-BANKING A/SB24313/
    DB Q0321XXXX
    X56
    LAWSON
    03/07 TRSF 0307/DBXO 2 DB 997.00
    E-BANKING A/SB24313/
    DB Q0321XXXX
    X56
    LAWSON
    04/07 TRSF 0407/DBXO 3 DB 994.00
    E-BANKING A/SB24313/
    DB Q0321XXXX
    X56
    LAWSON
    05/07 TRSF 0507/DBXO 4 DB 990.00
    E-BANKING A/SB24313/
    DB Q0321XXXX
    X56
    LAWSON
    06/07 TRSF 0607/DBXO 5 DB 985.00
    E-BANKING A/SB24313/
    DB Q0321XXXX
    X56
    LAWSON
    07/07 TRSF 0707/DBXO 6 DB 979.00
    E-BANKING A/SB24313/
    DB Q0321XXXX
    X56
    LAWSON
    08/07 TRSF 0807/DBXO 7 DB 972.00
    E-BANKING A/SB24313/
    DB Q0321XXXX
    X56
    LAWSON
    09/07 TRSF 0907/DBXO 8 DB 964.00
    E-BANKING A/SB24313/
    DB Q0321XXXX
    X56
    LAWSON
     Continued on next page
    ====== pos: {372.75 87.296265}, {447 110.44519100000001}; text: 
    Continue
    ====== pos: {447.75 87.296265}, {522 110.44519100000001}; text: 
    d on next page
    ====== pos: {147.75 111.19519}, {297 134.34411599999999}; text: 
    X56
    LAWSON
    ====== pos: {147.75 135.094116}, {222 183.540893}; text: 
    TRSF
    E-BANKING
    DB
    ====== pos: {222.75 135.094116}, {297 208.83874500000002}; text: 
    X56
    LAWSON
    0907/DBXO
    A/SB24313/
    Q0321XXXX
    ====== pos: {147.75 209.58875}, {222 258.035527}; text: 
    TRSF
    E-BANKING
    DB
    ====== pos: {222.75 209.58875}, {297 283.33337900000004}; text: 
    X56
    LAWSON
    0807/DBXO
    A/SB24313/
    Q0321XXXX
    ====== pos: {147.75 284.08337}, {222 332.530147}; text: 
    TRSF
    E-BANKING
    DB
    ====== pos: {222.75 284.08337}, {297 357.827999}; text: 
    X56
    LAWSON
    0707/DBXO
    A/SB24313/
    Q0321XXXX
    ====== pos: {147.75 358.578}, {222 407.024777}; text: 
    TRSF
    E-BANKING
    DB
    ====== pos: {222.75 358.578}, {297 432.322629}; text: 
    X56
    LAWSON
    0607/DBXO
    A/SB24313/
    Q0321XXXX
    ====== pos: {147.75 433.07263}, {222 481.519407}; text: 
    TRSF
    E-BANKING
    DB
    ====== pos: {222.75 433.07263}, {297 506.81725900000004}; text: 
    X56
    LAWSON
    0507/DBXO
    A/SB24313/
    Q0321XXXX
    ====== pos: {147.75 507.56726}, {222 556.0140369999999}; text: 
    TRSF
    E-BANKING
    DB
    ====== pos: {222.75 507.56726}, {297 581.311889}; text: 
    X56
    LAWSON
    0407/DBXO
    A/SB24313/
    Q0321XXXX
    ====== pos: {147.75 582.06189}, {222 630.508667}; text: 
    TRSF
    E-BANKING
    DB
    ====== pos: {222.75 582.06189}, {297 655.806519}; text: 
    X56
    LAWSON
    0307/DBXO
    A/SB24313/
    Q0321XXXX
    ====== pos: {147.75 656.55652}, {222 705.003297}; text: 
    TRSF
    E-BANKING
    DB
    ====== pos: {222.75 656.55652}, {297 730.301149}; text: 
    nce
    0207/DBXO
    A/SB24313/
    Q0321XXXX
    

    Another reason for this to not work as well is the fact that the rectangles extracted from the document does not match how the cells looks in a PDF reader.

    In fact I believe this could work near perfectly if you could figure out a way to find all the cells and divide the document page into correctly positioned and sized cells. (either manually or automatically). And maybe use the trick of treating anything between a pair of BT and ET as a single piece of text instead of relying on the relative positions between glyphs to do that.

    But bear in mind that even if you managed to do that, this will still only work for this particular format and (for lack of a better word) "flavour" of PDF created by this particular software, and will be unlikely to work well on anything else.

    There is also commercial solutions available, like the framework behind https://www.ilovepdf.com/pdf_to_excel which I find working quite well for your particular document. These solutions tends to be more robust and reliable but do cost you money to purchase. Some of them can be used online though so maybe figuring a way to use their api might be a viable alternative. (if their terms of service allows you to do so, that is.)