go pdf text-extraction

How to extract text from pdf using golang?


I am trying to extract text from a PDF file in Go; see the code below. For some reason, it prints complete garbage (what looks like random numbers). I believe the text can be extracted, since I am able to copy and paste it from the file.

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "os"
    "strings"
    pdf "github.com/unidoc/unipdf/v3/model"
)

func main() {
    fmt.Println("Enter URL of PDF file:")
    reader := bufio.NewReader(os.Stdin)
    url, err := reader.ReadString('\n')
    if err != nil {
        log.Fatal(err)
    }
    url = strings.TrimSpace(url)

    // Fetch PDF from URL.
    resp, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    buf, _ := ioutil.ReadAll(resp.Body)
    pdfReader, err := pdf.NewPdfReader(bytes.NewReader(buf))
    if err != nil {
        log.Fatal(err)
    }

    // Parse PDF file.
    isEncrypted, err := pdfReader.IsEncrypted()
    if err != nil {
        log.Fatal(err)
    }

    // If PDF is encrypted, exit with message.
    if isEncrypted {
        fmt.Println("Error: PDF is encrypted.")
        os.Exit(1)
    }

    // Get number of pages.
    numPages, err := pdfReader.GetNumPages()
    if err != nil {
        log.Fatal(err)
    }
    // Iterate through pages and print text.
    for i := 1; i <= numPages; i++ {
        page, err := pdfReader.GetPage(i)
        if err != nil {
            log.Fatal(err)
        }
        text, err := page.GetAllContentStreams()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(text)
    }
}

Solution

  • GetAllContentStreams returns the page's raw content streams, which can contain formatting, graphics, image, and other operators alongside the text. That is probably why the output looks like complete garbage (random numbers).

    GetAllContentStreams gets all the content streams for a page as one string

    Instead of GetAllContentStreams, we can use the extractor's ExtractText method to extract the text.

    ExtractText processes and extracts all text data in content streams and returns as a string.

    Note that the package needs a license API key to work.

    https://github.com/unidoc/unipdf

    This software package (unipdf) is a commercial product and requires a license code to operate.

    To get a metered license API key for free in the Free Tier, sign up at https://cloud.unidoc.io

    The unipdf example code can be found in the unidoc/unipdf-examples repository on GitHub.

    Here is the updated code:

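    import (
        // Add these two imports to the ones already in your code.
        "github.com/unidoc/unipdf/v3/common/license"
        "github.com/unidoc/unipdf/v3/extractor"
    )

    func init() {
        // Make sure to load your metered License API key prior to using the library.
        // If you need a key, you can sign up and create a free one at https://cloud.unidoc.io
        err := license.SetMeteredKey("your-metered-api-key")
        if err != nil {
            panic(err)
        }
    }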
    
    func main() {
        //
        // The other blocks in your code
        //
    
        // Iterate through pages and print text.
        for i := 1; i <= numPages; i++ {
            // GetPage takes a 1-based page number, so i is already the page number.
            page, err := pdfReader.GetPage(i)
            if err != nil {
                log.Fatal(err)
            }
            ex, err := extractor.New(page)
            if err != nil {
                log.Fatal(err)
            }
            text, err := ex.ExtractText()
            if err != nil {
                log.Fatal(err)
            }
    
            fmt.Println("------------------------------")
            fmt.Printf("Page %d:\n", i)
            // Println rather than Printf, so a '%' in the text is not treated as a format verb.
            fmt.Println(text)
            fmt.Println("------------------------------")
        }
    }
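
To see why printing the raw content stream looks like garbage, it may help to look at what a content stream contains. The sketch below uses only the standard library and a small hand-written stream (`sampleStream` is an assumption for illustration, not taken from the PDF in the question). It pulls out the literal strings passed to the `Tj` (show text) operator and skips the drawing operators around them. The helper `naiveText` is hypothetical and far simpler than what ExtractText does: it ignores font encodings, `TJ` arrays, escape sequences, and text positioning, so it is not a substitute for a real extractor.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sampleStream is a hand-written, uncompressed content stream (an assumption
// for illustration). BT/ET delimit a text object, Tf selects a font, Td moves
// the text position, and Tj shows a string. The last two lines set a line
// width and stroke a line, which is the kind of non-text noise a page holds.
const sampleStream = `BT
/F1 12 Tf
72 720 Td
(Hello, ) Tj
(World) Tj
ET
0.5 w
100 100 m 200 200 l S`

// showText matches a simple literal string operand followed by the Tj operator.
// Strings with nested parentheses or backslash escapes are deliberately skipped.
var showText = regexp.MustCompile(`\(([^()\\]*)\)\s*Tj`)

// naiveText returns the literal strings shown with Tj, in stream order.
func naiveText(stream string) []string {
	var out []string
	for _, m := range showText.FindAllStringSubmatch(stream, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	// Printing sampleStream directly would show every operator and operand;
	// joining only the Tj strings gives the readable text.
	fmt.Println(strings.Join(naiveText(sampleStream), ""))
}
```

In real PDFs the strings passed to `Tj` are often glyph codes in a font-specific encoding rather than readable characters, which is one reason to rely on the library's extractor instead of parsing the stream by hand.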