I am trying to extract text from a PDF file in Go. See the code below. For some reason, it prints complete garbage (what look like random numbers). Here is the PDF. I believe the text can be extracted, since I am able to copy and paste it from this file.
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"strings"

	pdf "github.com/unidoc/unipdf/v3/model"
)

func main() {
	fmt.Println("Enter URL of PDF file:")
	reader := bufio.NewReader(os.Stdin)
	url, err := reader.ReadString('\n')
	if err != nil {
		log.Fatal(err)
	}
	url = strings.TrimSpace(url)

	// Fetch PDF from URL.
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	buf, _ := ioutil.ReadAll(resp.Body)
	pdfReader, err := pdf.NewPdfReader(bytes.NewReader(buf))
	if err != nil {
		log.Fatal(err)
	}

	// Parse PDF file.
	isEncrypted, err := pdfReader.IsEncrypted()
	if err != nil {
		log.Fatal(err)
	}

	// If PDF is encrypted, exit with message.
	if isEncrypted {
		fmt.Println("Error: PDF is encrypted.")
		os.Exit(1)
	}

	// Get number of pages.
	numPages, err := pdfReader.GetNumPages()
	if err != nil {
		log.Fatal(err)
	}

	// Iterate through pages and print text.
	for i := 1; i <= numPages; i++ {
		page, err := pdfReader.GetPage(i)
		if err != nil {
			log.Fatal(err)
		}
		text, err := page.GetAllContentStreams()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(text)
	}
}
GetAllContentStreams returns the raw content streams of the page. Besides the text, these contain formatting, graphics, and image operators with their numeric operands, which is most likely why the output looks like complete garbage (seemingly random numbers). From the documentation:
GetAllContentStreams gets all the content streams for a page as one string.
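To illustrate why a raw content stream reads like numbers, here is a minimal sketch using only the standard library. The sample stream is hand-written and hypothetical, and `literalStrings` is a toy helper invented for this illustration; it is not how unipdf extracts text, and it ignores hex strings, font encodings, and `TJ` arrays that real PDFs commonly use.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleStream is a hand-written, hypothetical content stream. Real streams
// are usually much longer and often use hex strings or font-specific encodings.
const sampleStream = `q
1 0 0 1 72 712 cm
BT
/F1 12 Tf
0 0 Td
(Hello) Tj
0 -14 Td
(World) Tj
ET
Q`

// literalStrings returns the contents of every (...) string literal in a
// content stream. Nested parentheses are kept, and the character following a
// backslash is copied as-is (escape sequences such as \n are not decoded).
// This is a toy for illustration, not a PDF parser.
func literalStrings(stream string) []string {
	var out []string
	var cur strings.Builder
	depth := 0
	escaped := false
	for _, r := range stream {
		switch {
		case depth == 0:
			// Outside a string literal: skip operators and numbers.
			if r == '(' {
				depth = 1
			}
		case escaped:
			cur.WriteRune(r)
			escaped = false
		case r == '\\':
			escaped = true
		case r == '(':
			depth++
			cur.WriteRune(r)
		case r == ')':
			depth--
			if depth == 0 {
				out = append(out, cur.String())
				cur.Reset()
			} else {
				cur.WriteRune(r)
			}
		default:
			cur.WriteRune(r)
		}
	}
	return out
}

func main() {
	// Printing the raw stream shows operators and numbers, not readable text.
	fmt.Println(sampleStream)
	// Keeping only the string operands recovers the visible words.
	fmt.Println(strings.Join(literalStrings(sampleStream), " "))
}
```

The first print shows roughly what printing the result of GetAllContentStreams looks like; the second prints `Hello World`. A proper extractor also has to map character codes through the page's fonts, which is why a library method is the better choice here.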
Instead of GetAllContentStreams, we can use the ExtractText method from the extractor package to extract the text.
ExtractText processes and extracts all text data in content streams and returns as a string.
Note that the package requires a license API key to use:
https://github.com/unidoc/unipdf
This software package (unipdf) is a commercial product and requires a license code to operate.
To get a metered license API key for free in the Free Tier, sign up at https://cloud.unidoc.io
The unipdf example code can be found here.
Here is the updated code:
import (
	// ... the imports from your original code, plus:
	"github.com/unidoc/unipdf/v3/common/license"
	"github.com/unidoc/unipdf/v3/extractor"
)

func init() {
	// Make sure to load your metered License API key prior to using the library.
	// If you need a key, you can sign up and create a free one at https://cloud.unidoc.io
	err := license.SetMeteredKey("your-metered-api-key")
	if err != nil {
		panic(err)
	}
}

func main() {
	//
	// The other blocks in your code
	//

	// Iterate through pages and print text.
	// GetPage is 1-based, so i is also the page number to display.
	for i := 1; i <= numPages; i++ {
		page, err := pdfReader.GetPage(i)
		if err != nil {
			log.Fatal(err)
		}
		ex, err := extractor.New(page)
		if err != nil {
			log.Fatal(err)
		}
		text, err := ex.ExtractText()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("------------------------------")
		fmt.Printf("Page %d:\n", i)
		// Println rather than Printf, so '%' in the text is not treated as a format verb.
		fmt.Println(text)
		fmt.Println("------------------------------")
	}
}