r imagemagick ocr tesseract xpdf

Tesseract "Error in pixCreateNoInit: pix_malloc fail for data"


I'm trying to run this function, which is loosely based on this. However, since xPDF can convert PDFs to PNGs directly, I skipped the ImageMagick conversion step. I also dropped the faulty logic in the function(i) process: pdftopng requires a root name, so the output here is "ocrbook-000001.png", and the original code throws an error when it looks for a PNG named after the original PDF.

My issue is now with getting Tesseract to do anything with my PNG files. I get the error:

Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.

Here is my code:

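lapply(myfiles, function(i){
  # Render pages 1-10 of the PDF to PNG at 600 DPI, using the root name "ocrbook"
  shell(shQuote(paste0("pdftopng -f 1 -l 10 -r 600 ", i, " ocrbook")))

  # Run Tesseract on every PNG found in dest, then delete the PNG
  mypngs <- list.files(path = dest, pattern = "png", full.names = TRUE)
  lapply(mypngs, function(z){
    shell(shQuote(paste0("tesseract ", z, " out")))
    file.remove(z)
  })
})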

Solution

  • The issue, apparently, was that the DPI was set too high for Tesseract to handle. Changing the resolution (-r) passed to pdftopng from 600 to 150 appears to have corrected it. The "pix_malloc fail" message suggests that Leptonica could not allocate memory for the image, so there seems to be a practical upper limit on the image size Tesseract can process.

    I have also changed my code from a static naming convention to a dynamic one that reuses each file's original name.

      dest <- "C:\\users\\YOURNAME\\desktop"

      # Full paths of the PDFs in dest, without the ".pdf" extension
      files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE))
      lapply(files, function(i){
        # Render pages 1-10 at 150 DPI; the PDF's own name is the output root
        shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 150 ", i, ".pdf", " ", i)))
      })

      # Convert each PPM page to TIFF with ImageMagick, then delete the PPM
      myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.ppm$", full.names = TRUE))
      lapply(myppms, function(y){
        shell(shQuote(paste0("magick ", y, ".ppm", " ", y, ".tif")))
        file.remove(paste0(y, ".ppm"))
      })

      # OCR each TIFF into a text file with the same base name, then delete the TIFF
      mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.tif$", full.names = TRUE))
      lapply(mytiffs, function(z){
        shell(shQuote(paste0("tesseract ", z, ".tif", " ", z)))
        file.remove(paste0(z, ".tif"))
      })
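
To get a feel for why lowering the DPI might help, here is a rough back-of-the-envelope sketch of how much memory one decoded page needs at different resolutions. The function name raster_bytes is made up for illustration, and the page size (US Letter, 8.5 x 11 in) and 4 bytes per pixel are assumptions, not values taken from my actual files or from the Tesseract source.

```r
# Rough estimate of the memory needed to hold one rasterized page.
# Assumptions (hypothetical, for illustration only): US Letter pages
# (8.5 x 11 in) and 4 bytes per pixel for a decoded color image.
raster_bytes <- function(dpi, width_in = 8.5, height_in = 11, bytes_per_pixel = 4) {
  width_px  <- ceiling(width_in * dpi)   # page width in pixels
  height_px <- ceiling(height_in * dpi)  # page height in pixels
  width_px * height_px * bytes_per_pixel
}

# Compare a few common rendering resolutions
dpis <- c(150, 300, 600)
estimate <- data.frame(
  dpi = dpis,
  megabytes = round(raster_bytes(dpis) / 1024^2, 1)
)
print(estimate)
```

Pixel count grows with the square of the DPI, so going from 150 to 600 DPI multiplies the memory per page by 16. Whether that is enough to make the allocation fail will depend on the page size and on the Tesseract build, so treat the numbers as an order-of-magnitude guide only.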