google-apps-script, pdf, google-drive-api

Get text from PDF in Google


I have a PDF document that is saved in Google Drive. I can use the Google Drive Web UI search to find text in the document.

How can I programmatically extract a portion of the text in the document using Google Apps Script?


Solution

  • See pdfToText() in this gist.

    To run the OCR built into Google Drive on a PDF file, e.g. myPDF.pdf, call it like this:

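    function myFunction() {
      // getFilesByName() returns an iterator, so check that a match exists
      var files = DriveApp.getFilesByName("myPDF.pdf");
      if (!files.hasNext()) {
        throw new Error("myPDF.pdf was not found in Drive");
      }
      var blob = files.next().getBlob();

      // Get the text from the PDF using pdfToText() from the gist
      var filetext = pdfToText(blob, {keepTextfile: false});

      // Now do whatever you want with filetext...
    }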
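
Since the question asks for only a portion of the text, here is a minimal sketch of one way to slice the returned string. It assumes the wanted text sits between two known marker strings. The helper name extractBetween, the markers, and the sample string are all hypothetical and are not part of the gist. OCR output can vary in spacing and line breaks, so real markers may need loosening.

```javascript
/**
 * Return the trimmed text between startMarker and the next endMarker,
 * or null if either marker is missing. Hypothetical helper, not from the gist.
 */
function extractBetween(text, startMarker, endMarker) {
  var start = text.indexOf(startMarker);
  if (start === -1) return null;
  start += startMarker.length;

  // Search for the end marker only after the start marker
  var end = text.indexOf(endMarker, start);
  if (end === -1) return null;

  return text.substring(start, end).trim();
}

// Made-up string standing in for the filetext returned by pdfToText()
var sample = "Invoice No: 12345\nDate: 2015-01-01\nTotal: 99.00";
var invoiceNo = extractBetween(sample, "Invoice No:", "\n");
```

In myFunction() above, the same call would be made on filetext instead of sample. A regular expression with String.prototype.match would work as well when the surrounding text is less predictable.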