java tesseract tess4j

Add OCR layer to existing PDF without the need to write to file system


I'm trying to take a scanned PDF document and add an OCR layer on top. The following code achieves this:

    public void ocrFile(PDDocument pdDocument, File file) throws TesseractException, IOException {
        // Extract any text already present in the PDF (not used further below)
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String text = pdfStripper.getText(pdDocument);

        Tesseract instance = new Tesseract(); // JNA Interface Mapping
        // Unpack the bundled tessdata resources and point Tesseract at them
        File tessDataFolder = LoadLibs.extractTessResources("tessdata");
        instance.setDatapath(tessDataFolder.getAbsolutePath());

        // Request PDF output (page image plus OCR text layer)
        List<RenderedFormat> list = new ArrayList<>();
        list.add(RenderedFormat.PDF);

        // The output base is the input path without its extension
        String outputFileName = FilenameUtils.removeExtension(file.getAbsolutePath());
        instance.createDocuments(file.getAbsolutePath(), outputFileName, list);
    }

This writes the PDF, with the OCR layer in place, to a specific location on disk. I'd like to change this so the application does not need to write any files to disk, but I'm not sure whether that can be done.

Ideally I'd like to replace the File input of ocrFile with a MultipartFile and have the method return the processed document, so the file system is not involved at all. Is this achievable?


Solution

  • No, it cannot be done. Tesseract's TessResultRenderer API writes its output to physical files, which is why createDocuments requires the outputbase parameter: it specifies the name of the output file.
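
Since the renderer insists on files, one possible workaround is to confine the disk access to a short-lived temporary directory and hand the result back as bytes. The following is only a sketch under assumptions: `TempFileOcr`, `OcrStep`, and `ocrToBytes` are hypothetical names, the OCR call is passed in as a callback so the helper itself depends only on the JDK, and the renderer is assumed to write its result to outputbase + ".pdf". It still touches the file system; it just keeps that out of the caller's view.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper: runs a file-based OCR step inside a temp directory
// and returns the produced PDF as a byte array.
final class TempFileOcr {

    /** Hypothetical callback standing in for the file-based OCR call. */
    @FunctionalInterface
    public interface OcrStep {
        void run(String inputPath, String outputBase) throws Exception;
    }

    static byte[] ocrToBytes(byte[] inputPdf, OcrStep step) throws Exception {
        Path dir = Files.createTempDirectory("ocr-");
        try {
            // Write the uploaded bytes where the OCR step can read them
            Path input = dir.resolve("input.pdf");
            Files.write(input, inputPdf);

            // Assumption: the step writes its result to outputBase + ".pdf"
            String outputBase = dir.resolve("output").toString();
            step.run(input.toString(), outputBase);

            return Files.readAllBytes(Paths.get(outputBase + ".pdf"));
        } finally {
            // Delete the temp directory and everything in it, children first
            try (Stream<Path> paths = Files.walk(dir)) {
                paths.sorted(Comparator.reverseOrder())
                     .forEach(p -> p.toFile().delete());
            }
        }
    }
}
```

With the code from the question, the callback would presumably be `(in, base) -> instance.createDocuments(in, base, list)`, and the input bytes could come from `MultipartFile.getBytes()`. The returned array can then be wrapped in whatever response type the application uses.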