I'm trying to take a scanned PDF document and add an OCR layer on top. The following code achieves this:
    public void ocrFile(PDDocument pdDocument, File file) throws TesseractException, IOException {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String text = pdfStripper.getText(pdDocument);
        Tesseract instance = new Tesseract(); // JNA Interface Mapping
        File tessDataFolder = LoadLibs.extractTessResources("tessdata");
        instance.setDatapath(tessDataFolder.getAbsolutePath());
        List<RenderedFormat> list = new ArrayList<RenderedFormat>();
        list.add(RenderedFormat.PDF);
        String outputFileName = FilenameUtils.removeExtension(file.getAbsolutePath());
        instance.createDocuments(file.getAbsolutePath(), outputFileName, list);
    }
This writes the PDF with the OCR layer to a specific location on disk. I'd like to change this so the application does not need to write any files to disk, but I'm not sure whether this can be done.
Ideally, I'd like to replace the File parameter of ocrFile with a MultipartFile and have the method return a MultipartFile as well, removing the need to involve the file system. Is this achievable?
No, it cannot be done. Tesseract's TessResultRenderer API writes its output to physical files, which is why the outputbase parameter is required: it specifies the name of the output file.
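Since the renderer has to write to a file, one possible workaround is to confine the disk access to a short-lived temporary directory and hand the result back as a byte array. This does not avoid the file system; it only hides it from the caller. The sketch below uses only the JDK. The class name TempFileOcr, the PdfRenderer callback, and the assumption that the renderer writes its result to outputBase + ".pdf" are all hypothetical and introduced here for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class TempFileOcr {

    /**
     * Hypothetical callback standing in for the OCR engine. It is assumed to
     * read the file at `input` and write its result to `outputBase + ".pdf"`.
     */
    @FunctionalInterface
    public interface PdfRenderer {
        void render(Path input, String outputBase) throws Exception;
    }

    /** Runs the renderer inside a temporary directory and returns the output bytes. */
    public static byte[] ocrToBytes(byte[] scannedPdf, PdfRenderer renderer) throws Exception {
        Path workDir = Files.createTempDirectory("ocr-");
        try {
            // The renderer needs a real file as input, so write the bytes out first.
            Path input = workDir.resolve("input.pdf");
            Files.write(input, scannedPdf);

            // Output base name without extension; the renderer is assumed to append ".pdf".
            String outputBase = workDir.resolve("output").toString();
            renderer.render(input, outputBase);

            return Files.readAllBytes(Paths.get(outputBase + ".pdf"));
        } finally {
            // Remove the temporary directory and everything in it, even on failure.
            deleteRecursively(workDir);
        }
    }

    private static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order deletes children before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

With the code from the question, the callback would presumably be (in, base) -> instance.createDocuments(in.toString(), base, list), and the input bytes could come from MultipartFile.getBytes(). Returning a byte array rather than a MultipartFile is a design choice of this sketch: the caller can wrap the bytes in whatever response type the web layer expects.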