I'm trying to extract text from a large PDF with the code below. The file comes from a blob on Azure; it is 7.3 MB, has 140 pages, and every page is an image. The request always hits the timeout.
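import os

# tika reads TIKA_SERVER_ENDPOINT at import time, so set it before importing.
os.environ['TIKA_SERVER_ENDPOINT'] = 'http://0.0.0.0:9998/'

from tika import parser

headers = {
    "X-Tika-OCRLanguage": "eng+nor",
    "X-Tika-PDFextractInlineImages": "true",  # run OCR against inline images
}

# buffer is the download stream for the Azure blob
data = parser.from_buffer(
    buffer.readall(),
    xmlContent=True,
    requestOptions={
        "headers": headers,
        "timeout": 3600,  # seconds
    },
)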
Is there a header I'm missing for handling large files?
I'm running tika-server directly from the Docker image with this command:
docker run -d -p 9998:9998 apache/tika:1.28.2-full
Thanks for your time!
I think I've managed to solve the problem. I only needed to change the headers: I dropped X-Tika-PDFextractInlineImages and set X-Tika-PDFocrStrategy to auto. For the moment it's working:
headers = {
    "X-Tika-OCRLanguage": "eng+nor",
    "X-Tika-PDFocrStrategy": "auto",
}
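For anyone who wants the whole thing in one place, here is a minimal sketch that combines the working headers with the original request. The helper names build_request_options and extract_xml are my own, not part of tika-python, and the endpoint and timeout defaults are just the values from the question. It passes the endpoint through the serverEndpoint argument instead of the environment variable, which I assume behaves the same for a server already running in Docker.

```python
def build_request_options(languages, ocr_strategy="auto", timeout_s=3600):
    """Build the requestOptions dict passed to tika's parser.from_buffer.

    Hypothetical helper: it only assembles the headers from the answer above.
    """
    headers = {
        # Tesseract language codes joined with "+", e.g. "eng+nor"
        "X-Tika-OCRLanguage": "+".join(languages),
        # The header that replaced X-Tika-PDFextractInlineImages
        "X-Tika-PDFocrStrategy": ocr_strategy,
    }
    return {"headers": headers, "timeout": timeout_s}


def extract_xml(pdf_bytes, endpoint="http://0.0.0.0:9998/"):
    """Send the PDF bytes to a running Tika server and return the parsed result.

    Hypothetical helper: assumes the tika package is installed and that a
    server is already listening at `endpoint`.
    """
    # Imported here so the helper above can be used without tika installed.
    from tika import parser

    return parser.from_buffer(
        pdf_bytes,
        serverEndpoint=endpoint,
        xmlContent=True,
        requestOptions=build_request_options(["eng", "nor"]),
    )
```

With the blob stream from the question this would be called as data = extract_xml(buffer.readall()). I have only tested these headers against apache/tika:1.28.2-full, so other versions may behave differently.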