I am trying to parse a Word document through Tika using the Tika-Python library (https://github.com/chrismattmann/tika-python) in Python 2.7 (I know it is deprecated, but a few other dependencies only work in Python 2). For a few of the larger documents, however, I am unable to get the parsed data. I am using the code snippet below to parse the document.
from tika import parser

headers = {
    "X-Tika-OCRLanguage": "eng",
    "timeout": 300,
    "pool_timeout": 300,
    "X-Tika-OCRTimeout": 300
}
text_tika = parser.from_file(doc, xmlContent=False, requestOptions={'headers': headers})
This snippet throws the following error:
ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='localhost', port=9998): Read timed out. (read timeout=60)",),)
I tried various request options to increase the read timeout, but none of them worked. Can anybody please help?
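I found the issue, thanks to the repository owner @chrismattmann, who pointed out that the timeout parameter must sit outside the headers dictionary, as a top-level key of requestOptions. Placed inside headers, it had no effect on the client's read timeout, which is why the error still reported read timeout=60. The code should look like this to work: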
headers = {
    "X-Tika-OCRLanguage": "eng",
    "X-Tika-OCRTimeout": "300"
}
text_tika = parser.from_file(doc, xmlContent=False, requestOptions={'headers': headers, 'timeout': 300})
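If the same mistake might creep in elsewhere, a small wrapper can keep the two kinds of settings apart. This is only a rough sketch: `build_request_options`, `parse_with_timeout` and `NOT_HEADERS` are hypothetical names of my own, not part of tika-python, and the only library call used is the `parser.from_file(..., requestOptions=...)` form shown above. It drops the timeout keys from the headers, converts the remaining header values to strings, and passes the timeout as a top-level key.

```python
# Hypothetical helper names; nothing here is part of the tika-python API
# except parser.from_file and its requestOptions argument.

# Keys that are client-side settings and should not be sent as HTTP headers.
NOT_HEADERS = ("timeout", "pool_timeout")


def build_request_options(headers, timeout=300):
    """Return a requestOptions dict with the timeout outside the headers."""
    clean_headers = dict(
        (key, str(value))
        for key, value in headers.items()
        if key not in NOT_HEADERS
    )
    return {"headers": clean_headers, "timeout": timeout}


def parse_with_timeout(path, headers, timeout=300):
    """Parse a file through Tika with a longer client read timeout."""
    # Imported lazily so build_request_options can be used without tika installed.
    from tika import parser

    options = build_request_options(headers, timeout)
    return parser.from_file(path, xmlContent=False, requestOptions=options)
```

With this, `parse_with_timeout(doc, headers)` would work even with the original headers dictionary from the question, since the misplaced `timeout` and `pool_timeout` entries are filtered out before the request is made.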