python apache-tika

Tika-Python library throws read timeout error for large word document


I am trying to parse a Word document through Tika using the Tika-Python library (https://github.com/chrismattmann/tika-python) in Python 2.7 (I know it is deprecated, but a few of my other dependencies only work in Python 2). For some of the larger documents I am unable to get the parsed data. I am using the code snippet below to parse the document.

from tika import parser

headers = {
    "X-Tika-OCRLanguage": "eng",
    'timeout': 300,
    'pool_timeout': 300,
    "X-Tika-OCRTimeout": 300
}
# doc is the path to the Word document
text_tika = parser.from_file(doc, xmlContent=False, requestOptions={'headers': headers})

This snippet throws the following error:

ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='localhost', port=9998): Read timed out. (read timeout=60)",),)

I tried various request options to increase the read timeout, but none of them worked. Can anybody please help?


Solution

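  • I found the issue, thanks to the repository owner @chrismattmann, who pointed out that the timeout parameter belongs outside the headers dictionary: it must be a top-level key of requestOptions. Placed inside headers, it had no effect on the client's read timeout, which stayed at the 60 seconds shown in the error. The code above should look like this to work: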

    headers = {
        "X-Tika-OCRLanguage": "eng",
        "X-Tika-OCRTimeout": "300"
    }
    # 'timeout' sits next to 'headers' in requestOptions, not inside it
    text_tika = parser.from_file(doc, xmlContent=False, requestOptions={'headers': headers, 'timeout': 300})
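
    If you parse many documents, it may help to build the options in one place so the timeout cannot slip back into the headers. The following is only a minimal sketch of the fix above: build_request_options and parse_document are hypothetical helper names of my own, not part of Tika-Python, and the tika package is imported lazily so that the first helper can be used and checked without it.

```python
def build_request_options(timeout_seconds, ocr_language="eng"):
    """Return a requestOptions dict with the timeout kept outside the headers.

    Hypothetical helper for illustration; not part of Tika-Python.
    """
    headers = {
        # X-Tika-* entries are sent to the Tika server as HTTP headers.
        "X-Tika-OCRLanguage": ocr_language,
        # Header values are strings, so the number is converted explicitly.
        "X-Tika-OCRTimeout": str(timeout_seconds),
    }
    # 'timeout' is a sibling of 'headers', as in the corrected call above.
    return {"headers": headers, "timeout": timeout_seconds}


def parse_document(path, timeout_seconds=300):
    """Parse `path` with Tika-Python using the options built above."""
    from tika import parser  # lazy import; requires the tika package
    return parser.from_file(
        path,
        xmlContent=False,
        requestOptions=build_request_options(timeout_seconds),
    )
```

    Calling parse_document(doc) then behaves like the corrected snippet, and parse_document(doc, timeout_seconds=600) raises both the client read timeout and the OCR timeout header together. Whether the two values should always match is an assumption of this sketch; adjust it if your OCR step needs a different limit.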