I'm currently using tika to extract the text from PDF files. I found a very fast method within the tika module. This method is called unpack.
This is my code:
from tika import unpack
text = unpack.from_file('example.pdf')['content']
However, once in a while (not always!) I get this warning:
2018-11-02 15:30:25,533 [MainThread ] [WARNI] Failed to see startup log message; retrying...
After retrying, the code works. However, I don't understand the warning, and the retry takes time. Does anyone have an idea why I get this warning?
This is the github page: https://github.com/chrismattmann/tika-python
Tika-python is a Python binding for Apache Tika. It binds to Apache Tika over HTTP, using the REST service that Tika exposes. If you run it in client-only mode, it simply talks to the URL you provide. Otherwise it starts an Apache Tika server locally and talks to that.
I am assuming you are not running Tika in client-only mode, so the library will spin up an Apache Tika server. It verifies that the server was spawned successfully by checking the Tika log file for the message "Started Apache Tika server at". This check is repeated up to a limit, with a time delay between attempts. See the tika-python source for the details.
You are seeing the message because, at the time of the first check, the Apache Tika server has not finished starting yet.
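To make the retry behaviour concrete, here is a minimal sketch of that kind of check. It is not the library's actual code: the helper name wait_for_startup and its parameters are made up for illustration, and only the marker string and the warning text come from the messages quoted above.

```python
import time
from pathlib import Path

# Startup message quoted in the answer; everything else here is hypothetical.
STARTUP_MARKER = "Started Apache Tika server at"


def wait_for_startup(log_path, max_tries=3, delay=1.0, sleep=time.sleep):
    """Poll a log file until the startup marker appears.

    Returns the number of retries that were needed. Raises RuntimeError
    if the marker is still missing after max_tries checks.
    """
    path = Path(log_path)
    for attempt in range(max_tries):
        if path.exists() and STARTUP_MARKER in path.read_text(errors="replace"):
            return attempt
        if attempt < max_tries - 1:
            # The first check tends to run before the server has written
            # anything, which is when a warning like this would be printed.
            print("Failed to see startup log message; retrying...")
            sleep(delay)
    raise RuntimeError("server did not report startup")
```

With a check like this, a single "retrying" line only means the first look at the log came too early. The cost is one delay period, not a failure.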
I don't think the warning should be of any consequence here, since the first check happens right after the command to start the server. I am not sure how the library should handle this; you could argue that it should log the message at info level instead. Increasing the time delay is not going to help either, because the first check is done right after the start command.
On a side note, I am not sure whether the verification handles older messages: if you call unpack twice, does the library make sure that the log file from the previous run no longer exists?
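One way a check could guard against stale messages is to note the size of the log before starting the server and then look only at what was written afterwards. I have not verified whether tika-python does anything like this; the sketch below only illustrates the idea, and log_size and started_since are hypothetical names.

```python
import os

# Startup message quoted in the answer; the helpers are hypothetical.
STARTUP_MARKER = "Started Apache Tika server at"


def log_size(log_path):
    """Current size of the log in bytes, or 0 if it does not exist yet."""
    return os.path.getsize(log_path) if os.path.exists(log_path) else 0


def started_since(log_path, offset):
    """True if the marker appears in the part of the log written after offset."""
    if not os.path.exists(log_path):
        return False
    with open(log_path, "rb") as handle:
        handle.seek(offset)  # skip everything left over from earlier runs
        return STARTUP_MARKER.encode() in handle.read()
```

The caller would record offset = log_size(path) just before launching the server and pass it to started_since on every check, so a marker left by a previous run is never mistaken for a fresh start.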