Has anyone run into a similar error?
At my company we run Airflow deployed on Kubernetes in Azure.
We have a DAG that extracts information from different documents. Among other things, we use Tika to extract the XML from each document.
The flow would be:
Some facts about the task using tika-server:
This is our task inside Airflow:
text_extraction = KubernetesPodOperator(
    task_id="text_extraction",
    name="text_extraction",
    namespace=DEFAULT_NAMESPACE,
    image_pull_secrets=[k8s.V1LocalObjectReference('acr-pull')],
    image=image_text_tools,
    arguments=[
        "tika-text-extract",
        "--input-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.input_file_name}",
        "--xml-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.xml_file_name}",
        "--metadata-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.metadata_file_name}",
        "--ocr",
    ],
    get_logs=True,
    is_delete_operator_pod=True,
    startup_timeout_seconds=300,
    volumes=[VOLUME.volume],
    volume_mounts=[VOLUME.volume_mount1],
    max_active_tis_per_dag=3,
    retries=3,
    retry_delay=timedelta(minutes=1),
)
Here is the error, although I don't think it helps much:
[2022-03-02, 09:27:33 UTC] {pod_manager.py:203} INFO - [cli.py: - parse_document() ] Extracting text with OCR enabled from: /opt/airflow/data/61d45f641b57d80819f9448f/6218edbbe40ccbfe96c6bdcd/20220225-145515_file/file
[2022-03-02, 09:27:34 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:34 UTC [MainThread ] [WARNI] Failed to see startup log message; retrying...
[2022-03-02, 09:27:34 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:39 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:39 UTC [MainThread ] [WARNI] Failed to see startup log message; retrying...
[2022-03-02, 09:27:39 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:44 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:44 UTC [MainThread ] [WARNI] Failed to see startup log message; retrying...
[2022-03-02, 09:27:44 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:49 UTC [MainThread ] [ERROR] Tika startup log message not received after 3 tries.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Tika startup log message not received after 3 tries.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:49 UTC [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - [tika.py: - checkTikaServer() ] Failed to receive startup confirmation from startServer.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - Traceback (most recent call last):
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/text-tools/cli.py", line 128, in <module>
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - app()
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return get_command(self)(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return self.main(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - rv = self.invoke(ctx)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return _process_result(sub_ctx.command.invoke(sub_ctx))
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return ctx.invoke(self.callback, **ctx.params)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return __callback(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return callback(**use_params) # type: ignore
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/text-tools/cli.py", line 99, in tika_text_extract
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - parse_document(
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/text-tools/cli.py", line 28, in parse_document
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - parsed_pdf = parser.from_file(ip, xmlContent=True, requestOptions={"headers": headers, "timeout": timeout})
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/tika/parser.py", line 42, in from_file
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - output = parse1(service, filename, serverEndpoint, services={'meta': '/meta', 'text': '/tika', 'all': '/rmeta/xml'},
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 336, in parse1
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - status, response = callServer('put', serverEndpoint, service, f,
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 531, in callServer
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - serverEndpoint = checkTikaServer(scheme, serverHost, port, tikaServerJar, classpath, config_path)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 601, in checkTikaServer
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - raise RuntimeError("Unable to start Tika server.")
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - RuntimeError: Unable to start Tika server.
I solved it by raising TIKA_STARTUP_MAX_RETRY to 5 (the log above shows the client giving up after 3 tries). The Tika server took longer to start when many executions were running at the same time, so it needed more startup retries.
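For anyone wanting to apply the same fix from Python, here is a minimal sketch. It assumes the tika-python client reads TIKA_STARTUP_MAX_RETRY from the environment when the `tika` module is imported, so the variable has to be set before that import. The helper name `set_tika_startup_retries` is hypothetical and not part of our code or of the tika library.

```python
import os


def set_tika_startup_retries(max_retry: int = 5) -> str:
    """Export TIKA_STARTUP_MAX_RETRY for the tika-python client.

    Hypothetical helper: call it before `import tika`, on the assumption
    that the library reads the variable once, at import time.
    """
    if max_retry < 1:
        raise ValueError("max_retry must be at least 1")
    value = str(max_retry)
    os.environ["TIKA_STARTUP_MAX_RETRY"] = value
    return value


# Raise the number of startup checks from 3 (seen in the log) to 5.
set_tika_startup_retries(5)

# Only now import the client, so it picks up the new value:
# from tika import parser
```

In our setup the extraction runs inside the pod, so the variable could presumably also be set in the image itself or passed through the operator's `env_vars` argument, for example `env_vars={"TIKA_STARTUP_MAX_RETRY": "5"}`. I have not compared the two options.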