apache-spark, pyspark, airflow, airflow-2.x

Airflow 2.6.1: setting the log level for specific modules to WARN does not work


I'm running Spark 3.4.1 tasks scheduled by Airflow 2.6.1 with the SparkSubmitOperator. Spark runs in cluster mode, so I don't get explicit logs from the Spark driver. Instead, the spark_submit.py job polls the Spark driver pod and reports whether the job has finished.

The Airflow log is full of entries like the following:

[2024-07-29, 06:47:33 UTC] {spark_submit.py:523} INFO - 24/07/29 08:47:33 INFO LoggingPodStatusWatcherImpl: Application status for spark-dc8c170895df4383be2c6933606ee764 (phase: Running)

I would like to get rid of these INFO log entries from the spark_submit.py module only (Python module: from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator). I found the following on the Apache Airflow website:

For Airflow 2.6.1: https://airflow.apache.org/docs/apache-airflow/2.6.1/administration-and-deployment/logging-monitoring/logging-tasks.html#advanced-configuration
For Airflow 2.9.3: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/advanced-logging-configuration.html#
And examples: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/advanced-logging-configuration.html#custom-logger-for-operators-hooks-and-tasks

Thanks!

I tried to apply this to my setup and created a logging config for the SparkSubmitOperator:

from copy import deepcopy
from pydantic.utils import deep_update
from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deep_update(
    deepcopy(DEFAULT_LOGGING_CONFIG),
    {
        "loggers": {
            "airflow.providers.apache.spark.operators.spark_submit": {
                "handlers": ["task"],
                "level": "WARNING",
                "propagate": True,
            },
        }
    },
)
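Incidentally, the deep merge does not have to come from pydantic (deep_update is a pydantic 1.x helper in pydantic.utils and, as far as I know, that import path changed in pydantic 2). A standard-library-only sketch of the same config, assuming nothing else about the setup changes:

from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

# Add a dedicated logger entry for the module whose INFO output should be suppressed.
LOGGING_CONFIG["loggers"]["airflow.providers.apache.spark.operators.spark_submit"] = {
    "handlers": ["task"],
    "level": "WARNING",
    "propagate": True,
}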

Then I added the following line to airflow.cfg:

...
# Logging class
# Specify the class that will specify the logging configuration
# This class has to be on the python classpath
# Example: logging_config_class = my.path.default_local_settings.LOGGING_CONFIG
logging_config_class = log_conf.LOGGING_CONFIG
...

The files are stored as follows:

airflow.cfg: /opt/airflow
log_conf.py: /opt/airflow/config
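As a sanity check that the module resolves the way Airflow loads it at startup, one can try the import from a Python shell inside the scheduler container. This is just a diagnostic sketch, assuming the paths above:

# Importing airflow first should add the config folder under AIRFLOW_HOME to sys.path,
# mirroring how logging_config_class is resolved at startup.
import airflow  # noqa: F401

from log_conf import LOGGING_CONFIG

# The custom per-module logger should show up next to Airflow's defaults.
print("airflow.providers.apache.spark.operators.spark_submit" in LOGGING_CONFIG["loggers"])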

I restarted the whole Airflow application (the Airflow scheduler, the Airflow UI, and the Postgres DB run in individual containers within a Kubernetes pod) and saw the following log line:

[2024-07-29T09:24:45.193+0200] {logging_config.py:47} INFO - Successfully imported user-defined logging config from log_config.LOGGING_CONFIG

However, the INFO-level entries from spark_submit.py still appear, even though I changed the files above and restarted the whole Airflow application.
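One way to narrow this down is to print the effective level of the candidate loggers from inside a running task (for example via a small PythonOperator callable). This is only a diagnostic sketch using the standard logging API:

import logging

def dump_spark_submit_logger_levels():
    # Check both candidate loggers; whichever still reports INFO is the one
    # the custom config did not reach.
    for name in (
        "airflow.providers.apache.spark.operators.spark_submit",
        "airflow.providers.apache.spark.hooks.spark_submit",
    ):
        logger = logging.getLogger(name)
        print(name, logging.getLevelName(logger.getEffectiveLevel()))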

My questions:

  1. Why do the INFO-level entries from spark_submit.py still appear in the Airflow logs?
  2. Is the content of log_conf.py even compatible with Airflow 2.6.1? I ask because https://airflow.apache.org/docs/apache-airflow/2.6.1/administration-and-deployment/logging-monitoring/logging-tasks.html#advanced-configuration gives no examples.

Solution

  • I think the module to configure is airflow.providers.apache.spark.hooks.spark_submit and not airflow.providers.apache.spark.operators.spark_submit.

    This is going by the fact that this test checks that the hook logs a message containing 'LoggingPodStatusWatcherImpl'.
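    Applied to the config above, only the logger name would change. A sketch of the adjusted log_conf.py under that assumption (the SparkSubmitOperator delegates the actual spark-submit run, and the logging of its output, to SparkSubmitHook):

    from copy import deepcopy
    from pydantic.utils import deep_update
    from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

    LOGGING_CONFIG = deep_update(
        deepcopy(DEFAULT_LOGGING_CONFIG),
        {
            "loggers": {
                # The hook's logger is the one expected to emit the
                # 'LoggingPodStatusWatcherImpl' status lines.
                "airflow.providers.apache.spark.hooks.spark_submit": {
                    "handlers": ["task"],
                    "level": "WARNING",
                    "propagate": True,
                },
            }
        },
    )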