python, docker, logging, pyspark

PySpark: where to find logs and how to log properly


I have set up a local (Docker) Spark cluster as provided by Bitnami. Everything is running fine: I can submit simple Spark jobs and run them, and I can interact with pyspark.

Now I am trying to set up basic logging (my sources are Stack Overflow and a GitHub gist). Mainly I see two ways. The first:

log4jref = sc._jvm.org.apache.log4j
logger = log4jref.LogManager.getLogger(__name__)

If I use this in the pyspark console, I at least see the output in the console:

>>> logger.info("something")
24/04/15 14:24:31 INFO __main__: something

If I use the "cleaner solution":

import logging
logger = logging.getLogger(__name__)
logger.info("here we go")

I do not see anything printed in the pyspark console.

In both cases I do not see my output in docker logs spark-spark-worker-1, nor on the master via docker logs spark-spark-1 (other output is there). I also looked in /opt/bitnami/spark/logs.

So my question is: where do I find my log messages in general? And which of the two ways is better? As a first step I just need a quick way to get feedback from my running code; making it production-ready can come later (unless that is easily possible right from the start).

A little update:

The test job that I am running looks like this:

from pyspark.sql import SparkSession
import logging

# Create a SparkSession
spark = SparkSession.builder \
   .appName("My App") \
   .getOrCreate()

log = logging.getLogger(__name__)

rdd = spark.sparkContext.parallelize(range(1, 100))
log.error("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA")
log.error(f"THE SUM IS HERE: {rdd.sum()}")
print(f"THE SUM IS HERE: {rdd.sum()}")
print("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA")
# Stop the SparkSession
spark.stop()

While I can see in the logs that this job runs successfully, I can't see my log entries or the print output. All of these places seem to show the same log.


Solution

  • If you define a logger with Python's logging module (your "cleaner solution"), you get two separate logging systems. Note that a bare logging.getLogger(__name__) shows nothing for logger.info(...) because the root logger defaults to the WARNING level until you configure it, e.g. with logging.basicConfig.

    Scenario 1: two logging systems

    Assume the PySpark script is called myApp.py and gets submitted with spark-submit myApp.py:

    from pyspark.sql import SparkSession
    import logging
    
    # Create a SparkSession
    spark = SparkSession.builder \
       .appName("My App") \
       .getOrCreate()
    
    
    # define your logger
    logging.basicConfig(level=logging.WARN)
    logger = logging.getLogger(__name__)
    
    #sc = spark.sparkContext
    #sc.setLogLevel("WARN")
    
    rdd = spark.sparkContext.parallelize(range(1, 100))
    logger.error("MY LOGGER THE SUM IS HERE: %s", rdd.sum())
    print(f"THE SUM IS HERE: {rdd.sum()}")
    
    # Stop the SparkSession
    spark.stop()
    

    Then you get log messages from two logging systems: Spark's own log4j output (the usual timestamped INFO/WARN/ERROR lines) and the output of Python's logging module, each with its own format and configuration.

    The script's log message will appear as

    ERROR:__main__:MY LOGGER THE SUM IS HERE: 4950
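
    If you stay with Scenario 1, you can at least make the two outputs look alike. The following is a minimal sketch (not part of the original script); the format string only approximates Spark's default layout, which is ultimately defined by your $SPARK_HOME/conf/log4j2.properties:

    import logging

    # Sketch: give Python's logging output a layout similar to Spark's
    # default log4j pattern (timestamp, level, logger name, message).
    logging.basicConfig(
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        datefmt="%y/%m/%d %H:%M:%S",
        level=logging.WARNING,
    )

    logger = logging.getLogger(__name__)
    logger.error("MY LOGGER THE SUM IS HERE: %s", 4950)
    # -> 25/01/02 09:01:30 ERROR __main__: MY LOGGER THE SUM IS HERE: 4950

    The two systems still remain completely separate, though: this only changes how the Python side is formatted, not where it ends up.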

    Scenario 2: unified logging

    If you want to use a single, unified logging system in your PySpark applications, modify the script like this:

    from pyspark.sql import SparkSession
    
    # Create a SparkSession
    spark = SparkSession.builder \
       .appName("My App") \
       .getOrCreate()
    
    sc = spark.sparkContext
    log4jLogger = sc._jvm.org.apache.log4j
    logger = log4jLogger.LogManager.getLogger(__name__)
    logger.setLevel(log4jLogger.Level.WARN)  # Set desired logging level
    
    #sc = spark.sparkContext
    #sc.setLogLevel("WARN")
    
    rdd = spark.sparkContext.parallelize(range(1, 100))
    logger.error(f"LOG THE SUM IS HERE: {rdd.sum()}")
    print(f"THE SUM IS HERE: {rdd.sum()}")
    
    # Stop the SparkSession
    spark.stop()
    

    Now your log messages look like

    25/01/02 09:01:30 ERROR __main__: LOG THE SUM IS HERE: 4950

    and have the same format as PySpark's own log messages (as defined in $SPARK_HOME/conf/log4j2.properties).

    Note that you can still configure the log level of PySpark's log4j logger separately from the log level of the application logger.
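
    For instance (a sketch building on the Scenario 2 script; the name app_logger is only illustrative, and the exact effect depends on your log4j2.properties), you can silence Spark's internal logging while keeping your own logger more verbose:

    sc = spark.sparkContext

    # Lower Spark's own verbosity (this adjusts the root log level)
    sc.setLogLevel("ERROR")

    # Keep the application logger at a more verbose level; a logger with an
    # explicitly set level does not inherit the root logger's level.
    log4jLogger = sc._jvm.org.apache.log4j
    app_logger = log4jLogger.LogManager.getLogger(__name__)
    app_logger.setLevel(log4jLogger.Level.INFO)

    app_logger.info("still visible at INFO while Spark itself only logs ERROR")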