apache-spark, logging, pyspark, qubole

Pyspark Logging: Printing information at the wrong log level


Thanks for your time!

I'd like to create and print legible summaries of my (hefty) data to the output while debugging my code, but stop creating and printing those summaries once I'm finished, to speed things up. I was advised to use logging, which I implemented. It works as expected for printing text strings as messages to the output; however, when printing summaries of DataFrames it seems to ignore the log level, always creating and printing them.

Is logging the right thing to be using, or is there a better way to do this? I could comment out the lines of code or use if statements, etc., but it's a large codebase and I know I'll need to make the same checks in the future as further elements are added. This seems like exactly what logging should be for.

from pyspark.sql.functions import col,count
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

df = spark.createDataFrame([(1,2),(3,4)],["COLA","COLB"])

print "1"
logger.setLevel(logging.DEBUG)
logger.debug("1 - DEBUG - Print the message and show the table")
logger.debug(df.show())

print "2"
logger.setLevel(logging.INFO)
logger.debug("2 - INFO - Don't print the message or show the table")
logger.debug(df.show())

print "3"
logger.setLevel(logging.INFO)
logger.debug("3 - INFO - Don't print the message or show the collected data")
logger.debug(df.collect())

print "4"
logger.setLevel(logging.DEBUG)
logger.debug("4 - DEBUG - Print the message and the collected data")
logger.debug(df.collect())

Output:

1
DEBUG:__main__:1 - DEBUG - Print the message and show the table
+----+----+
|COLA|COLB|
+----+----+
|   1|   2|
|   3|   4|
+----+----+
DEBUG:__main__:None
2
+----+----+
|COLA|COLB|
+----+----+
|   1|   2|
|   3|   4|
+----+----+
3
4
DEBUG:__main__:4 - DEBUG - Print the message and the collected data
DEBUG:__main__:[Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]

Solution

  • Logging is working as expected here: df.show() and df.collect() are Spark actions, so they execute as soon as they are evaluated as arguments, even inside logger.debug(...). In particular, df.show() prints the table to stdout itself and returns None, which is why the table always appears and the logger records None.

    One workaround is to store the result of collect()/take(n) in a variable first, and then pass that variable to your logging calls.

    from pyspark.sql.functions import col,count
    import logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    
    df = spark.createDataFrame([(1,2),(3,4)],["COLA","COLB"])
    
    # store the results; avoid collect() on huge datasets and use take(n) instead
    res = df.collect()

    # get at most 10 records from df
    res = df.take(10)
    
    print "1"
    #1
    logger.setLevel(logging.DEBUG)
    logger.debug("1 - DEBUG - Print the message and show the table")
    #DEBUG:__main__:1 - DEBUG - Print the message and show the table
    logger.debug(res)
    #DEBUG:__main__:[Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]
    
    print "2"
    #2
    logger.setLevel(logging.INFO)
    logger.debug("2 - INFO - Don't print the message or show the table")
    logger.debug(res) #this won't print because the log level is INFO
    logger.info("result: %s", res) #this will get printed out; %s formatting is deferred
    #INFO:__main__:result: [Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]
    

    Prefer .take(n) over .collect(): it only brings a bounded number of rows back to the driver.
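
    If the goal is also to skip the cost of the Spark action itself when debug output is switched off (not just to suppress the log line), one option is to guard the action with logger.isEnabledFor(). Below is a minimal sketch, reusing the df and logger defined above; the variable name sample is only for illustration.

    # Run the (potentially expensive) Spark action only when DEBUG is enabled.
    if logger.isEnabledFor(logging.DEBUG):
        sample = df.take(10)                # bounded sample, cheaper than collect()
        logger.debug("sample: %s", sample)  # %s formatting is deferred by logging

    # With the level set to INFO, neither take() nor the log formatting runs at all.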