I have set up a local (Docker) Spark cluster as provided by Bitnami. Everything is running fine: I can submit simple Spark jobs and run them, and I can interact with pyspark.
Now I am trying to set up basic logging (my sources are Stack Overflow and GitHub gists). Mainly I see two ways. First:
log4jref = sc._jvm.org.apache.log4j
logger = log4jref.LogManager.getLogger(__name__)
If I use this in the pyspark console, I at least see the output in the console:
>>> logger.info("something")
24/04/15 14:24:31 INFO __main__: something
If I use the "cleaner solution":
import logging
logger = logging.getLogger(__name__)
logger.info("here we go")
I do not see anything returned in the pyspark console.
In both cases I do not see my outputs in docker logs spark-spark-worker-1,
nor on the master in docker logs spark-spark-1
(other outputs are there).
I also tried looking in /opt/bitnami/spark/logs.
So my question is: where can I find my log messages in general? And which of the two ways is better? As a first step I just need a quick way to get feedback from my running code; making it production ready can come later (if it is not easily possible right from the start).
Little update:
The job that I am running for testing looks like this:
from pyspark.sql import SparkSession
import logging

# Create a SparkSession
spark = SparkSession.builder \
    .appName("My App") \
    .getOrCreate()

log = logging.getLogger(__name__)
rdd = spark.sparkContext.parallelize(range(1, 100))
log.error("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA")
log.error(f"THE SUM IS HERE: {rdd.sum()}")
print(f"THE SUM IS HERE: {rdd.sum()}")
print("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA")
# Stop the SparkSession
spark.stop()
While I can see in the logs that this job runs successfully, I can't see my log entries nor the print output. All these places seem to show the same log.
If you define a logger in this way you get two logging systems. Assume the PySpark script is called myApp.py and gets submitted with spark-submit myApp.py:
from pyspark.sql import SparkSession
import logging
# Create a SparkSession
spark = SparkSession.builder \
    .appName("My App") \
    .getOrCreate()
# define your logger
logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)
#sc = spark.sparkContext
#sc.setLogLevel("WARN")
rdd = spark.sparkContext.parallelize(range(1, 100))
logger.error("MY LOGGER THE SUM IS HERE: %s", rdd.sum())
print(f"THE SUM IS HERE: {rdd.sum()}")
# Stop the SparkSession
spark.stop()
Then you get log messages from two logging systems:
- PySpark's own log4j logger (which you would control via sc = spark.sparkContext and sc.setLogLevel("WARN"))
- the logger defined in the script using the logging module (configured with logging.basicConfig(...))
The script's log message will appear as
ERROR:__main__:MY LOGGER THE SUM IS HERE: 4950
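If you keep this two-logger setup, you can at least make the Python-side messages resemble Spark's log4j output by passing a format string to logging.basicConfig. This is only a sketch using the standard logging module; the exact format string is my assumption, not something Spark provides:
import logging

# Roughly mimic Spark's default pattern: yy/MM/dd HH:mm:ss LEVEL logger: message
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    datefmt="%y/%m/%d %H:%M:%S",
    level=logging.WARNING,
)

logger = logging.getLogger(__name__)
logger.error("MY LOGGER THE SUM IS HERE: %s", 4950)
# prints something like: 25/01/02 09:01:30 ERROR __main__: MY LOGGER THE SUM IS HERE: 4950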
If you want to use a single, unified logging system in your PySpark applications, modify the script like this:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("My App") \
    .getOrCreate()
sc = spark.sparkContext
log4jLogger = sc._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger(__name__)
logger.setLevel(log4jLogger.Level.WARN) # Set desired logging level
#sc = spark.sparkContext
#sc.setLogLevel("WARN")
rdd = spark.sparkContext.parallelize(range(1, 100))
logger.error(f"LOG THE SUM IS HERE: {rdd.sum()}")
print(f"THE SUM IS HERE: {rdd.sum()}")
# Stop the SparkSession
spark.stop()
Now your log messages look like
25/01/02 09:01:30 ERROR __main__: LOG THE SUM IS HERE: 4950
and have the same format as PySpark's log messages (as defined in $SPARK_HOME/conf/log4j2.properties).
Note that you can still configure the log level of PySpark's log4j logger separately from the log level of the application logger.
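For example (just a sketch, with illustrative level names), you could quiet Spark's internal logging while keeping your own logger more verbose:
sc = spark.sparkContext

# Lower the verbosity of Spark's internal log4j output...
sc.setLogLevel("ERROR")

# ...while keeping the application logger at a more detailed level.
log4jLogger = sc._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger(__name__)
logger.setLevel(log4jLogger.Level.INFO)
logger.info("this is still logged although Spark internals only report errors")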