Where do you find print statements from your AWS Glue ETL jobs? This is killing me. Why is this not the easiest thing to find?
I am trying to inspect properties of my tables and do some general debugging in the console for an AWS Glue ETL job. Throughout the job I log some things and print some things. The built-in functions that print a dynamic frame's schema return None, though, so I can't easily embed their output into a log string. Here is the gist of my job:
import some_stuff
...
# Create and join tables
customer_churn = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_customer_churn)
customer_churn = customer_churn.join(paths1=["customer id"], paths2=["id"], frame2=other_table)
logger.info("Customer_churn_joined:")
customer_churn.printSchema()
# ---- Write out the combined file ----
s_customer_churn = customer_churn.toDF().select("customer id")
logger.info("Customer_churn_just_cust_id:")
s_customer_churn.printSchema()
s_customer_churn.write.option("header", "true").format("csv").mode("overwrite").save(output_dir)
logger.info("output_dir:" + output_dir)
I looked in the continuous logging tab and I see my logging statements, but no print statements come through. I saw the Output logs going to CloudWatch (screenshot, bottom-right), so I clicked that link, but none of the log streams had my print statements. Why is this not the easiest thing to see?
All logs
logger.info() messages go here. Look in the log stream without a suffix; there are a lot of items logged there, so you will have to search.
Output logs
print(), dataframe.printSchema(), and dataframe.show() output goes here. Again, look in the log stream without a suffix. The streams with a suffix come from the individual workers, and how many there are depends on the number of workers defined for the job.
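If you would rather have the schema land next to your logger.info() output in All logs instead of hunting for it in Output logs, one option is to capture what printSchema() writes to stdout and pass that string to the logger. This is a minimal sketch of the capture technique using only the standard library; capture_stdout is a helper name I made up, and the df/logger calls in the comment assume the DataFrame and Glue logger from the question.

```python
import io
from contextlib import redirect_stdout

def capture_stdout(fn, *args, **kwargs):
    """Run fn and return whatever it printed to stdout as a string."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        fn(*args, **kwargs)
    return buf.getvalue()

# With the DataFrame and logger from the question you could do, e.g.:
#   logger.info("schema:\n" + capture_stdout(df.printSchema))
# Demonstrated here with plain print(), which writes to stdout the same way:
captured = capture_stdout(print, "root\n |-- id: string")
```

For a Spark DataFrame you can also skip the capture entirely and log df.schema.simpleString(), which returns the schema as a one-line string instead of printing it.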
Example
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import Row
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
logger = glueContext.get_logger()
logger.info('Hello from logger.info will be in All Logs')
print('print will show up in output log')
testDf = spark.createDataFrame([Row(test_data='dataframe printSchema() and show() will be in the output log')])
testDf.printSchema()
testDf.show()
job.commit()
Open All logs and search using the search box. Just be aware that you might have to scroll to the top and click "There are older events to load. Load more." before you find it.