I have this code snippet that I ran locally in standalone mode on only 100 records:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()  # SparkContext required by GlueContext
glue_context = GlueContext(sc)
glue_df = glue_context.create_dynamic_frame.from_catalog(database=db, table_name=table)
df = glue_df.toDF()
print(df.count())
The schema contains 89 columns, all of string data type except for 5 columns that are arrays of structs. The data size is 3.1 MB.
Also, here is some info about the environment used to run the code:
The problem is that I can't figure out why stage 1 took 12 minutes to finish when it only has to count 100 records. I also can't work out what the "Scan parquet" and "Exchange" tasks mean, as shown in this image: Stage 1 DAG Visualization
My question is: is there a more systematic way to understand what those tasks mean? As a beginner I have relied heavily on the Spark UI, but it doesn't give much information about the tasks it has executed. I was able to find which task took the most time, but I have no idea why that is or how to resolve it systematically.
The running time of Spark code is made up of the cluster kick-off time, the DAG scheduler optimisation time, and the time spent running the stages. In your case, the slowness could come from any of those phases.
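As a minimal sketch (assuming the same df DataFrame from your snippet), you can time the count action on its own and compare it with the job's total wall-clock time, to see how much of the 12 minutes is start-up and planning overhead rather than stage execution:

import time

start = time.time()                      # time only the count action itself
row_count = df.count()                   # triggers the scan and aggregation stages
print(f"count = {row_count}, took {time.time() - start:.1f}s")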
To get more clarity on the Spark stages, use the explain function and read the resulting plan. Its output lets you see and compare the Analyzed Logical Plan, the Optimized Logical Plan, and the Physical Plan that the internal optimiser has produced.
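As a small sketch (assuming the same df DataFrame as above), the extended flag prints all of those plans at once:

# extended=True prints the Parsed, Analyzed and Optimized Logical Plans
# as well as the Physical Plan; the default prints only the Physical Plan.
df.explain(extended=True)

# On Spark 3.x you can also pick a specific output format, e.g.:
# df.explain(mode="formatted")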
For a more detailed description of the explain function, please visit this LINK.