Currently using PySpark on Databricks Interactive Cluster (with Databricks-connect to submit jobs) and Snowflake as Input/Output data.
My Spark application is supposed to read data from Snowflake, apply some simple SQL transformations (mainly F.when.otherwise, narrow transformation) , then load it back to Snowflake. (FYI, schema are passed to Snowflake reader & writer)
EDIT : There's also an sort transformation at the end of the process, before writing.
For testing purpose, I named my job like this: (Writer, and Reader are supposed to be named)
sc.setJobDescription("Step Snowflake Reader")
I have trouble understanding what the Spark UI is showing me :
So, I get 3 jobs, with all same jobs name (Writer). I can understand that I have only one Spark Action, so suppose to have one job, so Spark did name the jobs the last value set by sc.setJobDescription (Reader, which trigg spark compute).
I did also tag my "ReaderClass"
sc = spark.sparkContext
sc.setJobDescription("Step Snowflake Reader")
Why it doesn't show ?
Is the first job is like "Downloading Data from Snowflake", the second "Apply SQL transformation", then the third "Upload data to Snowflake" ?
Why all my jobs are related to same SQL Query ? What is Query 0 which is related to ... zero jobs ?
There are a few things in this. First thing, is that a job is triggered for an action, and tranformations are not really part of it (they're computed during an action, but a single action can do multiple transformations). In your case, reading, tranformation and sorting, all these steps would take place when the action is triggered
Please note that reading from snowflake doesn't trigger a job (this is an assumption as the same behaviour is exhibited by Hive) because snowflake already has the metadata which spark needs by traversing the files. If you'll read a parquet file directly, it'll trigger a different job, and you'll be able to see the job description.
Now comes the part of you naming your job
sc.setJobDescription("Step Snowflake Reader")
This will name the job that was triggered by your write action. And this action in turn is calling multiple jobs (but are still part of the last action you're performing, refer here for more details see this post
Similarly, the last configuration that you make before calling an action is picked up (Same thing happens for setting shufflePartition for instance, you may wanna have a perticular step with more or less shuffle, but for 1 complete action, it'll be set to a single value)
Hope this answers your question.