python pyspark delta-lake

Get EXPLAIN from Delta Lake MERGE in PySpark?


Using Python 3.10 and delta-spark 2.4.0, I need to see the execution plan of a MERGE operation in PySpark.

For a DataFrame operation, df.explain() provides it, but I have not found a way to see the physical plan of a merge().

Is there a method to see the equivalent of explain(mode="extended") for the following?

from delta.tables import DeltaTable

df = spark.sql("SELECT * FROM table")

tablePath = "/path/to/deltalake"

tbl = DeltaTable.forPath(spark, tablePath)

# condition is the merge predicate, e.g. "target.id = source.id" (hypothetical key column)
tbl.alias("target") \
    .merge(
        source=df.alias("source"),
        condition=condition) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

Solution

  • Issue #893, "Printing execution plan for merge operation with python API", requests exactly what you are asking for.

    Unfortunately, the maintainers decided not to implement it, for the reason given in #910 (comment):

    As the explain method, currently, cannot output the physical execution plan inside merge, we have decided to not add it right now. We can revisit this API when we have an approach to output the real execution plan in merge.

    As an alternative, you can inspect the query execution details in the Spark UI, where the queries that the merge runs are listed together with their plans, and watch the official Delta Lake and PySpark documentation for updates on exposing the physical plan of the merge operation.
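If you would rather pull the same information into Python than click through the Spark UI, the UI's REST API exposes the SQL executions, including a textual plan description. The following is a minimal sketch, not an official Delta Lake feature: the helper names fetch_sql_executions and select_plans are made up for illustration, and it assumes the driver UI is reachable from where the code runs and that each execution carries "id", "description" and "planDescription" fields. Which descriptions the merge's internal queries get is something to check in your own UI, so the keyword filter is optional.

```python
import json
from urllib.request import urlopen


def fetch_sql_executions(ui_url, app_id, opener=urlopen):
    """Return the SQL executions reported by the Spark UI REST API as a list of dicts."""
    url = (
        f"{ui_url.rstrip('/')}/api/v1/applications/{app_id}/sql"
        "?details=false&planDescription=true"
    )
    # opener is injectable so the function can be exercised without a live UI.
    with opener(url) as resp:
        return json.load(resp)


def select_plans(executions, keyword=""):
    """Return (id, description, plan) for executions whose description contains keyword.

    An empty keyword keeps every execution.
    """
    selected = []
    for e in executions:
        description = e.get("description") or ""
        if keyword.lower() in description.lower():
            selected.append((e.get("id"), description, e.get("planDescription") or ""))
    return selected


# Usage inside a live PySpark session, after .execute() has returned
# (assumes the Spark UI is enabled):
#   sc = spark.sparkContext
#   executions = fetch_sql_executions(sc.uiWebUrl, sc.applicationId)
#   for exec_id, desc, plan in select_plans(executions):
#       print(exec_id, desc, plan, sep="\n")
```

This only shows plans of queries that have already run, so it complements rather than replaces an explain() on the merge itself.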