I am doing the following in my Databricks code. The code tries to delete some rows from my table under conditions which seem simple to me.
from datetime import datetime, timedelta

look_back_days = 1  # in my notebook this comes from a parameter
today = datetime.today()
yesterday = today - timedelta(days=int(look_back_days))
yesterday_str = yesterday.strftime('%Y/%m/%d')
query = f"""DELETE FROM table_name a WHERE AWB_ID IN (SELECT * FROM a WHERE a.DW_CRT_RUN_DT >= '{yesterday_str}' or a.dw_upd_run_dt >= '{yesterday_str}' and a.awb_Stus_cd = 'VOID' and a.dw_upd_pgm_id = 'DSPD0580')"""
spark.sql(query)
However, I got the following error.
AnalysisException: Multi-column In predicates are not supported in the DELETE condition
I also tried what @kombajn-zbożowy suggested below, but it still throws the same error.
My own answer, after a couple of years of working in Databricks:
Spark SQL does not support multi-column subqueries (e.g., SELECT *) in an IN clause of a DELETE statement. You need to rewrite the query so the subquery returns only the column being matched, or restructure it with a join/EXISTS.
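For reference, the minimal change is to select only the join column in the subquery. This is a sketch using the table and column names from the question (a placeholder date stands in for the computed one); note it also parenthesizes the OR, which the original query was missing since AND binds tighter than OR in SQL:

```python
# Sketch: single-column IN form of the same DELETE (names from the question).
# Selecting only AWB_ID in the subquery avoids the multi-column IN restriction.
yesterday_str = '2024/01/01'  # placeholder for the value computed above

query = f"""
DELETE FROM table_name a
WHERE a.AWB_ID IN (
    SELECT b.AWB_ID
    FROM table_name b
    WHERE (b.DW_CRT_RUN_DT >= '{yesterday_str}'
           OR b.dw_upd_run_dt >= '{yesterday_str}')
      AND b.awb_Stus_cd = 'VOID'
      AND b.dw_upd_pgm_id = 'DSPD0580'
)
"""
print(query)
```

Run it with spark.sql(query) as before.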
A possible workaround is:
query = f"""
DELETE FROM table_name a
WHERE EXISTS (
    SELECT 1
    FROM table_name b
    WHERE b.AWB_ID = a.AWB_ID
      AND (b.DW_CRT_RUN_DT >= '{yesterday_str}'
           OR b.dw_upd_run_dt >= '{yesterday_str}')
      AND b.awb_Stus_cd = 'VOID'
      AND b.dw_upd_pgm_id = 'DSPD0580'
)
"""
spark.sql(query)
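On a Delta table the same delete can also be written as a MERGE with WHEN MATCHED THEN DELETE. This is only a sketch under the assumption that table_name is a Delta table, with the same names as above and a placeholder date:

```python
# Sketch: MERGE-based delete (assumes table_name is a Delta table).
yesterday_str = '2024/01/01'  # placeholder for the computed look-back date

query = f"""
MERGE INTO table_name a
USING (
    SELECT DISTINCT b.AWB_ID
    FROM table_name b
    WHERE (b.DW_CRT_RUN_DT >= '{yesterday_str}'
           OR b.dw_upd_run_dt >= '{yesterday_str}')
      AND b.awb_Stus_cd = 'VOID'
      AND b.dw_upd_pgm_id = 'DSPD0580'
) s
ON a.AWB_ID = s.AWB_ID
WHEN MATCHED THEN DELETE
"""
print(query)
```

DISTINCT in the source keeps the MERGE from failing when several qualifying rows share the same AWB_ID.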