I have wasted a considerable amount of time trying to make exceptAll() pyspark function, and as far as I understood it was failing (not recognizing existing on target table) due to the fact that both dataframes column order was slightly different. Therefore, I want to confirm and further understand
Thanks
Yes, you are correct. The below would work.
# Simulating data, could be from file
data1 = [("abc", 1), ("def", 2), ("xyz", 26)]
cols1 = ["val1", "val2"]
data2 = [(1, "abc"), (2, "def"), (26, "xyz")]
cols2 = ["val2", "val1"]
df1 = spark.createDataFrame(data1, cols1)
df2 = spark.createDataFrame(data2, cols2)
# Get col names from both DF's, nice little feature
columns_df1 = df1.columns
columns_df2 = df2.columns
# Sorting, no renaming done, but that could be needed as well in some cases, e.g. withColumnRenamed
df1 = df1.select(sorted(columns_df1))
df2 = df2.select(sorted(columns_df2))
# Apply except
df1.exceptAll(df2).show()
#df1.show()
#df2.show()
This below would not work. You would need rename cols.
df1 = spark.createDataFrame(data1, colsA)
df2 = spark.createDataFrame(data2, cols2)
Data types that differ can be compared though.