apache-sparkpysparkapache-spark-sql

Does Spark exceptAll() requires both dataframe columns to be in same order?


I have wasted a considerable amount of time trying to make exceptAll() pyspark function, and as far as I understood it was failing (not recognizing existing on target table) due to the fact that both dataframes column order was slightly different. Therefore, I want to confirm and further understand

  1. Does PySpark exceptAll() function requires both dataframe to have same column order?
  2. List iteIsn't it intelligent enought to map same column names?

Thanks


Solution

  • Yes, you are correct. The below would work.

    # Simulating data, could be from file
    data1 = [("abc", 1), ("def", 2), ("xyz", 26)]
    cols1 = ["val1", "val2"]
    data2 = [(1, "abc"), (2, "def"), (26, "xyz")]
    cols2 = ["val2", "val1"]
    df1 = spark.createDataFrame(data1, cols1)
    df2 = spark.createDataFrame(data2, cols2)
    
    # Get col names from both DF's, nice little feature 
    columns_df1 = df1.columns
    columns_df2 = df2.columns
    
    # Sorting, no renaming done, but that could be needed as well in some cases, e.g. withColumnRenamed
    df1 = df1.select(sorted(columns_df1))
    df2 = df2.select(sorted(columns_df2))
    
    # Apply except
    df1.exceptAll(df2).show()
    #df1.show()
    #df2.show()
    

    This below would not work. You would need rename cols.

    df1 = spark.createDataFrame(data1, colsA)
    df2 = spark.createDataFrame(data2, cols2)
    

    Data types that differ can be compared though.