I'm using PySpark with HiveWarehouseConnector on an HDP3 cluster. The schema changed, so I updated my target table with an "alter table" command, which by default added the new columns at the end of the table. Now I'm trying to save a Spark DataFrame to it with the following code, but the columns in the DataFrame are in alphabetical order and I'm getting the error message below:
df = spark.read.json(df_sub_path)
hive.setDatabase('myDB')
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").mode('append').option('table','target_table').save()
and the error message traces back to:
Caused by: java.lang.IllegalArgumentException: Hive column: column_x cannot be found at same index: 77 in dataframe. Found column_y. Aborting as this may lead to loading of incorrect data.
Is there a dynamic way to append the DataFrame so that its columns land in the correct positions in the Hive table? This matters because I expect more columns to be added to the target table.
You can read the target table without any rows to get its columns. Then, using select, you can put the DataFrame's columns in the same order and append it:
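# Fetch zero rows from the target table, only to learn its column order
target = hive.executeQuery('select * from target_table where 1=0')
# Reorder the DataFrame's columns to match the table, then append as before
df = df.select(target.columns)
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").mode('append').option('table','target_table').save()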
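A plain select on the target's columns fails if the DataFrame is missing a column that the table already has, or if the names differ only in case (Hive reports lowercase names, while JSON field names may not be lowercase). The following is a rough sketch of one way to handle that, not something the connector provides: `plan_column_order` and `align_to_target` are hypothetical helper names, and filling absent columns with typed nulls is an assumption about what you want to happen.

```python
def plan_column_order(df_columns, target_columns):
    """Match DataFrame columns to target columns, ignoring case.

    Returns (ordered, missing, extra):
      ordered: list of (target_name, source_name or None) in target order
      missing: target columns with no counterpart in the DataFrame
      extra:   DataFrame columns that the target table does not have
    """
    by_lower = {name.lower(): name for name in df_columns}
    ordered, missing = [], []
    for col in target_columns:
        src = by_lower.get(col.lower())
        if src is None:
            missing.append(col)
        ordered.append((col, src))
    target_lower = {c.lower() for c in target_columns}
    extra = [c for c in df_columns if c.lower() not in target_lower]
    return ordered, missing, extra


def align_to_target(df, target):
    """Return df with the target's column order; absent columns become nulls.

    Hypothetical helper: `df` and `target` are Spark DataFrames, and `target`
    is the empty result of the `where 1=0` query above.
    """
    # Imported here so plan_column_order can be used without Spark installed
    from pyspark.sql import functions as F

    ordered, _missing, _extra = plan_column_order(df.columns, target.columns)
    target_types = dict(target.dtypes)  # column name -> type string
    exprs = [
        F.col(src).alias(col) if src is not None
        else F.lit(None).cast(target_types[col]).alias(col)
        for col, src in ordered
    ]
    return df.select(exprs)
```

You would then call `df = align_to_target(df, target)` before the write. Columns listed in `extra` are dropped silently by this sketch, so it may be worth logging them, since they usually mean the table needs another "alter table". Column names containing dots would need backtick quoting, which is not handled here.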