I want to write ORC data into an external Hive table from a Spark DataFrame. When I save the DataFrame with saveAsTable, the data ends up in the existing external table. However, when I save the data in ORC format into the table's directory and then query the external table, no rows come back.
What could be the reason for the missing data in the second case?
The code that works:
import org.apache.spark.sql.SaveMode

val dataDir = "/tmp/avro_data"
sql(s"CREATE EXTERNAL TABLE avro_random(name string, age int, phone string, city string, country string) STORED AS ORC LOCATION '$dataDir'")
dataframe
.write
.mode(SaveMode.Overwrite)
.saveAsTable("avro_random")
sql("SELECT * FROM avro_random").show()
The code that returns an empty external table:
val dataDir = "/tmp/avro_data"
sql("CREATE EXTERNAL TABLE avro_random(name string, age int, phone string, city string, country string) STORED AS ORC LOCATION '$dataDir'")
dataframe
.write
.mode(SaveMode.Overwrite)
.orc(dataDir)
sql("SELECT * FROM avro_random").show()
Only saveAsTable will work with Hive. If you use .orc(dataDir), Spark writes the ORC files to dataDir without Hive support, so nothing is registered with the Hive metastore.
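If you want to keep the files under dataDir while still going through the metastore, one variant of the first snippet (a sketch, assuming Spark 2.x with Hive support enabled; the table and path names are the ones from the question) is to give saveAsTable an explicit path, which makes Spark register the table as an unmanaged/external one:

import org.apache.spark.sql.SaveMode

dataframe
  .write
  .format("orc")              // keep the ORC storage format
  .option("path", dataDir)    // pin the table data to the external location
  .mode(SaveMode.Overwrite)
  .saveAsTable("avro_random") // registered in the Hive metastore, data stays in dataDir

sql("SELECT * FROM avro_random").show()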
As for the second case: if you read the ORC files back from dataDir with Spark itself, it will work:
dataframe
.write
.mode(SaveMode.Overwrite)
.orc(dataDir)
// New code is below
val newDf = spark.read.orc(dataDir)
newDf.createOrReplaceTempView("avro_random_orc") // register a temporary view so the data can be queried with SQL
sql("SELECT * FROM avro_random_orc").show()