[scala] [apache-spark] [orc] [spark-hive]

External table is empty when ORC data is saved


I want to write ORC data into an external Hive table from a Spark DataFrame. When I save the DataFrame with saveAsTable, the data shows up in the existing external table. However, when I save the data in ORC format directly into the table's directory and then query the external table, no data is displayed.

What could be the reason for the missing data in the second case?

The code that works:

val dataDir = "/tmp/avro_data"
sql("CREATE EXTERNAL TABLE avro_random(name string, age int, phone string, city string, country string) STORED AS ORC LOCATION '$dataDir'")

dataframe
  .write
  .mode(SaveMode.Overwrite)
  .saveAsTable("avro_random")

sql("SELECT * FROM avro_random").show()

The code that returns an empty external table:

val dataDir = "/tmp/avro_data"
sql("CREATE EXTERNAL TABLE avro_random(name string, age int, phone string, city string, country string) STORED AS ORC LOCATION '$dataDir'")

dataframe
  .write
  .mode(SaveMode.Overwrite)
  .orc(dataDir)

sql("SELECT * FROM avro_random").show()

Solution

  • Only saveAsTable goes through Hive. If you use .orc(dataDir), Spark just writes ORC files to dataDir without Hive support, so the external table never shows the data.

    For the second case: if you read the ORC files back from dataDir with Spark itself, it works (an alternative that keeps a persistent table is sketched after this example):

    dataframe
      .write
      .mode(SaveMode.Overwrite)
      .orc(dataDir)
    
    // Read the ORC files back and expose them to SQL
    val newDf = spark.read.orc(dataDir)
    newDf.createOrReplaceTempView("avro_random_orc") // Register a temp view instead, so SQL queries can see the data
    
    sql("SELECT * FROM avro_random_orc").show()