scala, apache-spark, apache-spark-sql, spark-shell

spark-shell: load an existing Hive table by partition?


In spark-shell, how do I load an existing Hive table, but only one of its partitions?

val df = spark.read.format("orc").load("mytable")

I'm looking for a way to load only one particular partition of this table.

Thanks!


Solution

  • There is no direct option for this in spark.read.format, but you can use a where condition:

    val df = spark.read.format("orc").load("mytable").where("your_partition_column = 'your_value'")
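
    where accepts either a SQL-style condition string or a Column expression; spark-shell pre-imports spark.implicits._, so the $ syntax works out of the box. A small sketch, using a hypothetical partition column named dt (not from the original question):

    // equivalent filters on a placeholder partition column `dt`
    val byString = spark.read.format("orc").load("mytable").where("dt = '2023-01-01'")
    val byColumn = spark.read.format("orc").load("mytable").where($"dt" === "2023-01-01")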
    

    Nothing is loaded until you perform an action: load (pointing to your ORC file location) is just a function in DataFrameReader, as shown below, and it doesn't read any data until an action is triggered.

    See DataFrameReader:

    def load(paths: String*): DataFrame = {
      ...
    }
    

    In the code above (spark.read. ... .where(...)), the where is just another transformation; specifying it still does not load the data immediately :-)

    Only when you call an action such as df.count is the partition filter applied to the ORC data paths, so only the matching partition is actually read.
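
    Putting it together, a minimal spark-shell sketch (the warehouse path and the partition column dt are placeholder assumptions, and it assumes the data is laid out in Hive-style dt=... directories so Spark can discover the partitions):

    // read the table's ORC location and filter on the hypothetical partition column
    val df = spark.read.format("orc")
      .load("/user/hive/warehouse/mytable")   // placeholder data path
      .where("dt = '2023-01-01'")             // still lazy: no files are read yet

    // explain() prints the physical plan; the FileScan node should list the filter
    // under PartitionFilters, i.e. partition pruning will happen at scan time
    df.explain()

    // only an action actually reads data, and only from the matching partition
    df.count()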