apache-spark · pyspark · apache-spark-sql

How to specify the path where saveAsTable saves files to?


I am trying to save a DataFrame to S3 in PySpark (Spark 1.4) using DataFrameWriter:

df = sqlContext.read.format("json").load("s3a://somefile")
df_writer = pyspark.sql.DataFrameWriter(df)   # same writer that df.write returns
df_writer.partitionBy('col1')\
         .saveAsTable('test_table', format='parquet', mode='overwrite')

The Parquet files went to "/tmp/hive/warehouse/....", which is a local temp directory on my driver.

I did set hive.metastore.warehouse.dir in hive-site.xml to an "s3a://...." location, but Spark doesn't seem to respect my Hive warehouse setting.
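
For reference, the property in my hive-site.xml looks roughly like this (the bucket name below is a placeholder):

    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>s3a://bucket/warehouse</value>
    </property>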


Solution

  • Use the path option:

    df_writer.partitionBy('col1')\
             .saveAsTable('test_table', format='parquet', mode='overwrite',
                          path='s3a://bucket/foo')
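
    This writes the table data under s3a://bucket/foo instead of the warehouse
    directory. For completeness, a minimal sketch of the same call written
    against df.write (the writer Spark 1.4 attaches to every DataFrame), plus a
    quick read-back check; the path reuses the placeholder from above:

        # Equivalent form using the DataFrame's own writer.
        df.write.partitionBy('col1') \
            .saveAsTable('test_table', format='parquet', mode='overwrite',
                         path='s3a://bucket/foo')

        # Sanity check: read it back through the metastore and straight
        # from the path we supplied.
        sqlContext.table('test_table').count()
        sqlContext.read.parquet('s3a://bucket/foo').count()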