I need to connect Spark to my Redshift instance to generate data. I am using Spark 1.6 with Scala 2.10, and I have used a compatible JDBC driver and the spark-redshift connector. But I am facing a weird problem. I am using pyspark:
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("query", "select top 10 * from fact_table") \
    .option("url", "jdbc:redshift://redshift_host:5439/events?user=username&password=pass") \
    .option("tempdir", "s3a://redshift-archive/") \
    .load()
When I do df.show(), it gives me a permission denied error on my bucket.
This is weird because I can see files being created in my bucket, but they cannot be read back.
P.S. I have also set the access key and secret access key.
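Something along these lines (a minimal sketch; the key values are placeholders):

# Sketch of setting S3 credentials on the Hadoop configuration (placeholder values).
# Note the scheme must match the tempdir: s3a is the newer Hadoop S3 client and
# reads fs.s3a.* properties, while the legacy s3n client reads
# fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey instead.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY")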
P.S. I am also confused between the s3a and s3n file systems. Connector used: https://github.com/databricks/spark-redshift/tree/branch-1.x
It seems the permission is not set for Redshift to access the S3 files. Please follow the steps below:

1. Create an IAM role in the Redshift account that Redshift can assume.
2. Grant permissions to access the S3 bucket to the newly created role.
3. Associate the role with the Redshift cluster.
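Once the role is associated with the cluster, the connector can authenticate to S3 through it instead of through keys, via the aws_iam_role option. A minimal sketch; the role ARN is a placeholder, and you should verify that your branch-1.x build of the connector supports this option (it is documented in the connector's README for later releases):

# Sketch: let Redshift use the associated IAM role for its COPY/UNLOAD to S3.
# The ARN below is a placeholder; substitute your account id and role name.
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("query", "select top 10 * from fact_table") \
    .option("url", "jdbc:redshift://redshift_host:5439/events?user=username&password=pass") \
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-s3-read") \
    .option("tempdir", "s3a://redshift-archive/") \
    .load()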