Tags: pyspark, amazon-redshift, spark-redshift

Issue while connecting Spark to Redshift using the spark-redshift connector


I need to connect Spark to my Redshift instance to generate data. I am using Spark 1.6 with Scala 2.10, along with a compatible JDBC driver and the spark-redshift connector. But I am facing a weird problem. I am using PySpark:

# spark-redshift unloads the query results to tempdir on S3, then reads them back
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("query", "select top 10 * from fact_table") \
    .option("url", "jdbc:redshift://redshift_host:5439/events?user=username&password=pass") \
    .option("tempdir", "s3a://redshift-archive/") \
    .load()

When I call df.show(), it fails with a permission-denied error on my bucket. This is weird because I can see files being created in my bucket, but they cannot be read.

P.S. I have also set the access key and secret access key.

P.S. I am also confused between the s3a and s3n filesystems. Connector used: https://github.com/databricks/spark-redshift/tree/branch-1.x
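
A note on that last point: s3n and s3a are two different Hadoop S3 filesystem implementations, and they read credentials from different configuration properties, which is a common source of "keys are set but access is still denied" problems. A minimal sketch of setting both explicitly (the property names come from Hadoop; the key values are placeholders):

    # Hadoop configuration used by the underlying S3 filesystem; set the
    # properties matching the scheme used in "tempdir" (here, s3a)
    hadoop_conf = sc._jsc.hadoopConfiguration()

    # s3a property names
    hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    # s3n uses different property names
    hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

Note that these credentials only cover Spark's side of the exchange; Redshift itself also needs permission to read and write the tempdir, which is what the answer below addresses.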


Solution

  • It seems the permission is not set for Redshift to access the S3 files. Please follow the steps below (a sketch of steps 4 and 5 follows the list):

    1. Add a bucket policy to that bucket that allows the Redshift account access.
    2. Create an IAM role in the Redshift account that Redshift can assume.
    3. Grant the newly created role permission to access the S3 bucket.
    4. Associate the role with the Redshift cluster.
    5. Run COPY statements.
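
For step 4, a minimal sketch using boto3; the region, cluster identifier, and role ARN are placeholders you would replace with your own:

    import boto3

    # Attach the role to the cluster so Redshift can assume it
    # when reading from / writing to S3
    redshift = boto3.client("redshift", region_name="us-east-1")
    redshift.modify_cluster_iam_roles(
        ClusterIdentifier="my-redshift-cluster",
        AddIamRoles=["arn:aws:iam::123456789012:role/redshift-s3-access"],
    )

For step 5, the connector issues the COPY/UNLOAD statements itself. Later releases of spark-redshift accept an aws_iam_role option so those statements use the role instead of keys; this option may not exist in every 1.x build, so treat it as an assumption and check the README of the version you are running:

    # Sketch: assumes the connector version supports aws_iam_role;
    # the role ARN is a placeholder
    df = sqlContext.read \
        .format("com.databricks.spark.redshift") \
        .option("query", "select top 10 * from fact_table") \
        .option("url", "jdbc:redshift://redshift_host:5439/events?user=username&password=pass") \
        .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-s3-access") \
        .option("tempdir", "s3a://redshift-archive/") \
        .load()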