amazon-web-servicesamazon-s3amazon-athenaprestoamazon-s3-select

Query S3 in parallel with SQL and partitioning


Is it possible to make simplest concurrent SQL queries on S3 file with partitioning?

The problem it looks like you have to choose 2 options from 3.

You can make concurrent SQL queries against S3 with S3 Select. But S3 Select doesn't support partitioning, it also works on single file at a time.

Athena support partitioning and SQL queries, but it has limit of 20 concurrent queries. Limit could be increased, but there is no guarantees and uper line.

You can configure HBase that works on S3 through EMRFS, but that requires to much configurations. And I suppose data should be written through HBase (another format). Maybe more simple solution?


Solution

  • You can also use such managed services like AWS Glue or AWS EMR.

    Example code which you can run in Glue:

    import sys 
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    def load_dict(_database,_table_name):
        ds = glueContext.create_dynamic_frame.from_catalog(database = _database, table_name = _table_name, transformation_ctx = "ds_table")
        df = ds.toDF()
        df.createOrReplaceTempView(_table_name)   
        return df
    
    df_tab1=load_dict("exampledb","tab1")
    df_sql=spark.sql( "select m.col1, m.col2 from tab1 m")
    df_sql.write.mode('overwrite').options(header=True, delimiter = '|').format('csv').save("s3://com.example.data/tab2")
    
    job.commit()
    

    You can also consider to use Amazon Redshift Spectrum.

    https://aws.amazon.com/blogs/big-data/amazon-redshift-spectrum-extends-data-warehousing-out-to-exabytes-no-loading-required/