sqlrandomamazon-sagemakersamplepyathena

Select a random a sample from PyAthena SQL


I am in SageMaker Studio and I have connected to a dataset via PyAthena:

from pyathena import connect
s3_query_results = 'my s3 Location'
region = 'eu-west-2'
workgroup='primary'

Then I have written the following SQL query code to extract a random 1% of the whole dataset I have connected to.

sql_query = '''

        select *
        from myDataMart
        order by random()
        limit 0.01 * (select count(*) from daphnis_client.mart_eligibility)
    
'''

Then I have run the query above as follows:

# Athena connect & execute SQL
cursor = connect(s3_staging_dir=s3_query_results, region_name=region, work_group=workgroup).cursor()

cursor.execute(sql_query)

But I get the following error:

DatabaseError: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Queries of this type are not supported

If I change the code to:

sql_query = '''

        select *
        from myDataMart
        limit 1000;
    
'''

The code works fine. But I guess that those 1,000 records are not a random sample.

How can I extract a random 1% sample from the dataset I am querying and create a "sample weight" field?


Solution

  • If your SQL engine support window functions, you could try:

    sql_query = '''
    
        SELECT *
        FROM (
            SELECT *, ROW_NUMBER() OVER (ORDER BY random()) rn,
                      COUNT(*) OVER () cnt
            FROM myDataMart
        ) t
        WHERE rn / cnt <= 0.01
    
    '''