I am in SageMaker Studio and I have connected to a dataset via PyAthena:
from pyathena import connect
s3_query_results = 'my s3 Location'
region = 'eu-west-2'
workgroup='primary'
Then I have written the following SQL query code to extract a random 1% of the whole dataset I have connected to.
sql_query = '''
select *
from myDataMart
order by random()
limit 0.01 * (select count(*) from daphnis_client.mart_eligibility)
'''
Then I have run the query above as follows:
# Athena connect & execute SQL
cursor = connect(s3_staging_dir=s3_query_results, region_name=region, work_group=workgroup).cursor()
cursor.execute(sql_query)
But I get the following error:
DatabaseError: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Queries of this type are not supported
If I change the code to:
sql_query = '''
select *
from myDataMart
limit 1000;
'''
The code works fine. But I guess that those 1,000 records are not a random sample.
How can I extract a random 1% sample from the dataset I am querying and create a "sample weight" field?
If your SQL engine support window functions, you could try:
sql_query = '''
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY random()) rn,
COUNT(*) OVER () cnt
FROM myDataMart
) t
WHERE rn / cnt <= 0.01
'''