ramazon-web-servicesaws-sdkamazon-athenar-paws

How can I get the full result when querying AWS Athena through the paws package in R?


I have been trying to get data from Athena through the paws package into R. So far I have successfully gotten the query to run and return some results. But I get the default max of 1000. I've seen some solutions out there for the Boto3 library in Python, but even though the syntax is similar, I can call a pagination function like the one Boto3 has.

Anyone knows how to do pagination, or how to use the next token argument of the function?

Here is what my code is looking like:

SDK for connecting to AWS services

install.packages('paws')
install.packages('tidyverse')

Creating an Athena object:

athena <- paws::athena()

Writting the query and specifyng the output location:

query <- athena$start_query_execution(QueryString = '
                                                    SELECT *
                                                    FROM db.table
                                                    LIMIT 100000
                                                    ',
                                       ResultConfiguration = list(OutputLocation =
                                                                  "s3://aws-athena..."
                                                                  )
                                      )

Executing the query

result <- athena$get_query_execution(QueryExecutionId = query$QueryExecutionId)

Getting the query output:

output <- athena$get_query_results(QueryExecutionId = query$QueryExecutionId)

Parsing into a table object

data <- dplyr::as_tibble(t(matrix(unlist(output$ResultSet$Rows),nrow = length(unlist(output$ResultSet$Rows[1])))))

colnames(data) <- as.character(unlist(data[1, ]))
data <- data[-1, ]

Solution

  • You might want to consider the noctua package. This package connects R to Athena using paws SDK (DBI Interface). It solves the issue you are having with the limitation of 1000 rows. So your above query will look like:

    library(DBI)
    
    con = dbConnect(noctua::athena(), s3_staging_dir = "s3://aws-athena...")
    
    dbGetQuery(con, "SELECT * FROM db.table LIMIT 100000")
    

    The package also offers integration into dplyr:

    library(DBI)
    library(dplyr)
    
    con = dbConnect(noctua::athena(), 
                    schema_name = "db", 
                    s3_staging_dir = "s3://aws-athena...")
    
    tbl(con, "table"))