r, snowflake-cloud-data-platform, odbc, dbi

Getting a SQL compilation error "Compilation memory exhausted." when trying to push an R data frame to Snowflake


I'm using dbWriteTable to create a database table in Snowflake from R. This works nicely when I push a data frame of 350,000 rows, but fails with "SQL compilation error: Compilation memory exhausted." when I try to push a data frame with around 3,000,000 rows.

I'd like to understand why this is happening. Obviously something to do with the size of the data frame, but I can't figure out what. When I save the larger data frame to a csv, it's less than 200MB.

I've looked at showMethods("dbWriteTable", includeDefs = TRUE) and reviewed the logs, but there's nothing obvious showing why this fails. This Snowflake community post suggests it's due to a complex query or TB size data sets. I don't have TBs of data, so maybe it's a query size issue? There's a limit of 1MB for the query size, but the data frame with 350,000 rows is much larger than this when saved as a csv, and I had no problems there.

To be clear, this is the only code I'm running:

dbWriteTable(conn = myconn,
             name = table_id,
             value = my_df,
             overwrite = TRUE)

Solution

  • The Snowflake ODBC driver has various performance optimisations it can take advantage of when dealing with bulk data sets.

    When inserting a large number of rows it will typically export the data to temporary gzipped CSV file(s), PUT them to a temporary internal stage in Snowflake, and then perform an efficient INSERT from the file(s). You should be able to observe this behaviour by looking at Query History within the Snowsight UI, or you can run a query like this within R over your ODBC session to return the query history.

    select  QUERY_TEXT, TOTAL_ELAPSED_TIME, START_TIME 
    from  table(information_schema.query_history_by_session()) 
    order by start_time desc;
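
    For example, from the same R session (assuming your connection object is myconn, as in the question; any open ODBC connection to the same Snowflake session will do):

    # Pull the recent query history for this session over the existing connection
    history <- DBI::dbGetQuery(myconn, "
      select QUERY_TEXT, TOTAL_ELAPSED_TIME, START_TIME
      from table(information_schema.query_history_by_session())
      order by start_time desc")
    head(history$QUERY_TEXT)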
    

    You should see a series of SQL statements something like:

    1. Drop Table ;
    2. Create Table ....
    3. Put file:// .... [1:n of these]
    4. Insert Into (col1, col2, ... coln) values (?,?,...?)
    5. Commit

    If you see something different, it may be that your user does not have the required permissions to use the optimised method, e.g. it cannot create a temporary stage.
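
    One way to check, from the same R session, is to list the grants on your current role. This is a minimal sketch using the myconn connection from the question; CURRENT_ROLE() and SHOW GRANTS are standard Snowflake commands:

    # List the privileges granted to the role this session is using
    my_role <- DBI::dbGetQuery(myconn, "select CURRENT_ROLE()")[[1]]
    DBI::dbGetQuery(myconn, sprintf("SHOW GRANTS TO ROLE %s", my_role))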

    There is an option that you can set to control the number of rows processed in each batch. You can check the setting using getOption("odbc.batch_rows").

    If you do not set this explicitly it should return NULL. When NULL, the behaviour is to execute a single INSERT statement containing all the rows to be inserted, and this can exceed memory and statement size limits.

    If you set it explicitly (e.g. options(odbc.batch_rows = 500000)) the behaviour is to perform inserts of the requested batch size; steps 3 and 4 above then repeat for each batch until all rows have been inserted. You can use this to stop any single batch from becoming too large, as in the sketch below.
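
    A minimal sketch of capping the batch size for one load and then restoring the previous setting (the 500,000-row figure is just an illustrative choice):

    # Cap the number of rows per INSERT batch for this load, then restore the
    # previous setting (NULL means one INSERT for the whole data frame)
    old_opt <- getOption("odbc.batch_rows")   # NULL unless previously set
    options(odbc.batch_rows = 500000)         # illustrative batch size
    DBI::dbWriteTable(conn = myconn, name = table_id, value = my_df, overwrite = TRUE)
    options(odbc.batch_rows = old_opt)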

    I used the following code to test this:

    # Build a 5,000,000-row x 10-column test data frame of random doubles
    ncols <- 10
    nrows <- 5000000
    ncell <- ncols * nrows
    
    data_for_ingest <- as.data.frame(matrix(rnorm(ncell), ncol = ncols, nrow = nrows))
    
    options(odbc.batch_rows = 5000000)
    
    table_id <- DBI::Id(catalog = "SF_TEST", schema = "R_INGEST", table = sprintf("DATA_INGEST_%s_%s", ncols, nrows))
    
    # Time the write; odbcCon is an existing ODBC connection to Snowflake
    system.time(
       DBI::dbWriteTable(conn      = odbcCon,
                         name      = table_id,
                         value     = data_for_ingest,
                         overwrite = TRUE) )
    

    I observed the following results:

    odbc.batch_rows   # Inserts   Seconds
    NULL                      1       119
    500000                   10       156
    1000000                   5       128
    2500000                   2       122
    5000000                   1        99

    Larger batch sizes are slightly more efficient here, but might exceed memory limits. The number of columns and the size and data types of the values in the dataset will also have a bearing on performance and limits.
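
    As a rough rule of thumb (my own heuristic, not anything documented by the odbc package), you can size the batch from the in-memory footprint of the data frame, e.g. aiming for a few hundred MB per batch:

    # Rough sizing sketch: pick a batch size so that each batch is roughly
    # target_mb of in-memory data; target_mb is an arbitrary illustrative value
    target_mb  <- 250
    mb_per_row <- as.numeric(object.size(data_for_ingest)) / nrow(data_for_ingest) / 1024^2
    options(odbc.batch_rows = max(1, floor(target_mb / mb_per_row)))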

    If you're working with very large datasets that you want to load into Snowflake efficiently, you can also export them to CSV or Parquet, PUT them to a stage and use COPY INTO to load them, e.g.:

    # Write the data frame to a local CSV; PUT gzips it on upload by default
    write.csv(data_for_ingest, paste0(tempdir(), "/", "data_for_ingest.csv"), row.names = FALSE)
    
    DBI::dbExecute(odbcCon, 'CREATE TABLE "DATA_INGEST_PUT_COPY" (
      "V1" DOUBLE PRECISION,
      "V2" DOUBLE PRECISION,
      "V3" DOUBLE PRECISION,
      "V4" DOUBLE PRECISION,
      "V5" DOUBLE PRECISION,
      "V6" DOUBLE PRECISION,
      "V7" DOUBLE PRECISION,
      "V8" DOUBLE PRECISION,
      "V9" DOUBLE PRECISION,
      "V10" DOUBLE PRECISION
      )')
    
    DBI::dbExecute(odbcCon, 'CREATE STAGE my_int_stage')
    
    # Upload the CSV to the internal stage (note the space before @my_int_stage)
    DBI::dbExecute(odbcCon, paste0('PUT file://', tempdir(), '/data_for_ingest.csv @my_int_stage'))
    
    DBI::dbExecute(odbcCon, "CREATE OR REPLACE FILE FORMAT my_csv_format
      TYPE = CSV
      FIELD_DELIMITER = ','
      SKIP_HEADER = 1
      NULL_IF = ('NULL', 'null')
      EMPTY_FIELD_AS_NULL = true
      COMPRESSION = gzip")
    
    DBI::dbExecute(odbcCon, "COPY INTO DATA_INGEST_PUT_COPY FROM @my_int_stage FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')")
    

    You may need your Snowflake database administrator to grant your user/role permissions to create some of these objects (stage, file format, table) within your Snowflake account if your user does not have them already.
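
    A minimal sketch of the kind of grants an administrator might run, assuming a hypothetical role name MY_ROLE, a hypothetical administrator connection adminCon, and the SF_TEST.R_INGEST schema used above (adjust names and privileges to your environment):

    # Hypothetical example grants, run over an administrator connection (adminCon)
    DBI::dbExecute(adminCon, "GRANT USAGE ON DATABASE SF_TEST TO ROLE MY_ROLE")
    DBI::dbExecute(adminCon, "GRANT USAGE, CREATE TABLE, CREATE STAGE, CREATE FILE FORMAT ON SCHEMA SF_TEST.R_INGEST TO ROLE MY_ROLE")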