
Which DBI function for statements like `create table <tabX> as select * from <tabY>` in R?


I am using package:DBI/package:ROracle.

conn <- dbConnect(ROracle::Oracle(), ...)

I need to create a table from a select query in another table (i.e. a statement like create table <tabX> as select * from <tabY>).

There seem to be several functions that can perform this task, e.g.:

dbSendQuery(conn, "create table tab1 as select * from bigtable")
# Statement:            create table tab1 as select * from bigtable 
# Rows affected:        28196 
# Row count:            0 
# Select statement:     FALSE 
# Statement completed:  TRUE 
# OCI prefetch:         FALSE 
# Bulk read:            1000 
# Bulk write:           1000 

Or:

dbExecute(conn, "create table tab2 as select * from bigtable")
# [1] 28196

Or even:

tab3 <- dbGetQuery(conn, "select * from bigtable")
dbWriteTable(conn = conn, "TAB3", tab3)
# [1] TRUE

Each method seems to work, but I guess there are differences in performance/best practice. What is the best/most efficient way to run statements like `create table <tabX> as select * from <tabY>`?

I did not find any hint in package:DBI and package:ROracle help pages.


Solution

  • Up front: use dbExecute for this; don't use dbSendQuery, since that function signals an expectation of returned data (though it still works).

    dbSendQuery should only be used when you expect data in return; most connections will do just fine even if you misuse it, but that's how it was designed. Instead, use dbSendStatement/dbClearResult, or better yet just dbExecute.

    The following are pairs of perfectly-equivalent pathways:
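    Concretely, a sketch of those equivalences using the standard DBI verbs (assuming an open connection `con` and the tables from the question):

```r
## Query that returns data: these two pathways are equivalent.
dat <- dbGetQuery(con, "select * from bigtable")

res <- dbSendQuery(con, "select * from bigtable")
dat <- dbFetch(res)
dbClearResult(res)

## Statement that returns no data: these two pathways are equivalent.
n <- dbExecute(con, "create table tab1 as select * from bigtable")

res <- dbSendStatement(con, "create table tab1 as select * from bigtable")
n <- dbGetRowsAffected(res)
dbClearResult(res)
```

    The one-call forms (dbGetQuery, dbExecute) do the fetch/row-count and cleanup for you; the dbSend* forms leave the result object open so you can bind and re-execute.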

    If you use dbSend*, always call dbClearResult when done with the statement/fetch. (R will often clean up after you, but if something goes wrong here -- and I have hit this a few times over the last few years -- the connection locks up and you must recreate it. This can leave orphan connections on the database as well.)

    I think most use-cases are a single-query-and-out, meaning dbGetQuery and dbExecute are the easiest to use. However, there are times when you may want to repeat a query. An example from ?dbSendQuery:

         # Pass multiple sets of values with dbBind():
         rs <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = ?")
         dbBind(rs, list(6L))
         dbFetch(rs)
         dbBind(rs, list(8L))
         dbFetch(rs)
         dbClearResult(rs)
    

    (I think it's a little hasty in that documentation to dbFetch without capturing the data ... I would expect dat <- dbFetch(..), discarding the return value here seems counter-productive.)

    One advantage to doing this multi-step (requiring dbClearResult) is with more complex queries: database servers in general tend to "compile" or optimize a query based on its execution engine. This is not always a very expensive step for the server to execute, and it can pay huge dividends on data retrieval. The server often caches this optimized query, and when it sees the same query it uses the already-optimized version of the query. This is one case where using parameter-binding can really help, as the query is identical in repeated use and therefore never needs to be re-optimized.
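    For instance, repeated binding against a single prepared statement might look like this (a sketch; the table `quux(id, val)` is hypothetical and assumed to already exist on `con`):

```r
## One statement, prepared once by the server, executed three times
## with different parameter values; cleaned up once at the end.
rs <- dbSendStatement(con, "INSERT INTO quux (id, val) VALUES (?, ?)")
for (i in 1:3) {
  dbBind(rs, list(i, letters[i]))  # each bind re-executes the statement
}
dbClearResult(rs)
```

    The server sees the identical parameterized statement each time, so it can reuse the optimized plan instead of re-parsing three literal INSERTs.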

    FYI, parameter-binding can be done repeatedly as shown above with dbBind, or in a single call to dbGetQuery via its params= argument. For instance, this equivalent set of expressions will return the same results as above:

    qry <- "SELECT * FROM mtcars WHERE cyl = ?"
    dat6 <- dbGetQuery(con, qry, params = list(6L))
    dat8 <- dbGetQuery(con, qry, params = list(8L))
    

    As for dbWriteTable, for me it's mostly a matter of convenience for quick work. There are times when the DBI/ODBC connection uses the wrong datatype on the server (e.g., SQL Server's DATETIME instead of DATETIMEOFFSET, or NVARCHAR(32) versus VARCHAR(MAX)), so if I need something quickly, I'll use dbWriteTable; otherwise I formally define the table with the server datatypes that I know I want, as in dbExecute(con, "create table quux (...)"). This is far from a "best practice"; it is heavily rooted in preference and convenience. For data that is easy (float/integer/string) and where the server default datatypes are acceptable, dbWriteTable is perfectly fine. One can also use dbCreateTable (which creates the table without uploading data), which lets you specify the fields with a bit more control.
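    A sketch of that last approach (table name, column names, and Oracle types here are illustrative, not from the question):

```r
## Create an empty table with explicitly chosen server-side types,
## then append the data frame into it.
dbCreateTable(con, "QUUX",
              fields = c(id = "NUMBER(10)", label = "VARCHAR2(100)"))
dbAppendTable(con, "QUUX", quux_df)  # quux_df is a hypothetical data frame
```

    This splits the schema decision from the data upload, so you keep dbWriteTable-style convenience without accepting the driver's default type mapping.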