postgresqldbplyrwrds

How do I get local data into a read-only database using dplyr?


WRDS is a leading provider of research data to academics and other researchers in business and related fields. WRDS provides a PostgreSQL database, but this is a read-only database.

For some tasks, the inability to write data to the database is very constraining. For example, if I want to run an event study using daily stock returns, I will need to merge my (relatively small) local data set events with crsp.dsf, which is about 18GB of data.

One option is to maintain my own database with a copy of crsp.dsf and write events to that database and merge there. But I am looking for an option that allows me to use the WRDS database for this purpose. Unfortunately, there is no way to use copy_to or dbWriteTable, as the WRDS database is read-only.


Solution

  • A more up-to-date answer than the one below is to use the copy_inline function from the dbplyr package, which was added following an issue filed on this topic.


    One option is to use something like the following function, which turns a local data frame into a remote data frame using SQL even when using a read-only connection.

    df_to_pg <- function(df, conn) {
    
        collapse <- function(x) paste0("(", paste(x, collapse = ", "), ")")
    
        names <- paste(DBI::dbQuoteIdentifier(conn, names(df)), collapse = ", ")
    
        values <-
            df %>%
            lapply(DBI::dbQuoteLiteral, conn = conn) %>%
            purrr::transpose() %>%
            lapply(collapse) %>%
            paste(collapse = ",\n")
    
        the_sql <- paste("SELECT * FROM (VALUES", values, ") AS t (", names, ")")
    
        temp_df_sql <- dplyr::tbl(conn, dplyr::sql(the_sql))
        
        return(temp_df_sql)
    }
    

    Here is an illustration of the function in use. Function has been tested on PostgreSQL and SQL Server, but will not work on SQLite (due to lack of VALUES keyword that works in this way). I believe it should work on MySQL or Oracle, as these have the VALUES keyword.

    library(dplyr, warn.conflicts = FALSE)
    library(DBI)
       
    pg <- dbConnect(RPostgres::Postgres())     
    
    events <- tibble(firm_ids = 10000:10024L,
                     date = seq(from = as.Date("2020-03-14"), 
                                length = length(firm_ids), 
                                by = 1))
    events
    #> # A tibble: 25 x 2
    #>    firm_ids date      
    #>       <int> <date>    
    #>  1    10000 2020-03-14
    #>  2    10001 2020-03-15
    #>  3    10002 2020-03-16
    #>  4    10003 2020-03-17
    #>  5    10004 2020-03-18
    #>  6    10005 2020-03-19
    #>  7    10006 2020-03-20
    #>  8    10007 2020-03-21
    #>  9    10008 2020-03-22
    #> 10    10009 2020-03-23
    #> # … with 15 more rows
    
    events_pg <- df_to_pg(events, pg)
    events_pg
    #> # Source:   SQL [?? x 2]
    #> # Database: postgres [iangow@/tmp:5432/crsp]
    #>    firm_ids date      
    #>       <int> <date>    
    #>  1    10000 2020-03-14
    #>  2    10001 2020-03-15
    #>  3    10002 2020-03-16
    #>  4    10003 2020-03-17
    #>  5    10004 2020-03-18
    #>  6    10005 2020-03-19
    #>  7    10006 2020-03-20
    #>  8    10007 2020-03-21
    #>  9    10008 2020-03-22
    #> 10    10009 2020-03-23
    #> # … with more rows
    

    Created on 2021-04-01 by the reprex package (v1.0.0)