rsql-serverfilterdbplyr

Join column from a table in SQL Server without write permissions to a data.frame in R using dbplyr


I have a data.frame in R that contains pairwise combinations of genetic sequence IDs, grouped by cluster name. The list of pairs has been deduplicated and there are no self-comparisons.

I am trying to add the genetic distance for each pair, which is available in a Microsoft SQL Server database table that also has the two ID columns. I have strictly read-only access to the database (I cannot write/create tables at all) and am interacting with it via the {dbplyr} package.

Normally I would just do a left_join() to add the distances column, but I am not able to do that, because my first table with the pairwise IDs is in the R environment, and the distances table is on the SQL Server. The distances table has millions of records, so importing the whole thing is not an option.

I can provide a list of values to filter on, but providing my list of IDs separately for Id1 and Id2 is not specific enough, since each ID has been compared with every other ID in the database, and I only need the within-cluster comparisons for a specific subset of clusters and IDs.

What I think I need is some way of filtering the distances I need from the distances table on the server using the intersection of my two lists of IDs, but I don't know how to do this. Any solutions using preferably dbplyr (or SQL but must be a filter solution) would be very helpful.

Note: I have read some posts regarding filtering on two columns, but the responses all suggest some type of join, which I cannot do because my tables are in two different places and my lack of write permissions in the database.

Here is a toy example of my table with the pairs of IDs:

# Load required libraries:
library(DBI)
library(dbplyr)
library(tidyverse)

# Example data.frame in R containing the pairs of IDs to filter on:
  idpairs <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12),
                        cluster = c(rep("A", 4), 
                                    rep("B", 4), 
                                    rep("C", 4))) %>% 
    group_by(cluster) %>% 
    expand(Id1 = id, Id2 = id) %>% 
    filter(Id1 < Id2) %>% 
    ungroup()

This returns the following table:

> idpairs
# A tibble: 18 × 3
   cluster   Id1   Id2
   <chr>   <dbl> <dbl>
 1 A           1     2
 2 A           1     3
 3 A           1     4
 4 A           2     3
 5 A           2     4
 6 A           3     4
 7 B           5     6
 8 B           5     7
 9 B           5     8
10 B           6     7
11 B           6     8
12 B           7     8
13 C           9    10
14 C           9    11
15 C           9    12
16 C          10    11
17 C          10    12
18 C          11    12

My connection to the Microsoft SQL Server database is called con and I created a connection to the distances table within that database using dbplyr::tbl() which is called distances_tbl.

Here is what distances_tbl looks like:

  # Example distances table (this one is on the server):
  distances_tbl <- data.frame(
    id = sample(seq(6000:7000), size = 100),
    Id1 = sample(seq(1:50), size = 100, replace = TRUE),
    Id2 = sample(seq(1:50), size = 100, replace = TRUE),
    Distance = sample(seq(0:30), size = 100, replace = TRUE)
  )

>  distances_tbl %>% 
+     head(n = 10)
    id Id1 Id2 Distance
1  477  31  34       23
2  261  36  17       14
3  132  47  34        8
4  184  31  36       19
5   24   7  35       19
6   47  27   5       27
7  759  17  38       18
8  670  21  37       19
9  145  12  38        3
10  29  42  14       30

You can see in the table that a lot of the comparisons are not the ones I need - either because they compare IDs that are not in my dataset at all, or one of the IDs is in my idpairs data but is paired with another ID which is not.

For example:


Solution

  • As it turns out, I can actually create a temp table, but I was going about it in the wrong way. With my permissions, dbWriteTable(con, idpairs, temp = TRUE) does not work as I get a 'permission denied' error.

    However, I have two options that do work with my read-only permissions. Option 1 is slower, but necessary if one of the tables I want to join only exists in my R environment. Option 2 is much faster but only possible if both tables exist in the same database, or if one of the tables is a sub-table of a table that is already in the same database.

    Option 1 - data.frame that I want to join is only in R:

    I can write a temporary table to the database with copy_to():

    # Create a connection to the database:
    con <- DBI::dbConnect(odbc::odbc(),
                          driver = "SQL server",
                          server = "myserver.net",
                          database = "wgs",
                          trusted_connection = TRUE)
    
    # Copy idpairs to database, inner_join distances and collect:
    idpairs_with_dist <- copy_to(dest = con, 
                                 df = idpairs,
                                 name = "idpairs",
                                 temporary = TRUE) %>% 
        inner_join(y = distance_tbl %>% select(Id1, Id2, Distance)) %>% 
        collect()
    

    Option 2 - data.frame that I want to join is derived from another table in the same database:

    Alternatively, since idpairs is actually created from another table in the same database, I can join idpairs to other tables that are also in the database, so long as I don't collect() them into R until the joins are complete. For example, I can create idpairs directly from entry_tbl, keep it in the database and join it to the distance_tbl (which is also in the same database) then collect() at the end:

    # Suppose there is a table called 'entrytable' already in the database:
    entrytable <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12),
                             cluster = c(rep("A", 4), 
                                         rep("B", 4), 
                                         rep("C", 4)))
    
    # Create connection to entrytable:
    entry_tbl <- tbl(con, "entrytable")
    
    # Create the pairwise ID comparison from this table in the database:
    idpairs <- entry_tbl %>%
                group_by(cluster) %>% 
                expand(Id1 = id, Id2 = id) %>% 
                filter(Id1 < Id2) %>% 
                ungroup()
    
    # Create connection to 'distances' table in the same database:
    distance_tbl <- tbl(con, "distances")
    
    # Join the temp table 'idpairs' to 'distance_tbl' in the database:
    cluster_distances <- distance_tbl %>%
                          inner_join(y = idpairs, by = join_by(Id1, Id2)) %>%
                          collect() # Only bring results into R at the end