I have a data.frame in R that contains pairwise combinations of genetic sequence IDs, grouped by cluster name. The list of pairs has been deduplicated and there are no self-comparisons.
I am trying to add the genetic distance for each pair, which is available in a Microsoft SQL Server database table that also has the two ID columns. I have strictly read-only access to the database (I cannot write/create tables at all) and am interacting with it via the {dbplyr}
package.
Normally I would just do a left_join()
to add the distances column, but I am not able to do that, because my first table with the pairwise IDs is in the R environment, and the distances table is on the SQL Server. The distances table has millions of records, so importing the whole thing is not an option.
I can provide a list of values to filter on, but providing my list of IDs separately for Id1
and Id2
is not specific enough, since each ID has been compared with every other ID in the database, and I only need the within-cluster comparisons for a specific subset of clusters and IDs.
What I think I need is some way of filtering the distances I need from the distances table on the server using the intersection of my two lists of IDs, but I don't know how to do this. Any solutions using preferably dbplyr
(or SQL but must be a filter solution) would be very helpful.
Note: I have read some posts regarding filtering on two columns, but the responses all suggest some type of join, which I cannot do because my tables are in two different places and my lack of write permissions in the database.
Here is a toy example of my table with the pairs of IDs:
# Load required libraries:
library(DBI)
library(dbplyr)
library(tidyverse)
# Example data.frame in R containing the pairs of IDs to filter on:
idpairs <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12),
cluster = c(rep("A", 4),
rep("B", 4),
rep("C", 4))) %>%
group_by(cluster) %>%
expand(Id1 = id, Id2 = id) %>%
filter(Id1 < Id2) %>%
ungroup()
This returns the following table:
> idpairs
# A tibble: 18 × 3
cluster Id1 Id2
<chr> <dbl> <dbl>
1 A 1 2
2 A 1 3
3 A 1 4
4 A 2 3
5 A 2 4
6 A 3 4
7 B 5 6
8 B 5 7
9 B 5 8
10 B 6 7
11 B 6 8
12 B 7 8
13 C 9 10
14 C 9 11
15 C 9 12
16 C 10 11
17 C 10 12
18 C 11 12
My connection to the Microsoft SQL Server database is called con
and I created a connection to the distances table within that database using dbplyr::tbl()
which is called distances_tbl
.
Here is what distances_tbl
looks like:
# Example distances table (this one is on the server):
distances_tbl <- data.frame(
id = sample(seq(6000:7000), size = 100),
Id1 = sample(seq(1:50), size = 100, replace = TRUE),
Id2 = sample(seq(1:50), size = 100, replace = TRUE),
Distance = sample(seq(0:30), size = 100, replace = TRUE)
)
> distances_tbl %>%
+ head(n = 10)
id Id1 Id2 Distance
1 477 31 34 23
2 261 36 17 14
3 132 47 34 8
4 184 31 36 19
5 24 7 35 19
6 47 27 5 27
7 759 17 38 18
8 670 21 37 19
9 145 12 38 3
10 29 42 14 30
You can see in the table that a lot of the comparisons are not the ones I need - either because they compare IDs that are not in my dataset at all, or one of the IDs is in my idpairs
data but is paired with another ID which is not.
For example:
As it turns out, I can actually create a temp table, but I was going about it in the wrong way. With my permissions, dbWriteTable(con, idpairs, temp = TRUE)
does not work as I get a 'permission denied' error.
However, I have two options that do work with my read-only permissions. Option 1 is slower, but necessary if one of the tables I want to join only exists in my R environment. Option 2 is much faster but only possible if both tables exist in the same database, or if one of the tables is a sub-table of a table that is already in the same database.
Option 1 - data.frame that I want to join is only in R:
I can write a temporary table to the database with copy_to()
:
# Create a connection to the database:
con <- DBI::dbConnect(odbc::odbc(),
driver = "SQL server",
server = "myserver.net",
database = "wgs",
trusted_connection = TRUE)
# Copy idpairs to database, inner_join distances and collect:
idpairs_with_dist <- copy_to(dest = con,
df = idpairs,
name = "idpairs",
temporary = TRUE) %>%
inner_join(y = distance_tbl %>% select(Id1, Id2, Distance)) %>%
collect()
Option 2 - data.frame that I want to join is derived from another table in the same database:
Alternatively, since idpairs
is actually created from another table in the same database, I can join idpairs
to other tables that are also in the database, so long as I don't collect()
them into R until the joins are complete. For example, I can create idpairs
directly from entry_tbl
, keep it in the database and join it to the distance_tbl
(which is also in the same database) then collect()
at the end:
# Suppose there is a table called 'entrytable' already in the database:
entrytable <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12),
cluster = c(rep("A", 4),
rep("B", 4),
rep("C", 4)))
# Create connection to entrytable:
entry_tbl <- tbl(con, "entrytable")
# Create the pairwise ID comparison from this table in the database:
idpairs <- entry_tbl %>%
group_by(cluster) %>%
expand(Id1 = id, Id2 = id) %>%
filter(Id1 < Id2) %>%
ungroup()
# Create connection to 'distances' table in the same database:
distance_tbl <- tbl(con, "distances")
# Join the temp table 'idpairs' to 'distance_tbl' in the database:
cluster_distances <- distance_tbl %>%
inner_join(y = idpairs, by = join_by(Id1, Id2)) %>%
collect() # Only bring results into R at the end