If I run the following trivial example, I get the expected output:
library(dplyr)
library(dtplyr)
library(data.table)
dt1 <- lazy_dt(data.table(a = 1:5, b = 6:10))
dt2 <- lazy_dt(data.table(a = letters[1:5], b = 6:10))
dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  as.data.table()
>     b a.x a.y
> 1:  6   1   a
> 2:  7   2   b
> 3:  8   3   c
> 4:  9   4   d
> 5: 10   5   e
Note that the conflicting a columns are handled properly, using the standard dplyr convention of adding .x and .y suffixes.
However, if I now try to drop one of the columns:
dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  select(
    -a.y
  ) %>%
  as.data.table()
> Error in is_character(x) : object 'a.y' not found
Interestingly, if I try to select one of the a columns (select(a.x)), I get the same error. But if I instead try select(a) (selecting a column which shouldn't really exist anymore), I get the following output:
dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  select(
    a
  ) %>%
  as.data.table()
>    a.b
> 1:   1
> 2:   2
> 3:   3
> 4:   4
> 5:   5
where the selected column is clearly dt1$a, but for some reason the column is named a.b. (If I try select(a.b), I get the same object not found error.)
Meanwhile, if I try to drop a, both a columns are dropped:
dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  select(
    -a
  ) %>%
  as.data.table()
>     b
> 1:  6
> 2:  7
> 3:  8
> 4:  9
> 5: 10
So, how can I use select with joins where the tables have conflicting columns that are not part of the join key?
EDIT:
As mentioned in some answers, I can obviously execute the lazy evaluation before the select, which works. However, it throws a warning (since I'd like to keep my object as a data.table, not a data.frame), so it doesn't seem to be the intended method:
dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  as.data.table() %>%
  select(
    -a.x
  )
>     b a.y
> 1:  6   a
> 2:  7   b
> 3:  8   c
> 4:  9   d
> 5: 10   e
> Warning message:
> You are using a dplyr method on a raw data.table, which will call the data
> frame implementation, and is likely to be inefficient.
> *
> * To suppress this message, either generate a data.table translation with
> * `lazy_dt()` or convert to a data frame or tibble with
> * `as.data.frame()`/`as_tibble()`.
This is a bug in the current release of dtplyr (1.0.0), but it has now been fixed in the development version.
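Until a release containing the fix is available, one possible workaround is to materialise the join first and then drop the column with data.table's own syntax rather than select(), which should avoid the data-frame fallback warning shown in the question. This is only a sketch: the names joined and result are made up for illustration, and the commented-out install line assumes the development version lives in the tidyverse/dtplyr GitHub repository.

```r
library(dplyr)
library(dtplyr)
library(data.table)

# Optional: install the development version instead of using the workaround.
# Assumption: the repository is tidyverse/dtplyr on GitHub.
# remotes::install_github("tidyverse/dtplyr")

dt1 <- lazy_dt(data.table(a = 1:5, b = 6:10))
dt2 <- lazy_dt(data.table(a = letters[1:5], b = 6:10))

# Materialise the join; this step works, as the first example in the question shows.
joined <- dt1 %>%
  left_join(dt2, by = "b") %>%
  as.data.table()

# Drop the unwanted column by name using data.table indexing, not dplyr::select().
# with = FALSE tells data.table to treat j as a character vector of column names.
result <- joined[, setdiff(names(joined), "a.x"), with = FALSE]

print(result)
```

Because the column is dropped after the join has been evaluated, this does not depend on how the lazy translation names the suffixed columns, and result stays a data.table.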