I am not understanding why the order of keys in data tables being joined makes a difference.
Consider the following (expected) result of joining tmp_dt_1
to tmp_dt_2
:
tmp_dt_1 <-
data.table(a = c(1, 2), b = c(2, 1), c = 1:2)
tmp_dt_2 <-
data.table(a = c(1, 2), b = c(2, 1), d = 1:2)
setkeyv(tmp_dt_1, c("a", "b"))
setkeyv(tmp_dt_2, c("a", "b"))
tmp_dt_1
# Key: <a, b>
# a b c
# <num> <num> <int>
# 1: 1 2 1
# 2: 2 1 2
tmp_dt_2
# Key: <a, b>
# a b d
# <num> <num> <int>
# 1: 1 2 1
# 2: 2 1 2
tmp_dt_1[tmp_dt_2]
# Key: <a, b>
# a b c d
# <num> <num> <int> <int>
# 1: 1 2 1 1
# 2: 2 1 2 2
Now, key tmp_dt_2
with column b
followed by a
. This produces a join that is different (and wrong) compared to the above:
setkeyv(tmp_dt_2, c("b", "a"))
tmp_dt_1
# Key: <a, b>
# a b c
# <num> <num> <int>
# 1: 1 2 1
# 2: 2 1 2
tmp_dt_2
# Key: <b, a>
# a b d
# <num> <num> <int>
# 1: 2 1 2
# 2: 1 2 1
tmp_dt_1[tmp_dt_2]
# Key: <a, b>
# a b c d
# <num> <num> <int> <int>
# 1: 1 2 1 2
# 2: 2 1 2 1
In What are primary and secondary indexes in data.table?, the data.table
docs say:
setkey(DT, col1, col2)
orders the rows by columncol1
then within each group ofcol1
it orders bycol2
. This is a primary index. The row order is changed by reference in RAM... We use the words index and key interchangeably.
(Emphasis mine)
The keys determine row order in RAM, which allows for low-level optimisations that take advantage of this order. For example, grouping operations work on data that is contiguous in RAM, which can minimise page fetches and copy memory in bulk.
This means that order matters. So when you specify the keys in the order that you do in the second case, you are actually doing this:
tmp_dt_1[
tmp_dt_2,
on = .(a = b, b = a)
]
We can see this more clearly if we only set one key:
setkey(tmp_dt_1, a)
setkey(tmp_dt_2, b)
all(
tmp_dt_1[tmp_dt_2] ==
tmp_dt_1[tmp_dt_2, on = .(a)]
) # FALSE
all(
tmp_dt_1[tmp_dt_2] ==
tmp_dt_1[tmp_dt_2, on = .(a = b)]
) # TRUE
If you're concerned about this, the easiest thing would be to set keys using a vector which is the same for both data.table
s e.g.
keys <- c("a", "b")
setkeyv(tmp_dt_1, keys)
setkeyv(tmp_dt_2, keys)
Having said that, you should not generally set a key if all you are going to do is join. There is no performance advantage unless you're going to do repeated operations grouped by the primary key. For more about what happens when you set a key, see What is the purpose of setting a key in data.table? It's worth noting that that answer, written by one of the authors of data.table
, says:
In most cases therefore, there shouldn't be a need to set keys anymore. We recommend using
on=
wherever possible, unless setting key has a dramatic improvement in performance that you'd like to exploit.