I get different results if I use order()
in data.frame
and data.table
. For example:
A <- data.frame(one = c("k"), two = c("3_28","31_60","48_68"))
B <- as.data.table(A)
A[order(A$one,A$two), ]
# one two
# 1 k 3_28
# 2 k 31_60
# 3 k 48_68
B[order(B$one, B$two), ]
# one two
# 1: k 31_60
# 2: k 3_28
# 3: k 48_68
I must admit this was a bit of a nasty shock, as I have assumed equivalent results for order()
from data.frame
and data.table
for many years. I guess there is a lot of code I need to check!
Is there any way to ensure order()
gives the same results in data.frame
and data.table
?
Many apologies if this difference in behavior is already well known, and is just an example of my ignorance.
When used inside of a data.table
operation, order(..)
uses data.table:::forder
. According to the Introduction to data.table vignette:
order()
is internally optimised
We can use "-" on a
character
columns within the frame of adata.table
to sort in decreasing order.In addition,
order(...)
within the frame of adata.table
usesdata.table
's internal fast radix orderforder()
. This sort provided such a compelling improvement over R'sbase::order
that the R project adopted thedata.table
algorithm as its default sort in 2016 for R 3.3.0, see?sort
and the R Release NEWS.
From R News, CHANGES IN R 3.3.0, NEW FEATURES:
The radix sort algorithm and implementation from data.table (
forder
) replaces the previous radix (counting) sort and adds a new method fororder()
. Contributed by Matt Dowle and Arun Srinivasan, the new algorithm supports logical, integer (even with large values), real, and character vectors. It outperforms all other methods, but there are some caveats (see?sort
).
The key to see the difference is that data.table:::forder
uses a "fast radix order". If you look at ?base::order
, though, it has an argument method=
:
the method to be used: partial matches are allowed. The default (
"auto"
) implies"radix"
for numeric vectors, integer vectors, logical vectors and factors with fewer than 2^31 elements. Otherwise, it implies"shell"
. For details of methods"shell"
,"quick"
, and"radix"
, see the help forsort
.
Since the second column of your data.table
is not one of numeric
, integer
, logical
, or factor
, then base::order
uses the "shell"
method for sorting, which produces different results.
However, if we force base::order
to use method="radix"
, we get the same result.
order(A$two)
# [1] 1 2 3
order(A$two, method="radix")
# [1] 2 1 3
A[order(A$one, A$two, method = "radix"),]
# one two
# 2 k 31_60
# 1 k 3_28
# 3 k 48_68
You can affect the same ordering by using base::order
:
B[base::order(B$one,B$two),]
# one two
# <char> <char>
# 1: k 3_28
# 2: k 31_60
# 3: k 48_68
Added note: all method=
options for base::order
depend on locale except method="radix"
, which then mimics data.table::order
more closely. From ?base::order
:
Except for method ‘"radix"’, the sort order for character vectors will depend on the collating sequence of the locale in use...