rdataframesortingdata.tableradix-sort

Result of order() differs when applied to data.frame vs. data.table


I get different results if I use order() in data.frame and data.table. For example:

A <- data.frame(one = c("k"), two = c("3_28","31_60","48_68"))
B <- as.data.table(A)

A[order(A$one,A$two), ]
#   one   two
# 1   k  3_28
# 2   k 31_60
# 3   k 48_68


B[order(B$one, B$two), ]
#    one   two
# 1:   k 31_60
# 2:   k  3_28
# 3:   k 48_68

I must admit this was a bit of a nasty shock, as I have assumed equivalent results for order() from data.frame and data.table for many years. I guess there is a lot of code I need to check!

Is there any way to ensure order() gives the same results in data.frame and data.table?

Many apologies if this difference in behavior is already well known, and is just an example of my ignorance.


Solution

  • When used inside of a data.table operation, order(..) uses data.table:::forder. According to the Introduction to data.table vignette:

    order() is internally optimised

    • We can use "-" on a character columns within the frame of a data.table to sort in decreasing order.

    • In addition, order(...) within the frame of a data.table uses data.table's internal fast radix order forder(). This sort provided such a compelling improvement over R's base::order that the R project adopted the data.table algorithm as its default sort in 2016 for R 3.3.0, see ?sort and the R Release NEWS.

    From R News, CHANGES IN R 3.3.0, NEW FEATURES:

    The radix sort algorithm and implementation from data.table (forder) replaces the previous radix (counting) sort and adds a new method for order(). Contributed by Matt Dowle and Arun Srinivasan, the new algorithm supports logical, integer (even with large values), real, and character vectors. It outperforms all other methods, but there are some caveats (see ?sort).

    The key to see the difference is that data.table:::forder uses a "fast radix order". If you look at ?base::order, though, it has an argument method=:

    the method to be used: partial matches are allowed. The default ("auto") implies "radix" for numeric vectors, integer vectors, logical vectors and factors with fewer than 2^31 elements. Otherwise, it implies "shell". For details of methods "shell", "quick", and "radix", see the help for sort.

    Since the second column of your data.table is not one of numeric, integer, logical, or factor, then base::order uses the "shell" method for sorting, which produces different results.

    However, if we force base::order to use method="radix", we get the same result.

    order(A$two)
    # [1] 1 2 3
    order(A$two, method="radix")
    # [1] 2 1 3
    
    A[order(A$one, A$two, method = "radix"),]
    #   one   two
    # 2   k 31_60
    # 1   k  3_28
    # 3   k 48_68
    

    You can affect the same ordering by using base::order:

    B[base::order(B$one,B$two),]
    #       one    two
    #    <char> <char>
    # 1:      k   3_28
    # 2:      k  31_60
    # 3:      k  48_68
    

    Added note: all method= options for base::order depend on locale except method="radix", which then mimics data.table::order more closely. From ?base::order:

    Except for method ‘"radix"’, the sort order for character vectors will depend on the collating sequence of the locale in use...