
In R, does tapply preserve the row order in a dataframe?


In R I have a data frame; a minimal example looks like this:

test <- data.frame(cat=rep(c("A", "B"), each=3), rank=c(1:3, 1:3), data1=5:10, data2=(1:6)^2)

I am calling tapply on it, grouped by the column cat, with a function:

result <- tapply(test, test$cat, function(x) max(abs(cumsum(x$data1) - cumsum(x$data2))))

This function is related to computing the Kolmogorov-Smirnov test statistic. As it contains a cumsum, the order of the rows in the data frame clearly matters. The desired order is the one where the values in the rank column are strictly increasing within each category. In the original test data frame the rows are already ordered this way.

My question is whether this order is preserved when using tapply. tapply applies my function first to the sub-data-frame consisting of all rows where cat="A", then to the sub-data-frame where cat="B". Do these sub-data-frames have the same row order as the corresponding rows in the original data frame?

In this example that is the case, but this doesn't prove that it holds in general. So the question is: how does tapply generate the sub-data-frames, and does this method guarantee that the row order is preserved?


Solution

  • Yes, tapply() is implemented so that row order is preserved within each group. Note that its first argument must be an R object for which a split() method exists. This is because the tapply() source contains the following line:

    ans <- split(X, group)
    

    In your example, this is the equivalent of doing split(test, c(1,1,1,2,2,2)). Note that, unless otherwise specified, the groups themselves are ordered by factor level (alphabetically by default for character input) rather than in the order they appear, so in your case A will be the first group even if B is in the first row. But does split() preserve row order within groups?

    This is not guaranteed in the docs. However, if we look at the C source, we can see that it iterates in order through the elements of a vector (each of which corresponds to a row position in a data frame) and appends each one to the end of the appropriate group. Below is how this is done for an integer vector, though the logic is the same for all other types. I've removed some switch case statements and NA checks which muddle the point, and added comments:

    MOD_ITERATE1(nobs, nfac, i, i1, {
        // get group index
        int j = INTEGER(f)[i1];
        // look up how many elements currently in this group
        _L_int_ k = (_L_int_)_L_INTEG_(counts)[j - 1];
        // add the current element to the last position
        INTEGER(VECTOR_ELT(vec, j - 1))[k] = INTEGER(x)[i];
        // update group count so additional elements go in next slots 
        _L_INTEG_(counts)[j - 1] += 1;
    });
    

    So order is guaranteed at least by the implementation, if not by the docs. Of course, that means it could change. However, if we look at the source of the 1997 implementation, the logic is fundamentally the same.
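
The loop above can be paraphrased at the R level. This is only an illustrative sketch, not the actual implementation: split_sketch is a hypothetical name, and it handles plain atomic vectors only. The last lines rely on my understanding that split.data.frame() splits the row indices seq_len(nrow(x)) and then subsets the data frame with them, which is why vector order translates into row order.

```r
# Hypothetical R-level paraphrase of the C loop (atomic vectors only).
split_sketch <- function(x, f) {
  f <- as.factor(f)
  codes <- as.integer(f)                 # group index of every element
  sizes <- tabulate(codes, nlevels(f))   # final size of each group
  out <- lapply(sizes, function(n) vector(typeof(x), n))
  names(out) <- levels(f)
  counts <- integer(nlevels(f))          # elements placed so far, per group
  for (i in seq_along(x)) {              # walk the input strictly in order
    j <- codes[i]
    if (is.na(j)) next                   # elements with NA group are dropped
    k <- counts[j] + 1L                  # next free slot in group j
    out[[j]][k] <- x[i]                  # append to the end of that group
    counts[j] <- k
  }
  out
}

x <- c(10L, 20L, 30L, 40L, 50L)
f <- c("B", "A", "B", "A", "B")
stopifnot(identical(split_sketch(x, f), split(x, f)))

# For a data frame, the row indices of each group come out increasing.
test <- data.frame(cat = rep(c("A", "B"), each = 3), rank = c(1:3, 1:3),
                   data1 = 5:10, data2 = (1:6)^2)
idx <- split(seq_len(nrow(test)), test$cat)
print(idx)
```

Because each element is written to the next free slot of its group while the input is traversed from first to last, the relative order within a group cannot change.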

    Another reassuring point is that the unsplit() function exists, which means you can do this:

    identical(
        unsplit(
            split(mtcars, mtcars$cyl),
            mtcars$cyl
        ),
        mtcars
    )
    # [1] TRUE
    

    If row order were not preserved, this would not be possible. So, yes, row order is preserved by the current implementation, and it seems unlikely to me that this is going to change without warning.
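
One caveat: what is preserved is the existing row order, not the order of the rank column. If you would rather not depend on the input already being sorted, you can order explicitly. The following is a minimal sketch using the test data from the question. ks_stat is a hypothetical helper name, and split() plus sapply() is used in place of tapply() because, as far as I know, passing a data frame as the first argument of tapply() requires a fairly recent R (4.3.0 or later).

```r
test <- data.frame(cat = rep(c("A", "B"), each = 3), rank = c(1:3, 1:3),
                   data1 = 5:10, data2 = (1:6)^2)

# Hypothetical helper wrapping the function from the question.
ks_stat <- function(x) max(abs(cumsum(x$data1) - cumsum(x$data2)))

# A fixed permutation of the rows, so the input is no longer sorted by rank.
shuffled <- test[c(5, 2, 6, 1, 3, 4), ]

# split() keeps the shuffled order within each group.
split(shuffled, shuffled$cat)$A$rank   # 2 1 3

# Option 1: sort once, up front, by group and rank.
sorted <- shuffled[order(shuffled$cat, shuffled$rank), ]
res_sorted <- sapply(split(sorted, sorted$cat), ks_stat)

# Option 2: sort inside the function, so each group orders itself.
res_inner <- sapply(split(shuffled, shuffled$cat),
                    function(x) ks_stat(x[order(x$rank), ]))

# Reference: the original, already sorted data.
res_ref <- sapply(split(test, test$cat), ks_stat)
stopifnot(identical(res_sorted, res_ref), identical(res_inner, res_ref))
```

Sorting inside the function makes the result independent of how the rows happen to be arranged, at the cost of one order() call per group.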