rapply

Apply returns different values depending on which columns are included


If I have a data frame like this:

df <- cbind.data.frame(c("a", "b", "a", "b", "b"), c(1,0,0,1,0), c(0, NA, 0, 0, 1))

What should I do to return 1 for column 3 regardless of whether I've included the character column?

apply(df, 2, FUN = function(x){sum(x == 1 & !is.na(x))})

Returns 0 for column 3

apply(df[,2:3], 2, FUN = function(x){sum(x == 1 & !is.na(x))})

Returns 1 for column 3


Solution

  • Why apply on the whole data frame gives different results than on the subset (df vs. df[,2:3]).

    See how apply treats the given data when it's heterogeneous (character and numeric):

    apply(df, 2, FUN = function(x) x)
         c("a", "b", "a", "b", "b") c(1, 0, 0, 1, 0) c(0, NA, 0, 0, 1)
    [1,] "a"                        "1"              " 0"
    [2,] "b"                        "0"              NA
    [3,] "a"                        "0"              " 0"
    [4,] "b"                        "1"              " 0"
    [5,] "b"                        "0"              " 1"
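
    The character matrix above can be reproduced without apply. A minimal sketch, assuming (as documented in ?apply) that a data frame is coerced with as.matrix before the function is applied; the names m and col3 are only for illustration:

```r
df <- cbind.data.frame(c("a", "b", "a", "b", "b"), c(1,0,0,1,0), c(0, NA, 0, 0, 1))

# apply() coerces a data frame with as.matrix(); with a character column
# present, the whole matrix becomes character
m <- as.matrix(df)
typeof(m)

# the third column now holds strings, padded to a common width
col3 <- unname(m[, 3])
col3

# a padded string does not compare equal to 1, a trimmed one does
" 1" == 1
trimws(" 1") == 1
```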
    

    Since

    apply(X, MARGIN, ... expects -> X: an array, including a matrix

    the data frame is converted to a matrix first. Because it includes the first character column, the whole result gets cast to character (only data.frame and list can hold different data types). Each numeric column is formatted to a common width: in the 3rd column the max cell length is 2 because of the NA, so all other elements get extended to length 2 by padding with a space (" 1", which is != 1). There is a workaround using trimws, but that's overcomplicating things. Rather,

    use apply on the homogeneous subset, which stays numeric:

    apply(df[,2:3], 2, function(x) x)
         c(1, 0, 0, 1, 0) c(0, NA, 0, 0, 1)
    [1,]                1                 0
    [2,]                0                NA
    [3,]                0                 0
    [4,]                1                 0
    [5,]                0                 1
    

    or use sapply, since we're operating on columns anyway.
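
    A sketch of the sapply variant, reusing df and the counting function from the question; the name counts is only for illustration:

```r
df <- cbind.data.frame(c("a", "b", "a", "b", "b"), c(1,0,0,1,0), c(0, NA, 0, 0, 1))

# sapply() iterates over the columns of the data frame as they are,
# so the numeric columns stay numeric (no conversion to a matrix)
counts <- sapply(df, function(x) sum(x == 1 & !is.na(x)))

# the third element is 1 even though the character column is included
unname(counts)
```

    The same call on df[,2:3] gives the same counts for the two numeric columns, so the result no longer depends on whether the character column is included.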