r dataframe tapply

Why does `tapply` give a different result depending on the arguments passed to FUN?


I am working on data analysis and came across the following. Given a triplet data frame, consisting of indices i, j and value v, create a matrix m[i, j] = v.

A matrix element can have multiple values. These are merged using the function paste0.

idx <-
  data.frame( i   = c(  1, 2, 3, 4,  4, 1  ),
              j   = c(  4, 3, 2, 1,  1, 4  ),
              v   = c(  1, 1, 0, 0,"=", "=")
            )
a1 <- tapply(idx$v, idx[1:2], FUN = paste0, collapse="")
#    j
# i   1    2   3   4   
#   1 NA   NA  NA  "1="
#   2 NA   NA  "1" NA  
#   3 NA   "0" NA  NA  
#   4 "0=" NA  NA  NA  

Now we omit the collapse argument of paste0 and the result is quite different:

a2 <- tapply(idx$v, idx[1:2], FUN = paste0)
#    j
# i   1           2    3    4          
#   1 NULL        NULL NULL character,2
#   2 NULL        NULL "1"  NULL       
#   3 NULL        "0"  NULL NULL       
#   4 character,2 NULL NULL NULL  

Each character,2 cell holds a character vector of length 2. For example, a2[1, 4] returns a list:

# [[1]]
# [1] "1" "="

Could anyone explain this behaviour? Is there an obvious method to convert a2 to a1?

Edit

Q: note that in this example NA becomes "NA".

idx <-
  data.frame( i   = c(  1,  2, 3, 4),
              j   = c(  4,  3, 2, 1),
              v   = c(  1, NA, 0, 0)
            )
a3 <- tapply(idx$v, idx[1:2], FUN = paste0, collapse="")
#    j
# i   1   2   3    4  
#   1 NA  NA  NA   "1"
#   2 NA  NA  "NA" NA 
#   3 NA  "0" NA   NA 
#   4 "0" NA  NA   NA

A: paste0(NA) equals "NA" instead of NA.
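
If a missing value should stay missing rather than become the string "NA", one option is to wrap paste0 in a small helper. The sketch below uses a hypothetical helper, paste_na, which is not part of base R. It assumes that NA values should be dropped before pasting and that a group containing only NA should stay NA; other conventions are equally possible.

```r
# Same data as in the edit above, under a separate name.
idx_na <- data.frame(
  i = c(1, 2, 3, 4),
  j = c(4, 3, 2, 1),
  v = c(1, NA, 0, 0)
)

# Hypothetical helper (not in base R): drop NAs, then collapse what is left.
# A group with nothing left yields a real NA instead of the string "NA".
paste_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_character_ else paste0(x, collapse = "")
}

a4 <- tapply(idx_na$v, idx_na[1:2], FUN = paste_na)

# Distinguishes a missing value from the literal string "NA".
is.na(a4["2", "3"])
```

Because the helper always returns a single string or a single NA, tapply can still simplify the result to a character matrix.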


Solution

  • When FUN= returns a scalar for every call, the default of simplify=TRUE means that the results are turned into a vector, index combinations with no data are assigned NA as an indicator of missingness, and this vector is given dimensions (i.e., it is a matrix).

    When FUN= returns a length other than 1 for any call, the internal return value (similar to sapply) is a list. Because it is a list, the indicator of "missingness" is instead a NULL element. tapply still applies dimensions to this, so technically we have a list (with a length equal to the product of the index lengths) where each element is either NULL or a vector of length 1 or more. R renders this on the console as a matrix, but with "list-columns".

    Looking at the structure of the return values can be helpful:

    str(tapply(idx$v, idx[1:2], FUN = paste0, collapse=""))
    #  chr [1:4, 1:4] NA NA NA "0=" NA NA "0" NA NA "1" NA NA "1=" NA NA NA
    #  - attr(*, "dimnames")=List of 2
    #   ..$ i: chr [1:4] "1" "2" "3" "4"
    #   ..$ j: chr [1:4] "1" "2" "3" "4"
    
    str(tapply(idx$v, idx[1:2], FUN = paste0))
    # List of 16
    #  $ : NULL
    #  $ : NULL
    #  $ : NULL
    #  $ : chr [1:2] "0" "="
    #  $ : NULL
    #  $ : NULL
    #  $ : chr "0"
    #  $ : NULL
    #  $ : NULL
    #  $ : chr "1"
    #  $ : NULL
    #  $ : NULL
    #  $ : chr [1:2] "1" "="
    #  $ : NULL
    #  $ : NULL
    #  $ : NULL
    #  - attr(*, "dim")= int [1:2] 4 4
    #  - attr(*, "dimnames")=List of 2
    #   ..$ i: chr [1:4] "1" "2" "3" "4"
    #   ..$ j: chr [1:4] "1" "2" "3" "4"
    

    In the "list-column" rendering, an element of length 1 is shown as the value itself, but an element of length greater than 1 is shown as its type and length, hence character,2 where there are multiple values.

    As for a "method to convert a2 to a1": as long as FUN= returns at least one object of length 2 or more, tapply cannot simplify the result into an atomic matrix. You need some form of "aggregation" that reduces each group to a single value; in this case collapse= serves that purpose.
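
If a2 already exists, the same aggregation can also be applied after the fact. The following is a minimal sketch using the question's data, not the only way to do it: it collapses each list element to one string, maps empty (NULL) cells to NA, and restores the dimensions. The name a2_as_a1 is introduced here purely for illustration.

```r
# Same data as in the question: (i, j, v) triplets with duplicate (i, j) pairs.
idx <- data.frame(
  i = c(1, 2, 3, 4, 4, 1),
  j = c(4, 3, 2, 1, 1, 4),
  v = c(1, 1, 0, 0, "=", "=")
)

a1 <- tapply(idx$v, idx[1:2], FUN = paste0, collapse = "")
a2 <- tapply(idx$v, idx[1:2], FUN = paste0)

# a2 is a list with dimensions; some cells hold more than one value,
# which is why tapply() could not simplify it.
is.list(a2)        # TRUE
max(lengths(a2))   # 2

# Collapse each cell to a single string; empty (NULL) cells become NA.
collapsed <- vapply(
  a2,
  function(x) if (is.null(x)) NA_character_ else paste0(x, collapse = ""),
  character(1)
)

# vapply() returns a plain vector, so restore the matrix shape and labels.
a2_as_a1 <- array(collapsed, dim = dim(a2), dimnames = dimnames(a2))

# Compare with the matrix produced directly via collapse = "".
all.equal(a2_as_a1, a1)
```

Passing collapse= to tapply directly is simpler; the post-hoc version is mainly useful when the list form is needed for something else first.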