I am working on data analysis and came across the following.
Given a triplet data frame consisting of indices i, j and a value v, create a matrix with m[i, j] = v.
A matrix element can have multiple values. These are merged using the function paste0.
idx <-
data.frame( i = c( 1, 2, 3, 4, 4, 1 ),
j = c( 4, 3, 2, 1, 1, 4 ),
v = c( 1, 1, 0, 0,"=", "=")
)
a1 <- tapply(idx$v, idx[1:2], FUN = paste0, collapse="")
#    j
# i   1    2   3   4
#   1 NA   NA  NA  "1="
#   2 NA   NA  "1" NA
#   3 NA   "0" NA  NA
#   4 "0=" NA  NA  NA
Now we omit the collapse argument of paste0 and the result is quite different:
a2 <- tapply(idx$v, idx[1:2], FUN = paste0)
#    j
# i   1           2    3    4
#   1 NULL        NULL NULL character,2
#   2 NULL        NULL "1"  NULL
#   3 NULL        "0"  NULL NULL
#   4 character,2 NULL NULL NULL
Each character,2 cell is a list element holding a character vector of length 2:
# [[1]]
# [1] "1" "="
Could anyone explain this behaviour? Is there an obvious method to convert a2 to a1?
Q: note that in this example NA becomes "NA".
idx <-
data.frame( i = c( 1, 2, 3, 4),
j = c( 4, 3, 2, 1),
v = c( 1, NA, 0, 0)
)
a3 <- tapply(idx$v, idx[1:2], FUN = paste0, collapse="")
#    j
# i   1   2   3    4
#   1 NA  NA  NA   "1"
#   2 NA  NA  "NA" NA
#   3 NA  "0" NA   NA
#   4 "0" NA  NA   NA
A: paste0(NA) equals "NA" instead of NA, because paste0 turns a missing value into the string "NA".
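If a real NA is wanted for such cells, one option is to wrap paste0 in a small function that checks for missing values first. The following is a minimal sketch, not part of the original answer: the helper name paste_keep_na and the object names idx_na and a3_na are made up for illustration, and dropping NAs from groups that also contain real values is an assumption about the desired behaviour.

```r
# Hypothetical helper: collapse a group of values into one string, but
# return a real NA when every value in the group is missing.
# Assumption: NAs inside a group that also has real values are dropped.
paste_keep_na <- function(x) {
  if (all(is.na(x))) NA_character_ else paste0(x[!is.na(x)], collapse = "")
}

idx_na <- data.frame(i = c(1, 2, 3, 4),
                     j = c(4, 3, 2, 1),
                     v = c(1, NA, 0, 0))

a3_na <- tapply(idx_na$v, idx_na[1:2], FUN = paste_keep_na)

a3_na["2", "3"]   # a real NA rather than the string "NA"
a3_na["1", "4"]   # "1"
```

Because the helper still returns exactly one value per group, tapply keeps simplifying the result to a character matrix.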
When FUN= returns a scalar for every call, the default of simplify=TRUE means that the result is turned into a vector, non-indexed positions are assigned NA as an indicator of missingness, and this vector is assigned dimensions (i.e., it is a matrix).
When FUN= can return a length other than 1 for a call, then the internal return value (similar to sapply) is a list. Because it is a list, the indicator of "missingness" is instead a NULL element. Now, tapply still applies dimensions to this, so technically we have a list (with a length equal to the product of the index lengths) where each element may be NULL or a vector of length 1 or more. The way R renders this on the console is as a matrix but with "list-columns".
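The role of simplify= can be seen by switching it off for the scalar case. This is a small sketch using the data from the question; the name a1_list is introduced here for illustration only.

```r
idx <- data.frame(i = c(1, 2, 3, 4, 4, 1),
                  j = c(4, 3, 2, 1, 1, 4),
                  v = c(1, 1, 0, 0, "=", "="))

# Same scalar-returning call as a1, but with simplification switched off
a1_list <- tapply(idx$v, idx[1:2], FUN = paste0, collapse = "",
                  simplify = FALSE)

typeof(a1_list)            # "list"
dim(a1_list)               # 4 4
is.null(a1_list[[1, 1]])   # TRUE: an empty cell is NULL, not NA
a1_list[[1, 4]]            # "1="
```

Even though every call returns a single string, the result stays a list with dimensions, which is the same shape a2 has.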
Looking at the structure (str) of the return values can be helpful:
str(tapply(idx$v, idx[1:2], FUN = paste0, collapse=""))
# chr [1:4, 1:4] NA NA NA "0=" NA NA "0" NA NA "1" NA NA "1=" NA NA NA
# - attr(*, "dimnames")=List of 2
# ..$ i: chr [1:4] "1" "2" "3" "4"
# ..$ j: chr [1:4] "1" "2" "3" "4"
str(tapply(idx$v, idx[1:2], FUN = paste0))
# List of 16
# $ : NULL
# $ : NULL
# $ : NULL
# $ : chr [1:2] "0" "="
# $ : NULL
# $ : NULL
# $ : chr "0"
# $ : NULL
# $ : NULL
# $ : chr "1"
# $ : NULL
# $ : NULL
# $ : chr [1:2] "1" "="
# $ : NULL
# $ : NULL
# $ : NULL
# - attr(*, "dim")= int [1:2] 4 4
# - attr(*, "dimnames")=List of 2
# ..$ i: chr [1:4] "1" "2" "3" "4"
# ..$ j: chr [1:4] "1" "2" "3" "4"
The "list-column" rendering of a length-1 value is the value itself, but when the length is greater than 1 it shows the type and how many elements there are, hence character,2 where there are multiple values.
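To look inside such a cell, double brackets with matrix-style indices pull out the stored vector. A short sketch, again with the question's data; cell_sizes is a name chosen here for illustration.

```r
idx <- data.frame(i = c(1, 2, 3, 4, 4, 1),
                  j = c(4, 3, 2, 1, 1, 4),
                  v = c(1, 1, 0, 0, "=", "="))

a2 <- tapply(idx$v, idx[1:2], FUN = paste0)

a2[[1, 4]]   # the character vector stored at i = 1, j = 4: "1" "="
a2[1, 4]     # single brackets keep the list wrapper around it

# Number of values held in each cell, in column-major order;
# NULL (empty) cells count as 0
cell_sizes <- vapply(a2, length, integer(1))
```

Cells with a size of 2 or more are the ones printed as character,2.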
As for a "method to convert a2 to a1": so long as FUN= returns at least one object of length 2 or more, tapply cannot simplify the result into the form of a1. You need some form of "aggregation"; in this case collapse= serves that purpose.
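If a2 is all you have, the same aggregation can be applied after the fact, cell by cell. This is one possible sketch rather than a built-in conversion; the names collapsed and a1_from_a2 are hypothetical.

```r
idx <- data.frame(i = c(1, 2, 3, 4, 4, 1),
                  j = c(4, 3, 2, 1, 1, 4),
                  v = c(1, 1, 0, 0, "=", "="))

a1 <- tapply(idx$v, idx[1:2], FUN = paste0, collapse = "")
a2 <- tapply(idx$v, idx[1:2], FUN = paste0)

# Collapse each cell to one string; empty (NULL) cells become NA
collapsed <- vapply(a2, function(x) {
  if (is.null(x)) NA_character_ else paste0(x, collapse = "")
}, character(1))

# Put the dimensions and dimnames of a2 back on the flat vector
a1_from_a2 <- array(collapsed, dim = dim(a2), dimnames = dimnames(a2))

all.equal(a1_from_a2, a1)
```

Calling tapply with collapse= in the first place is simpler; this route is mainly useful when the list form has already been computed.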