Suppose I have a data frame with id A
and a couple of factor variables.
library(data.table)
library(dplyr)
mydf <- data.table(
A = as.factor(c("A", "B", "C", "D","E")),
V1 = as.factor(c("x", "x", "y", "x", "y")),
V2 = as.factor(c("u", "v", "w", "v", "u"))
)
In a real example, there can be more columns (say 1000) and the dataset has a couple of million rows, so I really care about performance.
I want to convert it into the following data.frame:
A V1_x V1_y V2_u V2_v V2_w
1 A 1 0 1 0 0
2 B 1 0 0 1 0
3 C 0 1 0 0 1
4 D 1 0 0 1 0
5 E 0 1 1 0 0
What would be the efficient method to implement this conversion?
The way I do it right now is to apply dcast multiple times for each column separately. And rename columns accordingly.
f <- function(x) as.integer(length(x) > 0)
mydf2 =
mydf %>%
dcast(... ~ V1, fun.aggregate = f, value.var = "V1") %>%
dcast(... ~ V2, fun.aggregate = f, value.var = "V2")
colnames(mydf2) = c("A", "V1_x", "V1_y", "V2_u", "V2_v", "V2_w")
That does not seem efficient for many variables if we run it in a loop. In addition, renaming the variables at each step is not very robust if the number of factors varies from variable to variable.
Loop version for the same solution:
mydf2 = mydf
for(tmp_var in c("V1","V2")){
ncol_before = ncol(mydf2)
mydf2 =
mydf2 %>%
dcast( as.formula( sprintf("... ~ %s", tmp_var) ),
fun.aggregate = f, value.var = tmp_var)
ncol_after = ncol(mydf2)
# assign correct names to created vars
colnames(mydf2)[ncol_before:ncol_after ] = paste0(tmp_var, "_", colnames(mydf2)[ncol_before:ncol_after ])
}
mydf2
Here is an option with cSplit_e
library(splitstackshape)
cSplit_e(mydf, 'V1', type = 'character', fill = '0') %>%
cSplit_e('V2', type = 'character', fill = '0')
# A V1 V2 V1_x V1_y V2_u V2_v V2_w
#1: A x u 1 0 1 0 0
#2: B x v 1 0 0 1 0
#3: C y w 0 1 0 0 1
#4: D x v 1 0 0 1 0
#5: E y u 0 1 1 0 0
Or with table
from base R
do.call(cbind, lapply(2:3, function(i) table(mydf$A, mydf[[i]])))
Or the same approach in data.table
syntax
nm1 <- names(mydf)[-1]
out <- mydf[, lapply(.SD, function(x)
as.data.frame.matrix(table(A, x))), .SDcols = nm1]
mydf[, names(out) := out][]
# A V1 V2 V1.x V1.y V2.u V2.v V2.w
#1: A x u 1 0 1 0 0
#2: B x v 1 0 0 1 0
#3: C y w 0 1 0 0 1
#4: D x v 1 0 0 1 0
#5: E y u 0 1 1 0 0