I have a dataframe as follows:
df <- data.frame(A = c(2, 0, 1), B = c(0, 3, 2))
# A B
# 1 2 0
# 2 0 3
# 3 1 2
The number in each cell indicates the times for which the corresponding column name should repeat. The replicates should be spliced by semicolons(;
) to a single string. The expected output turns out to
# A B
# 1 A;A <NA>
# 2 <NA> B;B;B
# 3 A B;B
I'm searching a efficient way to deal with a much larger dataset:
set.seed(1234)
df <- as.data.frame(matrix(sample(0:5, 1e4*26, replace = TRUE), 1e4, 26))
names(df) <- LETTERS
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1 3 0 3 5 4 0 3 2 2 1 3 3 4 3 2 4 0 1 4 5 2 5 5 2 0 0
# 2 1 5 1 0 3 3 2 0 1 5 5 2 5 0 2 5 1 1 2 4 5 5 0 5 0 0
# 3 5 5 2 0 1 4 5 4 0 5 5 1 1 1 2 2 4 5 4 5 5 5 0 4 0 0
# ...
# [ reached 'max' / getOption("max.print") -- omitted 9997 rows ]
I prefer base
or tidyverse
solutions. data.table
is welcome but I'm unfamiliar to it though.
One base
option:
res <- df
n <- unlist(res)
res[res > 0] <- sapply(
split(rep.int(names(df)[col(df)], n), rep.int(seq_along(n), n)),
paste, collapse = ';'
)
res[res == 0] <- NA
res[1:5, 1:5]
# A B C D E
# 1 A;A;A <NA> C;C;C D;D;D;D;D E;E;E;E
# 2 A B;B;B;B;B C <NA> E;E;E
# 3 A;A;A;A;A B;B;B;B;B C;C <NA> E
# 4 A;A;A;A B;B;B;B <NA> D E;E;E;E
# 5 A;A;A B C;C;C;C D;D E;E;E