I have a column "sample_values" with varying numbers of doubly concatenated data delimited with both "," and ":" characters. I need to make the values separated by "," into new variables (columns) and the values separated by ":" the observations of those new variables. A small subset of the problematic data.frame is shown here:
```{r}
> CDR3 <- c("CASSKGTGGPYEQYF", "CASSSDTDPSYGYTF", "CASSFGTGKNTEAFF", "CASSPRPRYYEQYF")
> sample_values <- c("sample_a:36,sample_b:24,sample_c:56", "sample_a:47", "sample_a:73,sample_b:12", "sample_c:76,sample_d:89")
> df <- data.frame(CDR3, sample_values)
> df
CDR3 sample_values
1 CASSKGTGGPYEQYF sample_a:36,sample_b:24,sample_c:56
2 CASSSDTDPSYGYTF sample_a:47
3 CASSFGTGKNTEAFF sample_a:73,sample_b:12
4 CASSPRPRYYEQYF sample_c:76,sample_d:8
```
I would like to end up with the following result:
```{r}
CDR3 sample_a sample_b sample_c sample_d
1 CASSKGTGGPYEQYF 36 24 56 0
2 CASSSDTDPSYGYTF 47 0 0 0
3 CASSFGTGKNTEAFF 73 12 0 0
4 CASSPRPRYYEQYF 0 0 76 89
```
I will note that an absence of an observation should be interpreted as zero.
I've attempted this using various combinations of separate()
and spread()
from the tidyr
package as well as using cSplit()
from the splitstackshape
package. The tidyr
options failed because of differing numbers of observations to separate in the column, and the splitstackshape
option failed due to insufficient memory (the unabridged data file is 485 MB in size).
Using tidyverse
we can first bring all sample_values
into individual rows, then separate
column names and values into individual columns and finally spread
it to wide format filling missing values with 0.
library(tidyverse)
df %>%
separate_rows(sample_values, sep = ",") %>%
separate(sample_values, into = c("col", "values"), sep = ":") %>%
spread(col, values, fill = 0)
# CDR3 sample_a sample_b sample_c sample_d
# <fct> <chr> <chr> <chr> <chr>
#1 CASSFGTGKNTEAFF 73 12 0 0
#2 CASSKGTGGPYEQYF 36 24 56 0
#3 CASSPRPRYYEQYF 0 0 76 89
#4 CASSSDTDPSYGYTF 47 0 0 0