rreshapetidyrreshape2splitstackshape

Need to split a column containing varying numbers of doubly concatenated data of variable names and observations


I have a column "sample_values" with varying numbers of doubly concatenated data delimited with both "," and ":" characters. I need to make the values separated by "," into new variables (columns) and the values separated by ":" the observations of those new variables. A small subset of the problematic data.frame is shown here:

```{r}
> CDR3 <- c("CASSKGTGGPYEQYF", "CASSSDTDPSYGYTF", "CASSFGTGKNTEAFF", "CASSPRPRYYEQYF")
> sample_values <- c("sample_a:36,sample_b:24,sample_c:56", "sample_a:47", "sample_a:73,sample_b:12", "sample_c:76,sample_d:89")
> df <- data.frame(CDR3, sample_values)
> df
             CDR3                       sample_values
1 CASSKGTGGPYEQYF sample_a:36,sample_b:24,sample_c:56
2 CASSSDTDPSYGYTF                         sample_a:47
3 CASSFGTGKNTEAFF             sample_a:73,sample_b:12
4  CASSPRPRYYEQYF             sample_c:76,sample_d:8
```  

I would like to end up with the following result:

```{r}
             CDR3 sample_a sample_b sample_c sample_d
1 CASSKGTGGPYEQYF       36       24       56        0
2 CASSSDTDPSYGYTF       47        0        0        0
3 CASSFGTGKNTEAFF       73       12        0        0
4  CASSPRPRYYEQYF        0        0       76       89
```  

I will note that an absence of an observation should be interpreted as zero.

I've attempted this using various combinations of separate() and spread() from the tidyr package as well as using cSplit() from the splitstackshape package. The tidyr options failed because of differing numbers of observations to separate in the column, and the splitstackshape option failed due to insufficient memory (the unabridged data file is 485 MB in size).


Solution

  • Using tidyverse we can first bring all sample_values into individual rows, then separate column names and values into individual columns and finally spread it to wide format filling missing values with 0.

    library(tidyverse)
    
    df %>%
      separate_rows(sample_values, sep = ",") %>%
      separate(sample_values, into = c("col", "values"), sep = ":") %>%
      spread(col, values, fill = 0)
    
    
    # CDR3            sample_a sample_b sample_c sample_d
    #  <fct>           <chr>    <chr>    <chr>    <chr>   
    #1 CASSFGTGKNTEAFF 73       12       0        0       
    #2 CASSKGTGGPYEQYF 36       24       56       0       
    #3 CASSPRPRYYEQYF  0        0        76       89      
    #4 CASSSDTDPSYGYTF 47       0        0        0