tidyr

Splitting columns with delimited strings that have different number of elements into rows


I'm working with a bioinformatics tool called OrthoFinder, which puts genes from different genomes/species into what are called Orthogroups based on sequences similarity.

One of the output files tells you which genes are in which Orthogroup. Here, the first column has the Orthogroup ID. The next three columns represent three genomes. In the genome columns, each row contains the IDs of genes that fit in each Orthogroup. If there are multiple genes within an Orthogroup for each genome, than the algorithm puts the genes in a list separated by commas.

Orthogroup genome1 genome2 genome3
OG0000001 gene1, gene2, gene3 gene1
OG0000002 gene5, gene6 gene4
OG0000003 gene7 gene7, gene8

I need a way to separate elements in each column but not drop rows if there are a different number of genes in each row. In other words, each gene should have its own row per genome.

I found information for how to do it with strsplit() and tidyr::seperate_longer_delim(). The issue I'm having is that that each column has a different number of genes. In my table above, you can see in the second row, OG0000002, that genome1 is NA. So, if I use strplit() based on genome1, I lose OG0000002, along with the gene IDs in genome2 and genome3.

When I tried:

separate_longer_delim(c(genome1, genome2, genome3), delim = ",")

I get an error like this:

Error in separate_longer_delim(): ! In row 2, can't recycle input of size 2 to size 5.

Here's what Hadley Wickham had to say about this error.


Solution

  • Based on your sample data, the following solutions will work and should scale up to your full dataset if the data are consistent across all columns.

    Using tidyverse:

    library(dplyr)
    library(tidyr)
    
    # Your sample data
    df <- data.frame(Orthogroup = c("OG0000001", "OG0000002", "OG0000003"),
                     genome1 = c("gene1, gene2, gene3", NA, "gene1"),
                     genome2 = c(NA, "gene5, gene6", "gene4"),
                     genome3 = c("gene7", "gene7, gene8", NA))
    
    df1 <- df |>
      pivot_longer(cols = starts_with("genome"), 
                   names_to = "genome", 
                   values_to = "genes") |>
      separate_longer_delim(genes, delim = ", ") |>
      mutate(row_id = row_number(), .by = c(Orthogroup, genome)) |>
      complete(Orthogroup, genome, row_id) |>
      pivot_wider(names_from = genome, 
                  values_from = genes) |>
      select(-row_id)
    
    df1
    # # A tibble: 9 × 4
    #   Orthogroup genome1 genome2 genome3
    #   <chr>      <chr>   <chr>   <chr>  
    # 1 OG0000001  gene1   NA      gene7  
    # 2 OG0000001  gene2   NA      NA     
    # 3 OG0000001  gene3   NA      NA     
    # 4 OG0000002  NA      gene5   gene7  
    # 5 OG0000002  NA      gene6   gene8  
    # 6 OG0000002  NA      NA      NA     
    # 7 OG0000003  gene1   gene4   NA     
    # 8 OG0000003  NA      NA      NA     
    # 9 OG0000003  NA      NA      NA
    

    Alternatively, you can use the cSplit() function from the splitstackshape package:

    library(splitstackshape)
    
    df1 <- data.frame(cSplit(df,
                             names(df[2:ncol(df)]), # All columns except the first one 
                             ",", # Delimiter
                             "long", # Rowise/long format
                             type.convert = "as.character")) # Value type