I'm working with a bioinformatics tool called OrthoFinder, which puts genes from different genomes/species into what are called Orthogroups based on sequences similarity.
One of the output files tells you which genes are in which Orthogroup. Here, the first column has the Orthogroup ID. The next three columns represent three genomes. In the genome columns, each row contains the IDs of genes that fit in each Orthogroup. If there are multiple genes within an Orthogroup for each genome, than the algorithm puts the genes in a list separated by commas.
Orthogroup | genome1 | genome2 | genome3 |
---|---|---|---|
OG0000001 | gene1, gene2, gene3 | gene1 | |
OG0000002 | gene5, gene6 | gene4 | |
OG0000003 | gene7 | gene7, gene8 |
I need a way to separate elements in each column but not drop rows if there are a different number of genes in each row. In other words, each gene should have its own row per genome.
I found information for how to do it with strsplit()
and tidyr::seperate_longer_delim()
. The issue I'm having is that that each column has a different number of genes. In my table above, you can see in the second row, OG0000002, that genome1 is NA. So, if I use strplit()
based on genome1, I lose OG0000002, along with the gene IDs in genome2 and genome3.
When I tried:
separate_longer_delim(c(genome1, genome2, genome3), delim = ",")
I get an error like this:
Error in
separate_longer_delim()
: ! In row 2, can't recycle input of size 2 to size 5.
Based on your sample data, the following solutions will work and should scale up to your full dataset if the data are consistent across all columns.
Using tidyverse
:
library(dplyr)
library(tidyr)
# Your sample data
df <- data.frame(Orthogroup = c("OG0000001", "OG0000002", "OG0000003"),
genome1 = c("gene1, gene2, gene3", NA, "gene1"),
genome2 = c(NA, "gene5, gene6", "gene4"),
genome3 = c("gene7", "gene7, gene8", NA))
df1 <- df |>
pivot_longer(cols = starts_with("genome"),
names_to = "genome",
values_to = "genes") |>
separate_longer_delim(genes, delim = ", ") |>
mutate(row_id = row_number(), .by = c(Orthogroup, genome)) |>
complete(Orthogroup, genome, row_id) |>
pivot_wider(names_from = genome,
values_from = genes) |>
select(-row_id)
df1
# # A tibble: 9 × 4
# Orthogroup genome1 genome2 genome3
# <chr> <chr> <chr> <chr>
# 1 OG0000001 gene1 NA gene7
# 2 OG0000001 gene2 NA NA
# 3 OG0000001 gene3 NA NA
# 4 OG0000002 NA gene5 gene7
# 5 OG0000002 NA gene6 gene8
# 6 OG0000002 NA NA NA
# 7 OG0000003 gene1 gene4 NA
# 8 OG0000003 NA NA NA
# 9 OG0000003 NA NA NA
Alternatively, you can use the cSplit()
function from the splitstackshape
package:
library(splitstackshape)
df1 <- data.frame(cSplit(df,
names(df[2:ncol(df)]), # All columns except the first one
",", # Delimiter
"long", # Rowise/long format
type.convert = "as.character")) # Value type