I'm new to programming in R and trying to do a very specific task.
I have a fasta sequence of n samples, which I read in ape
:
library(ape)
matrix <- read.dna(myfasta, format="fasta", as.character=TRUE)
This created a matrix, like so:
| | V1 | V2 | V3 | V4 |...
|------------------------|
|Seq1| a | t | g | c |...
|Seq2| a | t | g | a |...
|Seq3| a | t | c | c |...
|Seq4| t | t | g | a |...
|... |
Where Seq(n) is the DNA sequence for each sample, and V(n) denotes nucleotide position.
How can I select the sequences that bear a certain nucleotide (e.g. "a"), at a certain position (e.g. "V1"), and then return the sequences as a concatenated string?
So for position V1, I'd want to have something like "Seq1, Seq2, Seq3" and for position V4, for the same base, I'd want to have "Seq2, Seq4"
I've tried which()
and filter(matrix, V1 == "a")
but I'm struggling.
Thanks in advance!
The simplest way is to select the V1 == 'a'
rows with boolean indexing, and then extract rownames
:
rownames(example[example[,"V1"] == "a", ]) # "No304" "No306"
You mentioned filter
, which looks like dplyr
. It's a little more cumbersome to use tidyverse methods to manipulate data for which row names are important, as row names are dropped by default.
If you wanted to use filter
, you'd have to first save the row names as an explicit column:
library(dplyr)
as.data.frame(example) %>%
mutate(sequence = rownames(.), .before = everything()) %>%
filter(V1 == "a") %>%
select(sequence)
sequence
1 No304
2 No306
Data (from ape
read.dna docs)
library(ape)
cat(">No305",
"NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
">No304",
"ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
">No306",
"ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
file = "exdna.fas", sep = "\n")
example <- read.dna("exdna.fas", format = "fasta", as.character = TRUE)
colnames(example) <- paste0("V", 1:ncol(example))
example
V1 V2 V3 V4 ...
No305 "n" "t" "t" "c"
No304 "a" "t" "t" "c"
No306 "a" "t" "t" "c"