I have aligned 2 DNAStringSet, and use the Biostring
package. I would like to extract all the portions of the sequences that match perfectly and are at least 10 nucleotides (nt) long, as well as the position of the match in both the pattern and the query sequence.
library('Biostrings')
set.seed(123)
dna_alphabet <- DNA_ALPHABET[1:4]
seq1_random <- sample(dna_alphabet, 100, replace = TRUE)
seq2_random <- sample(dna_alphabet, 100, replace = TRUE)
match_12 <- sample(dna_alphabet, 12, replace = TRUE)
match_15 <- sample(dna_alphabet, 15, replace = TRUE)
match_23 <- sample(dna_alphabet, 23, replace = TRUE)
seq1_random[15:(15+11)] <- match_12
seq2_random[17:(17+11)] <- match_12
seq1_random[41:(41+14)] <- match_15
seq2_random[40:(40+14)] <- match_15
seq1_random[70:(70+22)] <- match_23
seq2_random[73:(73+22)] <- match_23
seq1 <- DNAString(paste(seq1_random, collapse = ""))
seq2 <- DNAString(paste(seq2_random, collapse = ""))
alignment <- pairwiseAlignment(pattern = seq1, subject = seq2, type = "global")
alignment
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: GGGCGCCCGATCCA--CCTATTCGTTTGTCGCACGTCAGGATATTTCACGTGAGTCATCACAAT----TGACAAGACGACCATTTCTAACATGAATTCACTAACGG
subject: ATCATCAGGTTCTGACCCTATTCGTTTGA---ATGCCCTCCCATTTCACGTGAGTCAAGGGAGGCCCGTGGCACTACGACCATTTCTAACATGAATTCGGTAT---
score: -138.1462
> df_results
perfect_match start_pos_pattern end_pos_pattern start_pos_subject end_pos_subject
1 CCTATTCGTTTG 15 26 17 28
2 ATTTCACGTGAGTCA 41 55 40 54
3 ACGACCATTTCTAACATGAATTC 70 92 73 95
My actual alignment is >15,000 nt long, so I need to find an efficient way to proceed. I am more familiar with the {tidyverse}
than {data.table}
options, if possible.
What I found so far is: Extract the aligned sequences:
aligned_pattern_str <- as.character(alignment@pattern)
aligned_pattern_str
[1] "GGGCGCCCGATCCA--CCTATTCGTTTGTCGCACGTCAGGATATTTCACGTGAGTCATCACAAT----TGACAAGACGACCATTTCTAACATGAATTCACTAA"
aligned_subject_str <- as.character(alignment@subject)
aligned_subject_str
[1] "ATCATCAGGTTCTGACCCTATTCGTTTGA---ATGCCCTCCCATTTCACGTGAGTCAAGGGAGGCCCGTGGCACTACGACCATTTCTAACATGAATTCGGTAT"
Then: I use a rolling window to compare the first character in aligned_pattern_str and the first character in aligned_subject_str. If they don´t match, I compare the second of each etc until they match. Once they match: I keep comparing the next characters until there is a mismtach. Then, if the match is >10 nucleotide, I report it, otherwise Idont. And I keep proceeding until the entire sequences have been covered.
So my code to that step:
match_start <- NULL
match_length <- 0
matches <- list()
for (i in 1:nchar(aligned_pattern_str)) {
if (substr(aligned_pattern_str, i, i) == substr(aligned_subject_str, i, i) && substr(aligned_pattern_str, i, i) != "-") && {
if (is.null(match_start)) {
match_start <- i
}
match_length <- match_length + 1
} else {
if (!is.null(match_start) && match_length > 10) {
matches[[length(matches) + 1]] <- list(start = match_start, end = i - 1, length = match_length)
}
match_start <- NULL
match_length <- 0
}
}
if (!is.null(match_start) && match_length > 10) {
matches[[length(matches) + 1]] <- list(start = match_start, end = nchar(aligned_pattern_str), length = match_length)
}
for (match in matches) {
cat(sprintf("Match from %d to %d, Length: %d\n", match$start, match$end, match$length))
}
>
Match from 17 to 28, Length: 12
Match from 43 to 57, Length: 15
Match from 76 to 98, Length: 23
It is not the final answer:
Here is what I eventually found. It might not be the most efficient way to go,especially over long alignments, and it uses a loop, which I was told to avoid in R. But it gets the job done:
First, I extract the aligned sequences:
> aligned_pattern_str <- as.character(alignment@pattern)
> aligned_pattern_str
[1] "GGGCGCCCGATCCA--CCTATTCGTTTGTCGCACGTCAGGATATTTCACGTGAGTCATCACAAT----TGACAAGACGACCATTTCTAACATGAATTCACTAA"
> aligned_subject_str <- as.character(alignment@subject)
> aligned_subject_str
[1] "ATCATCAGGTTCTGACCCTATTCGTTTGA---ATGCCCTCCCATTTCACGTGAGTCAAGGGAGGCCCGTGGCACTACGACCATTTCTAACATGAATTCGGTAT"
I create 3 variables:
match_start <- NULL
match_length <- 0
matches <- list()
Then I create a function to correct the positions with the number of gaps:
count_gaps_before <- function(seq, pos) {
sub_seq <- substr(seq, 1, pos)
char_vec <- strsplit(sub_seq, "")[[1]]
gap_count <- sum(char_vec == "-")
return(gap_count)
}
And I loop over the aligned sequences:
for (i in 1:nchar(aligned_pattern_str)) {
if (substr(aligned_pattern_str, i, i) == substr(aligned_subject_str, i, i) && substr(aligned_pattern_str, i, i) != "-") {
if (is.null(match_start)) {
match_start <- i
}
match_length <- match_length + 1
} else {
if (!is.null(match_start) && match_length >= 10) {
pattern_gaps <- count_gaps_before(aligned_pattern_str, match_start)
subject_gaps <- count_gaps_before(aligned_subject_str, match_start)
matching_sequence <- substr(aligned_pattern_str, match_start, match_start + match_length - 1)
matching_sequence <- gsub("-", "", matching_sequence) # Remove gaps from the matching sequence
matches[[length(matches) + 1]] <- list(
start_pattern = match_start - pattern_gaps,
end_pattern = (match_start + match_length - 1) - count_gaps_before(aligned_pattern_str, match_start + match_length - 1),
start_subject = match_start - subject_gaps,
end_subject = (match_start + match_length - 1) - count_gaps_before(aligned_subject_str, match_start + match_length - 1),
length = match_length,
sequence = matching_sequence
)
}
match_start <- NULL
match_length <- 0
}
}
if (!is.null(match_start) && match_length >= 10) {
pattern_gaps <- count_gaps_before(aligned_pattern_str, match_start)
subject_gaps <- count_gaps_before(aligned_subject_str, match_start)
matching_sequence <- substr(aligned_pattern_str, match_start, nchar(aligned_pattern_str))
matching_sequence <- gsub("-", "", matching_sequence)
matches[[length(matches) + 1]] <- list(
start_pattern = match_start - pattern_gaps,
end_pattern = nchar(aligned_pattern_str) - count_gaps_before(aligned_pattern_str, nchar(aligned_pattern_str)),
start_subject = match_start - subject_gaps,
end_subject = nchar(aligned_subject_str) - count_gaps_before(aligned_subject_str, nchar(aligned_subject_str)),
length = match_length,
sequence = matching_sequence
)
}
I then print the list of matches:
for (match in matches) {
cat(sprintf("Match from %d to %d in pattern, from %d to %d in subject, Length: %d, Sequence: %s\n",
match$start_pattern, match$end_pattern, match$start_subject, match$end_subject, match$length, match$sequence))
}
And I transform the list into a data frame:
matches_df <- do.call(rbind, lapply(matches, function(x) {
data.frame(
sequence = x$sequence,
from_in_pattern = x$start_pattern,
to_in_pattern = x$end_pattern,
from_in_subject = x$start_subject,
to_in_subject = x$end_subject,
stringsAsFactors = FALSE # Ensure character data remains as character type
)
}))
The results is:
> matches_df
sequence from_in_pattern to_in_pattern from_in_subject to_in_subject
1 CCTATTCGTTTG 15 26 17 28
2 ATTTCACGTGAGTCA 41 55 40 54
3 ACGACCATTTCTAACATGAATTC 70 92 73 95