Using R, I'd like to write a "reverse translation" function that takes as input a string of amino acids and for each one returns a triplet codon that encodes that amino acid. The codon would be randomly chosen from a frequency table that I will initially hard-code the values for, but I would also like to eventually be able to adjust the frequencies so that "rarer" codons will be more likely to be chosen.
So far I have tried to define a "codon" class using the 61 codons that encode an amino acid in the standard genetic code.
# Define codon class
codon <- function(sequence, aminoacid, frequency) {
new_codon <- list(sequence = sequence, aminoacid = aminoacid, frequency = frequency)
class(new_codon) <- "codon"
new_codon
}
# Define the 61 codon frequency table
# structure: nucleotide sequence, amino acid it encodes, frequency codon is used
codons <- list(
# Alanine (Ala) codons
codon("GCT", "A", 0.26),
codon("GCC", "A", 0.40),
codon("GCA", "A", 0.23),
codon("GCG", "A", 0.11),
# Cysteine (Cys) codons
codon("TGT", "C", 0.45),
codon("TGC", "C", 0.55),
# Aspartic acid (Asp) codons
codon("GAT", "D", 0.46),
codon("GAC", "D", 0.54),
# Glutamic acid (Glu) codons
codon("GAA", "E", 0.42),
codon("GAG", "E", 0.58),
# Phenylalanine (Phe) codons
codon("TTT", "F", 0.45),
codon("TTC", "F", 0.55),
# Glycine (Gly) codons
codon("GGT", "G", 0.16),
codon("GGC", "G", 0.34),
codon("GGA", "G", 0.25),
codon("GGG", "G", 0.25),
# Histidine (His) codons
codon("CAT", "H", 0.41),
codon("CAC", "H", 0.59),
# Isoleucine (Ile) codons
codon("ATT", "I", 0.36),
codon("ATC", "I", 0.48),
codon("ATA", "I", 0.16),
# Lysine (Lys) codons
codon("AAA", "K", 0.42),
codon("AAG", "K", 0.58),
# Leucine (Leu) codons
codon("TTA", "L", 0.07),
codon("TTG", "L", 0.13),
codon("CTT", "L", 0.13),
codon("CTC", "L", 0.20),
codon("CTA", "L", 0.07),
codon("CTG", "L", 0.41),
# Methionine (Met) codon
codon("ATG", "M", 1.0),
# Asparagine (Asn) codons
codon("AAT", "N", 0.46),
codon("AAC", "N", 0.54),
# Proline (Pro) codons
codon("CCT", "P", 0.28),
codon("CCC", "P", 0.33),
codon("CCA", "P", 0.27),
codon("CCG", "P", 0.11),
# Glutamine (Gln) codons
codon("CAA", "Q", 0.25),
codon("CAG", "Q", 0.75),
# Arginine (Arg) codons
codon("CGT", "R", 0.08),
codon("CGC", "R", 0.19),
codon("CGA", "R", 0.11),
codon("CGG", "R", 0.21),
codon("AGA", "R", 0.20),
codon("AGG", "R", 0.20),
# Serine (Ser) codons
codon("TCT", "S", 0.18),
codon("TCC", "S", 0.22),
codon("TCA", "S", 0.15),
codon("TCG", "S", 0.06),
codon("AGT", "S", 0.15),
codon("AGC", "S", 0.24),
# Threonine (Thr) codons
codon("ACT", "T", 0.24),
codon("ACC", "T", 0.36),
codon("ACA", "T", 0.28),
codon("ACG", "T", 0.12),
# Valine (Val) codons
codon("GTT", "V", 0.18),
codon("GTC", "V", 0.24),
codon("GTA", "V", 0.11),
codon("GTG", "V", 0.47),
# Tryptophan (Trp) codon
codon("TGG", "W", 1.0),
# Tyrosine (Tyr) codons
codon("TAT", "Y", 0.43),
codon("TAC", "Y", 0.57)
)
The function I tried below:
# Define function to generate new sequence
random_sequence <- function(protein_sequence) {
# Initialize empty list to store new sequence
new_sequence <- list()
# Iterate over each amino acid in the protein sequence
for (aa in protein_sequence) {
# Get all codons that encode the current amino acid
aa_codons <- codons[sapply(codons, function(x) x$aminoacid == aa)]
# Get the frequencies of the codons
aa_freqs <- sapply(aa_codons, function(x) x$frequency)
# Normalize the frequencies to sum to 1
aa_freqs <- aa_freqs / sum(aa_freqs)
# Select a random codon based on the frequencies
new_codon <- aa_codons[sample(1:length(aa_codons), 1, prob = aa_freqs)]
# Add the new codon to the new sequence
new_sequence <- c(new_sequence, new_codon[[1]][1]$sequence)
}
# Return the new sequence
new_sequence <- paste(new_sequence, collapse = "")
return(new_sequence)
}
# Example usage
protein_sequence <- "MAVVDLKECEILHTWVFPKKKHGEARNDDCQQGKPST"
random_sequence(protein_sequence)
When I run each line of this function individually I get the results I expect. Also, if I set the protein_sequence to be one letter long, it also works. However, when I try to run the function as written, I get the error message:
Error in sum(aa_freqs) : invalid 'type' (list) of argument
I tried to fix this by removing the normalize frequencies to sum to 1 safety check (# aa_freqs <- aa_freqs / sum(aa_freqs)
. But now I get a different error:
Error in sample.int(length(x), size, replace, prob) : incorrect number of probabilities
I am not sure how to troubleshoot the bug given that when I run through the function line by line with a one letter test amino acid, it all seems to work fine.
Any assistance or insight would be greatly appreciated! Thank you very much for your help!
It looks like your function is trying to parse the whole amino acid sequence at once, rather than letter-by-letter. You can use strsplit
or equivalent to split the passed string into component parts and then more or less do as you have indicated:
ReverseTranscribe <- function(protein_sequence) {
paste(sapply(unlist(strsplit(protein_sequence, "")), function(cur_amino) {
# Extract possible codons
potential_matches <- codons[sapply(codons, function(x) {x$aminoacid == cur_amino})]
# Convert to a weights table
weights_table <- data.frame(matrix(sapply(potential_matches, unlist), ncol = 3, byrow = TRUE))
# Weighted sample
chosen_codon <- sample(weights_table[[1]], 1, prob = weights_table[[3]])
return(chosen_codon)
}), collapse = "-")
}
example <- "MAVVDLKECEILHTWVFPKKKHGEARNDDCQQGKPST"
> for (i in 1:3) {
+ print(ReverseTranscribe(example))
+ }
[1] "ATG-GCT-GTC-GTG-GAT-CTC-AAG-GAA-TGT-GAG-ATC-CTG-CAT-ACC-TGG-GTT-TTC-CCC-AAG-AAA-AAG-CAC-GGT-GAG-GCT-CGG-AAT-GAC-GAT-TGC-CAG-CAG-GGC-AAG-CCT-AGC-ACC"
[1] "ATG-GCC-GTG-GTG-GAT-CTG-AAG-GAG-TGC-GAG-ATC-CTG-CAC-ACC-TGG-GTC-TTT-CCT-AAG-AAG-AAA-CAT-GGC-GAG-GCC-CGT-AAT-GAC-GAC-TGC-CAA-CAA-GGG-AAG-CCA-TCA-ACA"
[1] "ATG-GCC-GTG-GTG-GAT-TTG-AAA-GAG-TGC-GAG-ATC-CTG-CAC-ACC-TGG-GTT-TTC-CCC-AAG-AAG-AAG-CAC-GGA-GAG-GCT-CGC-AAC-GAT-GAT-TGT-CAG-CAG-GGC-AAG-CCC-TCT-ACC"