python string bioinformatics biopython fasta

How to randomly insert a list of sequences into a text body


This is one of my first tasks with actual Python code.

What I need to do is import a set of FASTA sequences, select those with a length between 400 and 500 characters (base pairs), and randomly pick 100 of them to insert, at random positions, into another text body (a FASTA genome).

One constraint is that none of those 100 sequences should be inserted inside another, so that each of them can still be queried for afterwards. Ideally, I would also know the position where each sequence has been inserted.

This is my attempt at the problem; apologies for the limited coding skills, I'm no expert with the language. Any help is much appreciated! I'm mostly struggling with the last part: I managed to insert one sequence into another, but I cannot do the same with a list object. See below for the code and the error.

###library import
from Bio import SeqIO
import inspect
import random

###sequences handling
input_file = open("hs_sequences.txt")
my_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
#my_dict

###compute sequence length
l = []
for i in my_dict.values():
   l.append(len(i))
#l

###select sequences based on range-length estimates of abundance
seq_of_choice = [seq for seq in my_dict.values() if 400 < len(seq) < 500]

###import FASTA
def fasta_reader(filename):
  from Bio.SeqIO.FastaIO import FastaIterator
  with open(filename) as handle:
    for record in FastaIterator(handle):
      yield record

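def custom_print(string):
  # print the string in lines of 60 characters
  counter=0
  res=""
  for char in string:
    if counter==60:
      print(res)
      counter=0
      res=""
    # always keep the current character (a "continue" here would drop it)
    res+=char
    counter+=1
  # print the last, possibly shorter, line
  if res:
    print(res)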

for entry in fasta_reader("hg37_chr1-1000l.fna"):
  print(str(entry.id))
  custom_print(str(entry.seq))

body = str(entry.seq)

###import full genome
#example_full = SeqIO.index("hg37_23-only.fna", "fasta")
#example_full

###randomly selects 100 sequences and adds them to the FASTA
def insert (source_str, insert_str, pos):
    return source_str[:pos] + insert_str + source_str[pos:]

i = 1
hundred_seqs = []
while i <= 100:
    hundred_seqs.append(random.choice(seq_of_choice))
    i += 1
#hundred_seqs

custom_print(insert(body, hundred_seqs, random.randint(0, len(body))))

In this case the error is

TypeError: can only concatenate str (not "list") to str

but, of course, it works if I use hundred_seqs[1]

I can provide a link to the files if necessary, the .txt is quite small but the FASTA genome isn't...

EDIT

Working code to add 100 strings (of a specified length) from an original set into a body of text, randomly and without overlapping. It seems to do what I expect; however, I would also like to print where each of those 100 sequences is placed in the body of text. Thanks in advance!

###randomly selects 100 strings and adds them to the FASTA
def insert (source_str, insert_str, pos):
    return source_str[:pos] + insert_str + source_str[pos:]

def get_string_text(genome, all_strings):
    string_of_choice = [string for string in all_strings if 400 < len(string) < 500]
    hundred_strings = random.sample(string_of_choice, k=100)

    text_of_strings = []
    for k in range(len(hundred_strings)):
        text_of_strings.append(str(hundred_strings[k].seq))

    single_string = ",".join(text_of_strings)
    new_genome = insert(genome, single_string, random.randint(0, len(genome)))
    
    return new_genome

big_genome = get_string_text(body, s)
#len(big_genome)
#custom_print(big_genome)

chr1_retros = "\n".join([head, big_genome])
#len(chr1_retros)
#print(chr1_retros)

Solution

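  • You have to convert the list of strings into a single string, for example by joining them without a separator (if the items are SeqRecord objects rather than plain strings, join str(item.seq) for each item instead)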

    one_string = "".join(hundred_seqs)
    

    and later you can use it

    custom_print(insert(body, one_string, random.randint(0, len(body))))
    

    Or maybe you want to put every element in a different place; then you have to use a for-loop.

    But this can cause a problem, because the next item can accidentally be inserted inside one of the previous items.

    result = body
    
    for item in hundred_seqs:
        result = insert(result, item, random.randint(0, len(result)))
    
    custom_print(result)
    

    To avoid that, you can draw all the random positions first and then insert starting from the largest position, so that each insertion leaves the remaining (smaller) positions unchanged.

    positions = random.sample(range(0, len(body)), k=100)
    
    positions = sorted(positions, reverse=True)  # sorted from the biggest number to the smallest number
    
    result = body
    
    for item, pos in zip(hundred_seqs, positions):
        result = insert(result, item, pos)
    
    custom_print(result)
    

    BTW:

    It is simpler to use a for-loop (one line) instead of a while-loop (three lines)

    hundred_seqs = []
    for i in range(100):
        hundred_seqs.append(random.choice(seq_of_choice))
    

    But even simpler is to use choices (with the letter s) instead of choice (without the s)

    hundred_seqs = random.choices(seq_of_choice, k=100)
    

    But both versions can repeat the same value in the list.

    If you don't want repeated values, it is better to use

    hundred_seqs = random.sample(seq_of_choice, k=100)
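
As for knowing where each sequence ends up: the following is a minimal sketch, not code from the question, that builds on the position-based approach above. It assumes plain Python strings (for SeqRecord objects you would pass str(rec.seq) for each record), and the function name insert_without_nesting and the toy data are made up for illustration. It draws distinct cut points in the original body, so no inserted string can land inside another, and it records the start index of each insert in the final string.

```python
import random


def insert_without_nesting(body, inserts, rng=random):
    """Insert each string at a distinct random cut point of the original body.

    Returns (new_string, placements), where placements is a list of
    (start, inserted_string) tuples and start is a 0-based index
    into new_string. Hypothetical helper, for illustration only.
    """
    # Distinct cut points in the ORIGINAL body, in ascending order.
    # len(body) + 1 also allows appending after the last character.
    cuts = sorted(rng.sample(range(len(body) + 1), k=len(inserts)))

    pieces = []      # alternating body fragments and inserted strings
    placements = []  # (start index in the result, inserted string)
    prev = 0         # end of the body fragment already copied
    offset = 0       # total length of everything inserted so far
    for cut, item in zip(cuts, inserts):
        pieces.append(body[prev:cut])
        # Everything inserted earlier shifts this insert to the right.
        placements.append((cut + offset, item))
        pieces.append(item)
        offset += len(item)
        prev = cut
    pieces.append(body[prev:])
    return "".join(pieces), placements


# Toy data standing in for the genome and the selected sequences.
body = "ACGT" * 50
inserts = ["nnnnn", "xxxxxxx", "yyy"]

result, placements = insert_without_nesting(body, inserts)
for start, item in placements:
    # start is inclusive, start + len(item) is exclusive
    print(start, start + len(item), item)
```

Because the cut points are distinct, at least one character of the original body separates any two inserted strings. random.sample raises ValueError if there are more strings to insert than available cut points, which should not be an issue for 100 sequences in a chromosome-sized body.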