I have devised edited a little script that subsamples x lines from an original file. The original file is fasta, which has two lines for every sequence, and the program extracts those x number of sequences (the two lines together). This is how it looks:
#!/usr/bin/env python3
import random
import sys
# How many random sequences do you want?
num = int(input("Enter number of random sequences to select:\n"))
# Import arguments
infile = open(sys.argv[1], "r")
outfile = open(sys.argv[2], "w")
# Define lists
fNames = []
fSeqs = []
# Extract fasta file into the two lists
for line in infile:
if line.startswith(">"):
fNames.append(line.rstrip())
else:
fSeqs.append(line.rstrip())
# Print total number of sequences in the original file
print("There are "+str(len(fNames))+" in the input file")
# Take random items out of the list for the total number of samples required
for j in range(num):
a = random.randint(0, (len(fNames)-1))
print(fNames.pop(a), file = outfile)
print(fSeqs.pop(a), file = outfile)
infile.close()
outfile.close()
input("Done.")
The creation of the lists with the ID and The nucleotides (line 1 and 2 respectively) goes pretty quick, but the print out takes forever. The numbers being extracted can go up to 2M but it starts going slow from the 10000.
I was wondering if there is any way to make it faster. Is the .pop
the issue? Would it be faster if I created the a random list of unique numbers first and then extracted those?
Finally, the terminal does not go back to "normal finished state" after printing Done.
and I do not know why. With all my other scripts I can inmediately type when they are done.
random.sample (which was suggested in a comment) and a dictionary makes the script much faster. Here is the final script:
#!/usr/bin/env python3
import random
import sys
# How many random sequences do you want?
num = int(input("Enter number of random sequences to select:\n"))
# Import arguments
infile = open(sys.argv[1], "r")
outfile = open(sys.argv[2], "w")
# Define list and dictionary
fNames = []
dicfasta = {}
# Extract fasta file into the two lists
for line in infile:
if line.startswith(">"):
fNames.append(line.rstrip())
Id = line.rstrip()
else:
dicfasta[Id] = line.rstrip()
# Print total number of sequences in the original file
print("There are "+str(len(fNames))+" in the input file")
# Create subsamples
subsample = []
subsample = random.sample(fNames, num)
# Take random items out of the list for the total number of samples required
for j in subsample:
print(j, file = outfile)
print(dicfasta[j], file = outfile)
infile.close()
outfile.close()
input("Done.")