pythonperformancefastasubsampling

How to speed up (fasta) subsampling program for Python?


I have devised edited a little script that subsamples x lines from an original file. The original file is fasta, which has two lines for every sequence, and the program extracts those x number of sequences (the two lines together). This is how it looks:

#!/usr/bin/env python3
import random
import sys
# How many random sequences do you want?
num = int(input("Enter number of random sequences to select:\n"))

# Import arguments
infile = open(sys.argv[1], "r")
outfile = open(sys.argv[2], "w")

# Define lists
fNames = []
fSeqs = []
# Extract fasta file into the two lists
for line in infile:
    if line.startswith(">"):
        fNames.append(line.rstrip())
    else:
        fSeqs.append(line.rstrip())

# Print total number of sequences in the original file
print("There are "+str(len(fNames))+" in the input file")

# Take random items out of the list for the total number of samples required
for j in range(num):
    a = random.randint(0, (len(fNames)-1))
    print(fNames.pop(a), file = outfile)
    print(fSeqs.pop(a), file = outfile)

infile.close()
outfile.close()
input("Done.")

The creation of the lists with the ID and The nucleotides (line 1 and 2 respectively) goes pretty quick, but the print out takes forever. The numbers being extracted can go up to 2M but it starts going slow from the 10000.

I was wondering if there is any way to make it faster. Is the .pop the issue? Would it be faster if I created the a random list of unique numbers first and then extracted those?

Finally, the terminal does not go back to "normal finished state" after printing Done. and I do not know why. With all my other scripts I can inmediately type when they are done.


Solution

  • random.sample (which was suggested in a comment) and a dictionary makes the script much faster. Here is the final script:

    #!/usr/bin/env python3
    import random
    import sys
    # How many random sequences do you want?
    num = int(input("Enter number of random sequences to select:\n"))
    
    # Import arguments
    infile = open(sys.argv[1], "r")
    outfile = open(sys.argv[2], "w")
    
    # Define list and dictionary
    fNames = []
    dicfasta = {}
    # Extract fasta file into the two lists
    for line in infile:
        if line.startswith(">"):
            fNames.append(line.rstrip())
            Id = line.rstrip()
        else:
            dicfasta[Id] = line.rstrip()
    
    # Print total number of sequences in the original file
    print("There are "+str(len(fNames))+" in the input file")
    
    # Create subsamples
    subsample = []
    subsample = random.sample(fNames, num)
    
    # Take random items out of the list for the total number of samples required
    for j in subsample:
        print(j, file = outfile)
        print(dicfasta[j], file = outfile)
    
    infile.close()
    outfile.close()
    input("Done.")