I have quite a large DataFrame (50+ million rows) with one column containing DNA sequences (one sequence per row). Some of these sequences contain a mix of lowercase and uppercase letters. I would like to keep only the sequences that are at least 50% uppercase (i.e., drop the sequences that are more than 50% lowercase). I ran my filter on a small subset of the DataFrame and it took about 2 minutes just to filter the sequences, so I am hoping to find a more efficient way before scaling up.
Example of my DF:
label  sequence
1      aaaggGtTt...
0      AAAggccCCC...
Here is the function I am using:
def remove_low_complexity_seqs(sequence, threshold=0.5):
    """
    Check whether more than a given threshold proportion of the sequence is lowercase (low complexity).

    Args:
    - sequence (str): The nucleotide sequence.
    - threshold (float): The proportion threshold (default is 0.5 for 50%).

    Returns:
    - bool: True if more than the threshold proportion is lowercase, otherwise False.
    """
    lowercase_count = sum(map(str.islower, sequence))
    proportion = lowercase_count / 10000  # all sequences are 10k characters long
    return proportion > threshold
Code I ran:
mask = control_seqs['sequence'].apply(lambda seq: not remove_low_complexity_seqs(seq))  # long runtime, ~115 s
control_seqs = control_seqs[mask]  # quick runtime
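For reference, a minimal stand-in for control_seqs (made-up data, 10,000-character sequences as described) that the snippet above can be timed against:

import pandas as pd

# Hypothetical two-row stand-in; the real frame has 50M+ rows
control_seqs = pd.DataFrame({
    'label': [1, 0],
    'sequence': ['acgt' * 2500, 'ACGT' * 2500],  # 10,000 characters each
})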
Assuming there are only the letters "acgtACGT", these alternatives seem 10-30 times faster (Version 4 being the fastest):
Version 1:
lowercase_count = sum(map(sequence.count, 'acgt'))
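This replaces the per-character Python-level loop with four C-level str.count scans, one per lowercase base.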
Version 2:
lowercase_count = sum(map(sequence.encode().count, b'acgt'))
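Iterating over b'acgt' yields integer byte values, which bytes.count accepts directly, so the sequence is encoded once and then scanned four times at byte level.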
Version 3, with lower_to_a = bytes.maketrans(b'cgt', b'aaa') prepared once, before your function:
lowercase_count = sequence.encode().translate(lower_to_a).count(b'a')
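The translation table rewrites c, g, and t to a, so a single count(b'a') returns the total lowercase count in one scan.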
Version 4, with del_upper = str.maketrans('', '', 'ACGT') prepared once, before your function:
lowercase_count = len(sequence.translate(del_upper))
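For concreteness, here is a minimal sketch of Version 4 dropped into your function (using len(sequence) instead of the hard-coded 10000, which gives the same result when every sequence is 10k characters long):

del_upper = str.maketrans('', '', 'ACGT')  # prepared once; deletes uppercase bases

def remove_low_complexity_seqs(sequence, threshold=0.5):
    """Return True if more than `threshold` of the sequence is lowercase."""
    # After deleting A, C, G, T, only lowercase characters remain,
    # so the length of the result is the lowercase count.
    lowercase_count = len(sequence.translate(del_upper))
    return lowercase_count / len(sequence) > threshold

mask = ~control_seqs['sequence'].apply(remove_low_complexity_seqs)
control_seqs = control_seqs[mask]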