pythonlisttuplessequencefind-occurrences

Counting the most frequent sequence occurring in a string


Full Code is here: https://nekobin.com/yehukomiya (Don't mind the computeFile() function) Hi, I've got a text file with repeating chars in binary string like "0101001001111000010010011110010000000" I already got the KMPSearch() function that I use to count how many times a given sequence occurs in my string. I also count in the ex1() function the char that occurs most in the string, but what I want to achieve is find the most frequent sequence and how many times it occurs, and return the n sequences that repeat the most. e.g. s = "01010010010001000111101100001010011001111000010010011110010000000" n = 20 I want to return the following list:

[     
      (4, ['0001', '0011', '1100' ]),
      (5, ['011', '1000', '110' ]),
      (6, ['0000', '111']),
      (7, ['0010','1001' ]),
      (8, ['0100']),
      (10,['010']),
      (11,['000', '001', '11']),
      (12,['100']),
      (15,['01','10']),
      (23,['00'])
]

Where the firs column is the m-times the sequences occurs, the following column are the actual sequences that repeat themselves m times, and the number of entry in the list is n. It returns all the occurrences because n is less than the number of the occurrences it found. but if n was for example equal to 4 I wanted the list:

[
      (11,['000', '001', '11']),
      (12,['100']),
      (15,['01','10']),
      (23,['00'])
]

At the end I would also like to sort the final tuple as shown. Thank you all for your time and patience.


Solution

  • A rather straightforward, unoptimized way to approach this is in 2 steps could be (explanation below):

    from collections import defaultdict
    
    s = "01010010010001000111101100001010011001111000010010011110010000000"
    
    min_seq_length = 2
    
    # 1. Count recurring sequences
    sequence_dict = defaultdict(lambda: 0)
    for seq_start in range(len(s)):
        for seq_end in range(seq_start + min_seq_length, len(s)+1):
            sequence = s[seq_start:seq_end]
            sequence_dict[sequence] += 1
    
    # 2. Group sequences by number of occurrences
    grouped_sequences_dict = defaultdict(list)
    for sequence, count in sequence_dict.items():
        if count > 1:
            grouped_sequences_dict[count].append(sequence)
    
    # 3. Sort and filter grouped sequences
    sorted_sequences = sorted(list(grouped_sequences_dict.items()))
    
    top_n_results = 4
    top_sequences = sorted_sequences[-top_n_results:]
    
    for seq in top_sequences:
        print(seq)
    

    output

    (11, ['001', '000', '11'])
    (12, ['100'])     
    (15, ['01', '10'])
    (23, ['00']) 
    

    explanation:

    1. Count recurring sequences by iterating over all substrings of at least 2 characters and building a dictionary to keep track how many times every unique sequences occurs:
      • sequence_dict's key is a unique sequence (e.g. '00')
      • sequence_dict's value is the amount of occurrences
    2. Group sequences by number of occurrences by iterating over all above sequences and building a dictionary that keeps a list of all sequences, per amount of occurrences:
      • grouped_sequences_dict's key is the amount of occurrences
      • grouped_sequences_dict's value is a list of sequences that occur this amount of times
    3. Sort and filter grouped sequences
      • convert above dictionary to a list of (key,value) pairs
      • sort it (default sorting of a tuple will be done on the first item of the tuple, which is the key of above dictionary, being the amount of times a sequence occurs
      • take the last n (4) items

    Could probably be optimized/simplified with a pandas.

    [edit] your code

    My apologies, I just realized you provided a whole bunch of code already. It looks like you are already covering some aspects. Maybe an additional general remark: try to write the code as pythonic as possible, e.g.:

    This will make your code much more readable for yourself and for others