Full Code is here: https://nekobin.com/yehukomiya
(Don't mind the computeFile() function)
Hi, I've got a text file with repeating chars in binary string like "0101001001111000010010011110010000000" I already got the KMPSearch() function that I use to count how many times a given sequence occurs in my string. I also count in the ex1() function the char that occurs most in the string, but what I want to achieve is find the most frequent sequence and how many times it occurs, and return the n sequences that repeat the most.
e.g.
s = "01010010010001000111101100001010011001111000010010011110010000000"
n = 20
I want to return the following list:
[
(4, ['0001', '0011', '1100' ]),
(5, ['011', '1000', '110' ]),
(6, ['0000', '111']),
(7, ['0010','1001' ]),
(8, ['0100']),
(10,['010']),
(11,['000', '001', '11']),
(12,['100']),
(15,['01','10']),
(23,['00'])
]
Where the firs column is the m-times the sequences occurs, the following column are the actual sequences that repeat themselves m times, and the number of entry in the list is n. It returns all the occurrences because n is less than the number of the occurrences it found. but if n was for example equal to 4 I wanted the list:
[
(11,['000', '001', '11']),
(12,['100']),
(15,['01','10']),
(23,['00'])
]
At the end I would also like to sort the final tuple as shown. Thank you all for your time and patience.
A rather straightforward, unoptimized way to approach this is in 2 steps could be (explanation below):
from collections import defaultdict
s = "01010010010001000111101100001010011001111000010010011110010000000"
min_seq_length = 2
# 1. Count recurring sequences
sequence_dict = defaultdict(lambda: 0)
for seq_start in range(len(s)):
for seq_end in range(seq_start + min_seq_length, len(s)+1):
sequence = s[seq_start:seq_end]
sequence_dict[sequence] += 1
# 2. Group sequences by number of occurrences
grouped_sequences_dict = defaultdict(list)
for sequence, count in sequence_dict.items():
if count > 1:
grouped_sequences_dict[count].append(sequence)
# 3. Sort and filter grouped sequences
sorted_sequences = sorted(list(grouped_sequences_dict.items()))
top_n_results = 4
top_sequences = sorted_sequences[-top_n_results:]
for seq in top_sequences:
print(seq)
output
(11, ['001', '000', '11'])
(12, ['100'])
(15, ['01', '10'])
(23, ['00'])
explanation:
'00'
)grouped_sequences_dict
's key is the amount of occurrencesgrouped_sequences_dict
's value is a list of sequences that occur this amount of timesCould probably be optimized/simplified with a pandas.
My apologies, I just realized you provided a whole bunch of code already. It looks like you are already covering some aspects. Maybe an additional general remark: try to write the code as pythonic as possible, e.g.:
This will make your code much more readable for yourself and for others