I have a text file that looks like this:
>Start of group
text1
text2
>Start of new group
text3
I've been trying to use itertools.groupby to build a list of lists, where each inner list contains:
1) the line starting with the ">" character.
2) the lines of text following the line starting with the ">" character, up to the next line starting with the ">" character.
So from the text above, I want to get:
[['>Start of group', 'text1', 'text2'], ['>Start of new group', 'text3']]
The code I have written so far is:
from itertools import groupby

with open(filename) as rfile:
    groups = []
    for key, group in groupby(rfile, lambda x: x.startswith(">")):
        groups.append(list(group))
However, this produces a list of lists where every line of the file is in its own list, like this:
[['>Start of group'], ['text1'], ['text2'], ['>Start of new group'], ['text3']]
I probably just don't understand groupby very well, since this is the first time I've used it, so any explanation is appreciated.
Here is a way to get your data without the groupby function.
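data = []
with open('fasta.out', 'r') as fin:
    for line in fin:
        line = line.rstrip()
        if line.startswith('>'):
            # A header line starts a new group.
            data.append([line])
        else:
            # Any other line joins the most recent group.
            # This assumes the file begins with a '>' line.
            data[-1].append(line)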
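If you would still like to use groupby, it helps to know that it only collects consecutive items that share the same key. With the key x.startswith(">"), a header line therefore never ends up in the same group as the text lines that follow it: the runs alternate between key True (headers) and key False (text). Below is a minimal sketch that stitches those runs back together. The name read_groups and the sample list are only illustrative, and text appearing before the first ">" line is assumed to be unwanted and is skipped.

```python
from itertools import groupby


def read_groups(lines):
    """Group each '>' header line with the text lines that follow it."""
    groups = []
    stripped = (line.rstrip() for line in lines)
    for is_header, run in groupby(stripped, key=lambda x: x.startswith(">")):
        if is_header:
            # Several headers in a row each start their own group.
            groups.extend([header] for header in run)
        elif groups:
            # Text lines belong to the most recent header;
            # text before the first header is skipped.
            groups[-1].extend(run)
    return groups


sample = [">Start of group\n", "text1\n", "text2\n", ">Start of new group\n", "text3\n"]
print(read_groups(sample))
```

An open file can be passed in the same way as the sample list, for example by calling read_groups(rfile) inside the with open(filename) as rfile: block from the question.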