
Can I use itertools.groupby to return groups of lines where the first line starts with a specific character?


I have a text file that looks like this:

    >Start of group
    text1
    text2
    >Start of new group
    text3
I've been trying to use itertools.groupby to build a list of lists, where each inner list contains:

1) a line starting with the ">" character, and

2) the lines of text that follow it, up to the next line starting with the ">" character.

So from the text above, I want to get:

[['>Start of group', 'text1', 'text2'], ['>Start of new group', 'text3']]

The code I have written so far is:

from itertools import groupby

with open(filename) as rfile:
    groups = []

    for key, group in groupby(rfile, lambda x: x.startswith(">")):
        groups.append(list(group))

However, this produces a list of lists where every line of the file is in its own list, like this:

[['>Start of group'], ['text1'], ['text2'], ['>Start of new group'], ['text3']]

I probably just don't understand groupby very well, since this is the first time I've used it, so any explanation is appreciated.


Solution

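  • Here is a way to build those groups without groupby.

    data = []

    # The with block closes the file even if an error occurs.
    with open('fasta.out', 'r') as fin:
        for line in fin:
            line = line.rstrip()

            if line.startswith('>'):
                # A header line starts a new group.
                data.append([line])
            else:
                # Any other line joins the most recent group.
                # This assumes the file begins with a '>' line.
                data[-1].append(line)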
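
If you do want to use groupby, it helps to know that it only collects runs of consecutive items that share the same key. With `x.startswith(">")` as the key, a header line and the text lines after it have different keys, so they can never end up in the same group. One possible workaround is to attach each non-header run to the header that precedes it. The following is only a minimal sketch under some assumptions: the function name `group_by_header` and the `marker` parameter are made up for illustration, blank lines are skipped, and any text before the first header is dropped.

```python
from itertools import groupby


def group_by_header(lines, marker=">"):
    """Collect each marker line together with the lines that follow it.

    Hypothetical helper for illustration; not part of itertools.
    """
    groups = []
    # Strip trailing newlines, then skip blank lines (an assumption).
    stripped = (line.rstrip() for line in lines)
    nonblank = (line for line in stripped if line)
    # groupby yields runs of consecutive lines sharing the same key:
    # True for header lines, False for everything else.
    for is_header, run in groupby(nonblank, key=lambda s: s.startswith(marker)):
        if is_header:
            # Each header opens its own group, even when headers are adjacent.
            groups.extend([header] for header in run)
        elif groups:
            # A non-header run belongs to the most recent header.
            # Text that appears before any header is dropped (an assumption).
            groups[-1].extend(run)
    return groups


sample = [
    ">Start of group\n",
    "text1\n",
    "text2\n",
    ">Start of new group\n",
    "text3\n",
]
print(group_by_header(sample))
```

Because a file object is iterable line by line, you should be able to pass one in directly, for example `with open(filename) as rfile: groups = group_by_header(rfile)`. For this particular task the plain loop above is arguably easier to read; groupby mostly pays off when runs of equal keys are exactly the groups you want.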