python pandas data-science sentence

Collapse pandas rows that satisfy a list of conditions


I have a DataFrame of this form:

Doc String
A abc
A def
A ghi
B jkl
B mnop
B qrst
B uv

What I'm trying to do is merge/collapse rows according to two conditions:

Rows are merged only within the same Doc, and a merged string may be at most max_len characters long.

So that, for example, with max_len == 6 I would get:

Doc String
A abcdef
A ghi
B jkl
B mnop
B qrstuv

The output doesn't have to be that strict. To explain why: I have a document that I was able to split into sentences, and I'd now like to have it in a DataFrame with each "new sentence" being of maximal length.


Solution

  • I couldn't find a pure pandas solution (i.e. one that does the grouping using only pandas methods). You could try the following though:

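    import pandas as pd

    df = pd.DataFrame({
        "Doc": list("AAABBBB"),
        "String": ["abc", "def", "ghi", "jkl", "mnop", "qrst", "uv"],
    })

    def group(col, max_len=6):
        # col holds the string lengths of one Doc; return one group number per row.
        groups = []
        group_id = acc = 0
        for length in col.values:
            acc += length
            if max_len < acc:
                # This row would push the total past max_len: start a new group.
                group_id, acc = group_id + 1, length
            groups.append(group_id)
        return groups

    # Group number for every row, counted from 0 within each Doc.
    groups = df["String"].str.len().groupby(df["Doc"]).transform(group)
    # Join the strings that share both a Doc and a group number.
    res = df.groupby(["Doc", groups], as_index=False).agg("".join)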
    

    The group function takes the column of string lengths for one Doc and assigns each row a group number, starting a new group whenever adding the next string would push the combined length past max_len. A second groupby over Doc and these group numbers then joins the strings of each group.

    Result for the sample:

      Doc  String
    0   A  abcdef
    1   A     ghi
    2   B     jkl
    3   B    mnop
    4   B  qrstuv
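
To see how the group function partitions the sample, it can help to look at the intermediate group numbers before the strings are joined. The following is only a sketch: it repeats the group function from the answer, the names trace and groups_7 are introduced here purely for illustration, and the second call assumes that transform forwards extra keyword arguments (here max_len) to the function.

```python
import pandas as pd

df = pd.DataFrame({
    "Doc": list("AAABBBB"),
    "String": ["abc", "def", "ghi", "jkl", "mnop", "qrst", "uv"],
})


def group(col, max_len=6):
    # Same greedy rule as in the answer: one group number per row of a Doc.
    groups = []
    group_id = acc = 0
    for length in col.values:
        acc += length
        if max_len < acc:
            group_id, acc = group_id + 1, length
        groups.append(group_id)
    return groups


lengths = df["String"].str.len()
groups = lengths.groupby(df["Doc"]).transform(group)

# Side-by-side view of each row, its length and its group number (illustrative only).
trace = df.assign(length=lengths, group=groups)
print(trace)

# A different limit, passed through transform as a keyword argument.
groups_7 = lengths.groupby(df["Doc"]).transform(group, max_len=7)
print(df.groupby(["Doc", groups_7], as_index=False).agg("".join))
```

With max_len=6 the group numbers for the sample should come out as 0, 0, 1 for Doc A and 0, 1, 2, 2 for Doc B, which matches the result shown above. Note that a string longer than max_len on its own is not split; it simply ends up in a group by itself.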