So, I have a dataframe like this:
Doc | String |
---|---|
A | abc |
A | def |
A | ghi |
B | jkl |
B | mnop |
B | qrst |
B | uv |
What I'm trying to do is merge/collapse consecutive rows according to two conditions: the rows belong to the same Doc, and the merged string is no longer than a given max_len.
So that, for example, with max_len == 6 I would get:
Doc | String |
---|---|
A | abcdef |
A | defghi |
B | jkl |
B | mnop |
B | qrstuv |
The output doesn't have to be that strict. To explain the why: I have a document that I was able to split into sentences, and I'd now like to have it in a dataframe with each "new sentence" being of maximal length.
I couldn't find a pure Pandas solution (i.e. do the grouping only by using Pandas methods). You could try the following though:
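def group(col, max_len=6):
    # col holds the string lengths of one Doc group, in row order.
    groups = []
    group = acc = 0  # current group label and its accumulated length
    for length in col.values:
        acc += length
        if max_len < acc:
            # Adding this string would exceed max_len: start a new group.
            group, acc = group + 1, length
        groups.append(group)
    return groups

# Label the rows within each Doc, then join the strings per (Doc, label).
groups = df["String"].str.len().groupby(df["Doc"]).transform(group)
res = df.groupby(["Doc", groups], as_index=False).agg("".join)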
The `group` function takes a column of string lengths for a `Doc` group and builds `groups` that meet the `max_len` condition. Based on that, another `groupby` over `Doc` and `groups` then aggregates the strings.
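To make the grouping step concrete, here is a pandas-free sketch of the same greedy labelling on plain lists of lengths. The name `label_groups` is my own (it mirrors `group` above), and the inputs are the string lengths from the sample:

```python
def label_groups(lengths, max_len=6):
    """Give consecutive items the same label while their total length fits in max_len."""
    labels = []
    label = acc = 0  # current label and the length accumulated under it
    for length in lengths:
        acc += length
        if acc > max_len:
            # The running total overflowed: open a new group starting at this item.
            label, acc = label + 1, length
        labels.append(label)
    return labels


print(label_groups([3, 3, 3]))     # Doc A: abc, def, ghi       -> [0, 0, 1]
print(label_groups([3, 4, 4, 2]))  # Doc B: jkl, mnop, qrst, uv -> [0, 1, 2, 2]
```

Note that this only merges and never splits: a string that is longer than `max_len` on its own simply ends up alone in its group.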
Result for the sample:
Doc String
0 A abcdef
1 A ghi
2 B jkl
3 B mnop
4 B qrstuv
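If you want to reproduce this end to end, something like the following should work. It is only a sketch: it rebuilds the sample frame from the question and stores the labels in a helper column (`chunk`, a name I made up, as is `label_chunks`) instead of passing the Series straight to `groupby`, so the result should not depend on how a given pandas version treats grouping keys that are not columns:

```python
import pandas as pd


def label_chunks(lengths, max_len=6):
    # Same greedy labelling as `group`, under a hypothetical name.
    labels, label, acc = [], 0, 0
    for length in lengths:
        acc += length
        if acc > max_len:
            label, acc = label + 1, length
        labels.append(label)
    return labels


# Sample data from the question.
df = pd.DataFrame({
    "Doc": ["A", "A", "A", "B", "B", "B", "B"],
    "String": ["abc", "def", "ghi", "jkl", "mnop", "qrst", "uv"],
})

# Label rows within each Doc, based on the string lengths.
lengths = df["String"].str.len()
chunk = lengths.groupby(df["Doc"]).transform(label_chunks, max_len=6)

# Join the strings per (Doc, chunk), then drop the helper column again.
res = (
    df.assign(chunk=chunk)
      .groupby(["Doc", "chunk"])["String"]
      .agg("".join)
      .reset_index()
      .drop(columns="chunk")
)
print(res)
```

Since the asker is merging sentences, swapping `"".join` for `" ".join` would keep a space between them, but then the separators also count towards the final length, which the labelling above does not account for.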