I have a single column dataframe similar to this:
cat = { 'cat': ['a','b','c-',' -d','e']}
df = pd.DataFrame(cat)
>>> print(df)
cat
0 a
1 b
2 c-
3 -d
4 e
I need to merge lines 2 and 3:
cat
0 a
1 b
2 cd
3 e
I'm trying to use regex, since these multiline data are always of the form "-\n {8}-". However this doesn't work:
df['cat'] = df['cat'].str.replace("-\n {8}-","",flags=re.M)
I cannot use any grouping techniques, so I'm kinda stuck. Thank you for any suggestions.
You can use a groupby
approach by identifying the strings that end with -
or start with {8}-
:
m1 = df['cat'].str.endswith('-')
m2 = df['cat'].str.match(' {8}-')
group = (~(m1.shift(fill_value=False) & m2)).cumsum().rename()
out = (df['cat'].groupby(group)
.agg(lambda x: ''.join(x).replace('- -', ''))
.to_frame()
)
Output:
cat
1 a
2 b
3 cd
4 e
Intermediates:
cat m1 m2 group
0 a False False 1
1 b False False 2
2 c- True False 3
3 -d False True 3
4 e False False 4
Or without pandas, join
all strings with a custom separator (e.g. \n
, replace then split
again):
import re
out = pd.DataFrame({'col': re.sub(r'-\n {8}-', '', '\n'.join(df['cat']))
.split('\n')})
Output:
col
0 a
1 b
2 cd
3 e