pythonpandasregexdataframemultiline

Merge multiline rows in pandas dataframe based on regex pattern


I have a single column dataframe similar to this:

cat = { 'cat': ['a','b','c-','        -d','e']}
df = pd.DataFrame(cat)

>>> print(df)
          cat
0           a
1           b
2          c-
3          -d
4           e

I need to merge lines 2 and 3:

          cat
0           a
1           b
2          cd
3          e

I'm trying to use regex, since these multiline data are always of the form "-\n {8}-". However this doesn't work:

df['cat'] = df['cat'].str.replace("-\n {8}-","",flags=re.M)

I cannot use any grouping techniques, so I'm kinda stuck. Thank you for any suggestions.


Solution

  • You can use a groupby approach by identifying the strings that end with - or start with {8}-:

    m1 = df['cat'].str.endswith('-')
    m2 = df['cat'].str.match(' {8}-')
    group = (~(m1.shift(fill_value=False) & m2)).cumsum().rename()
    
    out = (df['cat'].groupby(group)
           .agg(lambda x: ''.join(x).replace('-        -', ''))
           .to_frame()
          )
    

    Output:

      cat
    1   a
    2   b
    3  cd
    4   e
    

    Intermediates:

              cat     m1     m2  group
    0           a  False  False      1
    1           b  False  False      2
    2          c-   True  False      3
    3          -d  False   True      3
    4           e  False  False      4
    

    Or without pandas, join all strings with a custom separator (e.g. \n, replace then split again):

    import re
    
    out = pd.DataFrame({'col': re.sub(r'-\n {8}-', '', '\n'.join(df['cat']))
                                 .split('\n')})
    

    Output:

      col
    0   a
    1   b
    2  cd
    3   e