pythonpandas

How to conditionally extract strings using different regex rules depending on column value in Python


I'm trying to extract strings using different regex rules conditional on some column value in Python. For example, given the following dataframe:

                 col1   col2
1          John Smith  First
2          Jane Smith  First
3 Pritchard James Doe Second
4    Helen Joanne Doe Second
5         Walker Jean   Last
6         Hall Jensen   Last

I want to obtain:

                 col1   col2   col3
1          John Smith  First   John
2          Jane Smith  First   Jane
3 Pritchard James Doe Second  James
4    Helen Joanne Doe Second Joanne
5         Walker Jean   Last   Jean
6         Hall Jensen   Last Jensen

In R I would do something like:

library(tidyverse)
df <- data.frame(col1 = c("John Smith", "Jane Smith", "Pritchard James Doe", "Helen Joanne Doe", "Walker Jean", "Hall Jensen"),
                 col2 = c("First", "First", "Second", "Second", "Last", "Last"))
df %>%
    mutate(col3 = case_when(col2 == "First" ~ str_extract(col1, "^[A-Za-z]+"),
                            col2 == "Second" ~ str_extract(col1, "(?<=\\s+)[A-Za-z]+"),
                            col2 == "Last" ~ str_extract(col1, "[A-Za-z]+$")))

However, I'm unsure how to achieve this in Python. I've tried to use case_when in pandas but been unsuccessful in getting it to work, attempting to use a lambda function or pd.Series.str.extract

import pandas as pd
import re
d = {'col1': ['John Smith', 'Jane Smith', 'Pritchard James Doe', 'Helen Joanne Doe', 'Walker Jean', 'Hall Jensen'],
     'col2': ['First', 'First', 'Second', 'Second', 'Last', 'Last']}
df = pd.DataFrame(d)
cl = [(df['col2'] == 'First', lambda x: re.search('(^[A-Za-z]+)', x).group()),
      (df['col2'] == 'Second', lambda x: re.search(r'(?<=\s)([A-Za-z]+)', x).group()),
      (df['col2'] == 'Last', lambda x: re.search('([A-Za-z]+$)', x).group())]
df.assign(col3 = df['col1'].case_when(cl))

Solution

  • Use groupby with a dictionary of the patterns to extract:

    d = {'First': r'(^[A-Za-z]+)',
         'Second': r'(?<=\s)([A-Za-z]+)',
         'Last': r'([A-Za-z]+$)',
        }
    
    default_regex = r'()'
    
    df['col3'] = (df.groupby('col2', group_keys=False)['col1']
                    .apply(lambda x: x.str.extract(d.get(x.name, default_regex)))
                 )
    

    Less efficient alternative with numpy.select:

    lst = [lambda x: re.search(r'(^[A-Za-z]+)', x).group(),
           lambda x: re.search(r'(?<=\s)([A-Za-z]+)', x).group(),
           lambda x: re.search(r'([A-Za-z]+$)', x).group(),
          ]
    default_regex = lambda x: re.search('', x).group()
    
    df['col3'] = np.select(
        [df['col2'] == 'First', df['col2'] == 'Second', df['col2'] == 'Last'],
        [df['col1'].apply(f) for f in lst],
        df['col1'].apply(default_regex)
    )
    

    Output:

                      col1    col2    col3
    0           John Smith   First    John
    1           Jane Smith   First    Jane
    2  Pritchard James Doe  Second   James
    3     Helen Joanne Doe  Second  Joanne
    4          Walker Jean    Last    Jean
    5          Hall Jensen   Other