python pandas uppercase abbreviation

Remove space between abbreviated letters in a string column


I have a pandas DataFrame as follows:

import pandas as pd
import numpy as np

d = {'col1': ['I called the c. i. a', 'the house is e. m',
 'this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)

I have removed the punctuation and the spaces between abbreviated letters:

df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)

The output is, for example, 'I called the cia', but what I would like to get is 'I called the CIA'. So essentially I would like the abbreviations to be upper-cased. I tried the following, but it had no effect:

df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'.upper(),'')

or

df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)',''.upper())

Solution

  • pandas.Series.str.replace accepts a callable as its second argument, following the same rules as the replacement function of re.sub: it is passed the re.Match object and must return the replacement string. Using that, you can first uppercase your abbreviations as follows:

    import pandas as pd
    def make_upper(m):  # where m is re.Match object
        return m.group(0).upper()
    d = {'col1': ['I called the c. i. a', 'the house is e. m', 'this is an e. u. call!','how is the p. o. r going?']}
    df = pd.DataFrame(data=d)
    # regex=True is needed: a callable replacement only works in regex mode
    df['col1'] = df['col1'].str.replace(r'\b\w\.?\b', make_upper, regex=True)
    print(df)
    

    output

                            col1
    0       I called the C. I. A
    1          the house is E. M
    2     this is an E. U. call!
    3  how is the P. O. R going?
    

    You can then process the result further using the code you already had:

    df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
    df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)
    print(df)
    

    output

                   col1
    0      I called the CIA
    1       the house is EM
    2    this is an EU call
    3  how is the POR going
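
For context, the two attempts in the question probably had no effect because `.upper()` is applied to the pattern or to the empty replacement string before `str.replace` ever sees the data, so it never touches the matched text. A small sketch with plain `re` illustrates this; the variable names `pattern`, `upper_pattern`, and `text` are only illustrative:

```python
import re

pattern = r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'

# ''.upper() is evaluated first and is still the empty string,
# so passing it is the same as passing ''.
assert ''.upper() == ''

# Uppercasing the pattern turns \b, \w, \s into \B, \W, \S,
# which is a different regular expression altogether.
upper_pattern = pattern.upper()
print(upper_pattern)

text = 'I called the c i a'
print(re.sub(pattern, '', text))        # spaces between single letters removed
print(re.sub(upper_pattern, '', text))  # the uppercased pattern finds nothing here
```

Uppercasing has to happen on the matched text itself, which is what the callable replacement does.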
    

    You might want to refine the pattern I used (r'\b\w\.?\b') if you encounter cases it does not cover. It uses word boundaries (\b) and a literal dot (\.), so as written it matches any single word character (\w) standing on its own, optionally (?) followed by a dot. Note that this also uppercases ordinary one-letter words such as 'a'.
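
One possible refinement, offered as a sketch rather than a definitive fix, is to match a whole abbreviation at once and both join and uppercase it in the callable. The pattern `ABBREVIATION` and the helpers `join_upper` and `normalize_abbreviations` below are hypothetical names and are not part of the original answer. The pattern assumes abbreviations consist of two or more single ASCII letters separated by whitespace.

```python
import re

# Assumed pattern (not from the original answer): two or more single ASCII
# letters, each optionally followed by a dot, separated by whitespace.
ABBREVIATION = r'\b[A-Za-z]\b\.?(?:\s+[A-Za-z]\b\.?)+'

def join_upper(m):
    # m is an re.Match object; keep only its letters and uppercase them.
    return re.sub(r'[^A-Za-z]', '', m.group(0)).upper()

def normalize_abbreviations(text):
    # Hypothetical helper: collapse abbreviations first, then strip punctuation.
    text = re.sub(ABBREVIATION, join_upper, text)
    return re.sub(r'[^\w\s]', '', text)

samples = ['I called the c. i. a', 'the house is e. m',
           'this is an e. u. call!', 'how is the p. o. r going?']
results = [normalize_abbreviations(s) for s in samples]
print(results)
```

On the DataFrame this could be applied with df['col1'] = df['col1'].map(normalize_abbreviations). Because at least two letters are required, a lone 'a' is left alone, but two adjacent one-letter words (as in 'am I a fool') would still be joined into 'IA', so the pattern may need further tightening for real data.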