I have a pandas DataFrame as follows:
import pandas as pd
import numpy as np
d = {'col1': ['I called the c. i. a', 'the house is e. m',
'this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)
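I have removed the punctuation and the spaces between the abbreviated letters:
# regex=True: the patterns are regular expressions (the default is regex=False since pandas 2.0)
df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)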
The output is, e.g., 'I called the cia', but what I would like to get is 'I called the CIA'. So essentially I would like the abbreviations to be upper-cased. I tried the following, but got no results:
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'.upper(),'')
or
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)',''.upper())
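Neither attempt can work, because .upper() is applied to the arguments before str.replace ever sees the data. In the first case it upper-cases the pattern text, in the second the empty replacement string. A minimal sketch with the standard re module shows this (the sample string is made up for illustration and stands in for a row after punctuation removal):

```python
import re

pattern = r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'

# Attempt 1: .upper() rewrites the pattern text itself, so \b, \w and \s
# become \B, \W and \S, which is a different regular expression.
print(pattern.upper())

# Attempt 2: upper-casing an empty string still gives an empty string,
# so the replacement is unchanged and the matched text stays lower case.
print(repr(''.upper()))

# Illustrative sample (not taken from the DataFrame): the letters are
# joined, but nothing upper-cases them.
joined = re.sub(pattern, ''.upper(), 'I called the c i a')
print(joined)
```

To change the case of the matched text, the replacement has to be computed per match, which is what a callable replacement does.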
pandas.Series.str.replace allows its 2nd argument to be a callable that meets the requirements of the 2nd argument of re.sub. Using that, you might first uppercase your abbreviations as follows:
import pandas as pd
def make_upper(m):  # m is an re.Match object
    return m.group(0).upper()
d = {'col1': ['I called the c. i. a', 'the house is e. m', 'this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)
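# regex=True: the pattern is a regular expression (the default is regex=False since pandas 2.0)
df['col1'] = df['col1'].str.replace(r'\b\w\.?\b', make_upper, regex=True)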
print(df)
output
col1
0 I called the C. I. A
1 the house is E. M
2 this is an E. U. call!
3 how is the P. O. R going?
which you can then process further using the code you already had:
df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)
print(df)
output
col1
0 I called the CIA
1 the house is EM
2 this is an EU call
3 how is the POR going
You might elect to improve the pattern I used (r'\b\w\.?\b') if you encounter cases which it does not cover. I used word boundaries and a literal dot (\.), so as is it finds any single word character (\w) optionally (?) followed by a dot.
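One possible refinement, offered only as a sketch: match a whole dotted abbreviation in one go and let the callable both join and upper-case its letters. The names ABBREV, collapse_abbreviation and normalize are hypothetical, and the pattern assumes abbreviations are written as single ASCII letters separated by a dot and optional whitespace. It uses plain re so it can be tried without a DataFrame.

```python
import re

# Hypothetical pattern: one or more "letter, dot, optional whitespace"
# groups followed by a final single letter and an optional trailing dot.
ABBREV = re.compile(r'\b(?:[A-Za-z]\.\s*)+[A-Za-z]\b\.?')

def collapse_abbreviation(m):
    # Keep only the letters of the matched abbreviation and upper-case them.
    return ''.join(ch for ch in m.group(0) if ch.isalpha()).upper()

def normalize(text):
    # Replace every dotted abbreviation in text with its collapsed form.
    return ABBREV.sub(collapse_abbreviation, text)

for s in ['I called the c. i. a', 'the house is e. m',
          'this is an e. u. call!', 'how is the p. o. r going?']:
    print(normalize(s))
```

Unlike the two-step version, this leaves other punctuation (such as ! and ?) in place and does not touch standalone single letters, but it would also merge a sentence that ends in a single letter with a following single-letter word. Since str.replace accepts a compiled pattern, the same pair should be usable as df['col1'].str.replace(ABBREV, collapse_abbreviation, regex=True).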