I have a Pandas dataframe that has been created by importing from an Excel spreadsheet. When using the .dtypes
method, the column appears to be of data type object
. However, within the single column, there are data of multiple types such as str
, int
and float
. There may also be missing values (NaN
). Some of the string values are actually string representations of integer values, specifically, in this case, 4-digit years. I would like to be able to recast string values that consist of 4 digits to integers but leave all the other values (and datatypes) unchanged.
As an example, a minimal dataframe might look like:
import numpy as np
import pandas as pd
import re
testdf = pd.DataFrame({'col1':['abc','2023',456,789,'2021',4.5,'123',np.nan]})
col1
0 abc
1 2023
2 456
3 789
4 2021
5 4.5
6 123
7 NaN
with dtype:
col1 object
dtype: object
However, data types in individual cells vary:
testdf['col1_types'] = testdf['col1'].apply(type)
col1 col1_types
0 abc <class 'str'>
1 2023 <class 'str'>
2 456 <class 'int'>
3 789 <class 'int'>
4 2021 <class 'str'>
5 4.5 <class 'float'>
6 123 <class 'str'>
7 NaN <class 'float'>
The solution I've come up with involves several steps as shown below (with outcomes of individual steps added as new columns in the dataframe for clarity) but the process seems extremely clunky. Intuitively, I think there should be a much easier way of doing it - perhaps even as a one-liner - but I haven't been able to get the syntax worked out. The steps I've used are as follows:
Step 1 - Create a mask indicating which cells contain strings
strmask = testdf['col1'].apply(type) == str
testdf['strmask'] = strmask
col1 col1_types strmask
0 abc <class 'str'> True
1 2023 <class 'str'> True
2 456 <class 'int'> False
3 789 <class 'int'> False
4 2021 <class 'str'> True
5 4.5 <class 'float'> False
6 123 <class 'str'> True
7 NaN <class 'float'> False
Step 2 - Test if cells containing strings match regex '20\d{2}'
; if so, recast as int
, otherwise leave cell unchanged
testdf['col2'] = testdf.loc[phjStrMask,'col1'].apply(lambda c: int(c) if re.match('20\d{2}',c) else c)
col1 col1_types strmask col2
0 abc <class 'str'> True abc
1 2023 <class 'str'> True 2023
2 456 <class 'int'> False NaN
3 789 <class 'int'> False NaN
4 2021 <class 'str'> True 2021
5 4.5 <class 'float'> False NaN
6 123 <class 'str'> True 123
7 NaN <class 'float'> False NaN
Step 3 - Cells which originally contained something other than a string are currently represented as NaN
in col2
. Create a mask indicating which cells in col2
contain NaN
and then replace with original content (in col1
)
nanmask = testdf['col2'].isnull()
testdf.loc[nanmask,'col2'] = testdf['col1']
col1 col1_types strmask col2
0 abc <class 'str'> True abc
1 2023 <class 'str'> True 2023
2 456 <class 'int'> False 456
3 789 <class 'int'> False 789
4 2021 <class 'str'> True 2021
5 4.5 <class 'float'> False 4.5
6 123 <class 'str'> True 123
7 NaN <class 'float'> False NaN
Step 4 - Confirm that the data types in the new column are as expected
testdf['col2_types'] = testdf['col2'].apply(type)
col1 col1_types strmask col2 col2_types
0 abc <class 'str'> True abc <class 'str'>
1 2023 <class 'str'> True 2023 <class 'int'>
2 456 <class 'int'> False 456 <class 'int'>
3 789 <class 'int'> False 789 <class 'int'>
4 2021 <class 'str'> True 2021 <class 'int'>
5 4.5 <class 'float'> False 4.5 <class 'float'>
6 123 <class 'str'> True 123 <class 'str'>
7 NaN <class 'float'> False NaN <class 'float'>
The above process seems to work but is long-winded and clunky. Is there a way to directly recast as int
the string values in col1
that match the regex without having to go through a series of intermediate steps?
Actually you were very close. Just replace c by str(c) in your apply and it is done in 1 line:
testdf = pd.DataFrame({'col1':['abc','2023',456,789,'2021',4.5,'123',np.nan]})
testdf['col2'] = testdf['col1'].apply(lambda c: int(c) if re.match('20\d{2}',str(c)) else c)
display(testdf['col2'].apply(type))
0 <class 'str'>
1 <class 'int'>
2 <class 'int'>
3 <class 'int'>
4 <class 'int'>
5 <class 'float'>
6 <class 'str'>
7 <class 'float'>