python-3.xpandasdataframeinteger

Recasting values in Pandas dataframe that meet specific requirements


I have a Pandas dataframe that has been created by importing from an Excel spreadsheet. When using the .dtypes method, the column appears to be of data type object. However, within the single column, there are data of multiple types such as str, int and float. There may also be missing values (NaN). Some of the string values are actually string representations of integer values, specifically, in this case, 4-digit years. I would like to be able to recast string values that consist of 4 digits to integers but leave all the other values (and datatypes) unchanged.

As an example, a minimal dataframe might look like:

import numpy as np
import pandas as pd
import re

testdf = pd.DataFrame({'col1':['abc','2023',456,789,'2021',4.5,'123',np.nan]})
   col1
0   abc
1  2023
2   456
3   789
4  2021
5   4.5
6   123
7   NaN

with dtype:

col1    object
dtype: object

However, data types in individual cells vary:

testdf['col1_types'] = testdf['col1'].apply(type)

   col1       col1_types
0   abc    <class 'str'>
1  2023    <class 'str'>
2   456    <class 'int'>
3   789    <class 'int'>
4  2021    <class 'str'>
5   4.5  <class 'float'>
6   123    <class 'str'>
7   NaN  <class 'float'>

The solution I've come up with involves several steps as shown below (with outcomes of individual steps added as new columns in the dataframe for clarity) but the process seems extremely clunky. Intuitively, I think there should be a much easier way of doing it - perhaps even as a one-liner - but I haven't been able to get the syntax worked out. The steps I've used are as follows:

Step 1 - Create a mask indicating which cells contain strings

strmask = testdf['col1'].apply(type) == str
testdf['strmask'] = strmask

   col1       col1_types  strmask
0   abc    <class 'str'>     True
1  2023    <class 'str'>     True
2   456    <class 'int'>    False
3   789    <class 'int'>    False
4  2021    <class 'str'>     True
5   4.5  <class 'float'>    False
6   123    <class 'str'>     True
7   NaN  <class 'float'>    False

Step 2 - Test if cells containing strings match regex '20\d{2}'; if so, recast as int, otherwise leave cell unchanged

testdf['col2'] = testdf.loc[phjStrMask,'col1'].apply(lambda c: int(c) if re.match('20\d{2}',c) else c)

   col1       col1_types  strmask  col2
0   abc    <class 'str'>     True   abc
1  2023    <class 'str'>     True  2023
2   456    <class 'int'>    False   NaN
3   789    <class 'int'>    False   NaN
4  2021    <class 'str'>     True  2021
5   4.5  <class 'float'>    False   NaN
6   123    <class 'str'>     True   123
7   NaN  <class 'float'>    False   NaN

Step 3 - Cells which originally contained something other than a string are currently represented as NaN in col2. Create a mask indicating which cells in col2 contain NaN and then replace with original content (in col1)

nanmask = testdf['col2'].isnull()
testdf.loc[nanmask,'col2'] = testdf['col1']

   col1       col1_types  strmask  col2
0   abc    <class 'str'>     True   abc
1  2023    <class 'str'>     True  2023
2   456    <class 'int'>    False   456
3   789    <class 'int'>    False   789
4  2021    <class 'str'>     True  2021
5   4.5  <class 'float'>    False   4.5
6   123    <class 'str'>     True   123
7   NaN  <class 'float'>    False   NaN

Step 4 - Confirm that the data types in the new column are as expected

testdf['col2_types'] = testdf['col2'].apply(type)

   col1       col1_types  strmask  col2       col2_types
0   abc    <class 'str'>     True   abc    <class 'str'>
1  2023    <class 'str'>     True  2023    <class 'int'>
2   456    <class 'int'>    False   456    <class 'int'>
3   789    <class 'int'>    False   789    <class 'int'>
4  2021    <class 'str'>     True  2021    <class 'int'>
5   4.5  <class 'float'>    False   4.5  <class 'float'>
6   123    <class 'str'>     True   123    <class 'str'>
7   NaN  <class 'float'>    False   NaN  <class 'float'>

The above process seems to work but is long-winded and clunky. Is there a way to directly recast as int the string values in col1 that match the regex without having to go through a series of intermediate steps?


Solution

  • Actually you were very close. Just replace c by str(c) in your apply and it is done in 1 line:

    testdf = pd.DataFrame({'col1':['abc','2023',456,789,'2021',4.5,'123',np.nan]})
    testdf['col2'] = testdf['col1'].apply(lambda c: int(c) if re.match('20\d{2}',str(c)) else c)
    display(testdf['col2'].apply(type))
    
    0      <class 'str'>
    1      <class 'int'>
    2      <class 'int'>
    3      <class 'int'>
    4      <class 'int'>
    5    <class 'float'>
    6      <class 'str'>
    7    <class 'float'>