I need to change the dtypes of some columns of two different DF in my code. They are originally a Spark DF, so directly when I convert them to pandas I don't have the option to set to category dtype (that's why I am doing as a secondary step, not when the DF is created).
The DF's have most, not all, of the columns in common. I decided to create only one function to handle both cases. I created a logic where I store the columns names and the dtypes as a dictionary and then I loop through the dict items to convert the columns of the dataframe.
The dataframes are large (around 500k rows, 30-40 columns). The best way I found so far was using .apply to perform the transformation.
Would there be a faster way of doing this instead of using .apply and lambda that I could try? Here's a simple example with the function I created:
def dtypes_preprocess(df):
cols_cat = ['A', 'B', 'C', 'D','E']
cols_float = ['F']
cols_numerical_int32 = ['G', 'H']
dict_dtypes = {'category' : cols_cat, 'float' : cols_float, 'int32' : cols_numerical_int32}
for data_type, columns_dict in dict_dtypes.items():
existing_cols = list(set(columns_dict).intersection(df.columns)) #check which columns from the list exist in the dataframe
df[existing_cols] = df[existing_cols].apply(lambda x: x.astype(data_type, errors = 'ignore')) #perform the transformation
return df
I was already able to improve a bit, my first attempt was like this:
for data_type, columns in dict_dtypes.items():
for col in columns:
if col in df.columns:
df[col] = df[col].astype(data_type, errors = 'ignore')
Any suggestion is appreciated. Thank you.
I think your attempt is not bad. As @jqurious mentioned in the comment (thanks a lot for that), it could be done by passing mapping name -> dtype
to df.astype
.
So we just need to "invert" the dict_dtypes
and choose only the column names present in df
My idea:
def dtypes_preprocess(df):
dict_dtypes = {
'category' : ['A', 'B', 'C', 'D','E'],
'float' : ['F'],
'int32' : ['G', 'H']
}
dtypes = {col: dtype for dtype in dict_dtypes for col in dict_dtypes[dtype] if col in df}
return df.astype(dtypes, errors='ignore')
The dictionary comprehension looks a bit harder to read, though -- any suggestions are welcome here.
(Idea for "inverting" the dict_dtypes
taken from this question.)