pandas · dataframe · performance · dtype

Pandas: convert multiple columns' dtypes at once


I need to change the dtypes of some columns of two different DataFrames in my code. They originate as Spark DataFrames, so when I convert them to pandas I don't have the option to set the category dtype directly (that's why I'm doing this as a secondary step, not when the DataFrame is created).

The DataFrames have most, but not all, of their columns in common, so I decided to write a single function to handle both cases. I store the column names and dtypes in a dictionary and then loop through the dict items to convert the columns of the dataframe.

The dataframes are large (around 500k rows, 30-40 columns). The best way I found so far was using .apply to perform the transformation.

Is there a faster way of doing this that I could try, instead of using .apply with a lambda? Here's a simple example with the function I created:

def dtypes_preprocess(df):
  cols_cat = ['A', 'B', 'C', 'D', 'E']
  cols_float = ['F']
  cols_numerical_int32 = ['G', 'H']

  dict_dtypes = {'category': cols_cat, 'float': cols_float, 'int32': cols_numerical_int32}

  for data_type, columns in dict_dtypes.items():
    # check which columns from the list exist in the dataframe
    existing_cols = list(set(columns).intersection(df.columns))
    # perform the conversion
    df[existing_cols] = df[existing_cols].apply(lambda x: x.astype(data_type, errors='ignore'))

  return df
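
For reference, a minimal way to exercise the function on toy data could look like this (the column names are just the placeholders from the example above, not my real Spark-derived columns):

import pandas as pd

# toy frame with only some of the expected columns,
# mimicking the "not all columns in common" case
df = pd.DataFrame({
  'A': ['x', 'y', 'x'],
  'F': ['1.5', '2.0', '3.0'],
  'G': [1, 2, 3],
})

df = dtypes_preprocess(df)
print(df.dtypes)  # A -> category, F -> float64, G -> int32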

I was already able to improve it a bit; my first attempt looked like this:

for data_type, columns in dict_dtypes.items():
  for col in columns:
    if col in df.columns:
      df[col] = df[col].astype(data_type, errors='ignore')

Any suggestion is appreciated. Thank you.


Solution

  • I think your attempt is not bad. As @jqurious mentioned in the comments (thanks a lot for that), it can be done by passing a mapping of column name -> dtype to df.astype.

    So we just need to "invert" dict_dtypes and keep only the column names that are present in df.

    My idea:

    def dtypes_preprocess(df):
      dict_dtypes = {
        'category': ['A', 'B', 'C', 'D', 'E'],
        'float': ['F'],
        'int32': ['G', 'H']
      }
      # invert: column name -> dtype, keeping only the columns present in df
      dtypes = {col: dtype
                for dtype in dict_dtypes
                for col in dict_dtypes[dtype]
                if col in df}
      return df.astype(dtypes, errors='ignore')
    

    The dictionary comprehension looks a bit harder to read, though -- any suggestions are welcome here (one plain-loop variant is sketched below).

    (Idea for "inverting" the dict_dtypes taken from this question.)
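
    Since readability came up: a plain-loop variant that builds the same column -> dtype mapping is sketched here. It is intended to be behavior-equivalent to the version above, just more verbose:

    def dtypes_preprocess(df):
      dict_dtypes = {
        'category': ['A', 'B', 'C', 'D', 'E'],
        'float': ['F'],
        'int32': ['G', 'H']
      }
      # build the column -> dtype mapping with an explicit loop,
      # keeping only the columns that actually exist in df
      dtypes = {}
      for dtype, cols in dict_dtypes.items():
        for col in cols:
          if col in df:
            dtypes[col] = dtype
      return df.astype(dtypes, errors='ignore')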