pythonlambdaapplypy-datatable

Python datatable - apply lambda to multiple columns


I am looking to apply a function to multiple columns to a datatable in Python. With R's data.table one would:

# columns to apply function to
x <- c('col_1', 'col_2')

# apply
df[, (x) := lapply(.SD, function(x) as.Date(x, "%Y-%m-%d")), .SDcols=x]

How would one do the same using Python's datatable? I have some knowledge of apply and lambda with pandas e.g.:

# create dummy data
df = pd.DataFrame({'col_1': ['2021-12-01']
                   , 'col_2': ['2021-12-02']
                   , 'col_3': ['foobar']
                   }
                  )

# columns to apply function to
x = ['col_1', 'col_2']

# apply
df[x] = df[x].apply(lambda x: pd.to_datetime(x, format='%Y-%m-%d'))

but what is its equivalent in Python's datatable? This is assuming I insist on the use of apply and lambda. Thank you.

edit* I have changed from an UDF to a standard function pd.to_datetime as some of us mentioned the former is not possible while the latter is. Feel free to use any examples to illustrate apply in conjunction with datatable. Thank you


Solution

  • I recently made a PR showing ways to transform columns in datatable; it should be merged soon. Please feel free to comment and update it.

    To the question, you can directly assign, and also use the update method:

    from datatable import dt, f, update, Type, as_type
    
    DT0 = dt.Frame({'col_1': ['2021-12-01']
                       , 'col_2': ['2021-12-02']
                       , 'col_3': ['foobar']
                       }
                      )
    
    cols = ['col_1', 'col_2']
    
    DT0
       | col_1       col_2       col_3 
       | str32       str32       str32 
    -- + ----------  ----------  ------
     0 | 2021-12-01  2021-12-02  foobar
    [1 row x 3 columns]
    
    

    Via reassignment:

    DT = DT0.copy()
    
    DT[:, cols] = DT[:, as_type(f[cols], Type.date32)]
    
    DT
       | col_1       col_2       col_3 
       | date32      date32      str32 
    -- + ----------  ----------  ------
     0 | 2021-12-01  2021-12-02  foobar
    [1 row x 3 columns]
    

    With Direct assignment, you can assign the f-expression to the column; this only works for single assignment:

    DT = DT0.copy()
    
    DT['col_1'] = as_type(f.col_1, Type.date32)
    
    DT['col_2'] = as_type(f.col_2, Type.date32)
    
    DT
     
       | col_1       col_2       col_3 
       | date32      date32      str32 
    -- + ----------  ----------  ------
     0 | 2021-12-01  2021-12-02  foobar
    [1 row x 3 columns]
    
    

    The update function works as well; I like the feature, especially for SQL window like operations, where I do not want the the order of the columns to change (datatable sorts when performing a groupby):

    DT = DT0.copy()
    
    DT[:, update(col_1 = dt.as_type(f.col_1, Type.date32), 
                 col_2 = dt.as_type(f.col_2, Type.date32))]
    DT
       | col_1       col_2       col_3 
       | date32      date32      str32 
    -- + ----------  ----------  ------
     0 | 2021-12-01  2021-12-02  foobar
    [1 row x 3 columns]
    

    Note that update is in-place; no re-assignment is required. For multiple columns, a dictionary can help to automate the process:

    columns = {col : as_type(f[col], Type.date32) for col in cols}
    
    print(columns)
    {'col_1': FExpr<as_type(f['col_1'], date32)>,
     'col_2': FExpr<as_type(f['col_2'], date32)>}
    
    # unpack the dictionary within the datatable brackets
    DT = DT0.copy()
    DT[:, update(**columns)]
    
    DT
       | col_1       col_2       col_3 
       | date32      date32      str32 
    -- + ----------  ----------  ------
     0 | 2021-12-01  2021-12-02  foobar
    [1 row x 3 columns]