I am looking to apply a function to multiple columns to a datatable
in Python. With R
's data.table
one would:
# columns to apply function to
x <- c('col_1', 'col_2')
# apply
df[, (x) := lapply(.SD, function(x) as.Date(x, "%Y-%m-%d")), .SDcols=x]
How would one do the same using Python's datatable
? I have some knowledge of apply
and lambda
with pandas
e.g.:
# create dummy data
df = pd.DataFrame({'col_1': ['2021-12-01']
, 'col_2': ['2021-12-02']
, 'col_3': ['foobar']
}
)
# columns to apply function to
x = ['col_1', 'col_2']
# apply
df[x] = df[x].apply(lambda x: pd.to_datetime(x, format='%Y-%m-%d'))
but what is its equivalent in Python's datatable
? This is assuming I insist on the use of apply
and lambda
. Thank you.
edit* I have changed from an UDF to a standard function pd.to_datetime
as some of us mentioned the former is not possible while the latter is. Feel free to use any examples to illustrate apply
in conjunction with datatable
. Thank you
I recently made a PR showing ways to transform columns in datatable
; it should be merged soon. Please feel free to comment and update it.
To the question, you can directly assign, and also use the update
method:
from datatable import dt, f, update, Type, as_type
DT0 = dt.Frame({'col_1': ['2021-12-01']
, 'col_2': ['2021-12-02']
, 'col_3': ['foobar']
}
)
cols = ['col_1', 'col_2']
DT0
| col_1 col_2 col_3
| str32 str32 str32
-- + ---------- ---------- ------
0 | 2021-12-01 2021-12-02 foobar
[1 row x 3 columns]
Via reassignment:
DT = DT0.copy()
DT[:, cols] = DT[:, as_type(f[cols], Type.date32)]
DT
| col_1 col_2 col_3
| date32 date32 str32
-- + ---------- ---------- ------
0 | 2021-12-01 2021-12-02 foobar
[1 row x 3 columns]
With Direct assignment, you can assign the f-expression to the column; this only works for single assignment:
DT = DT0.copy()
DT['col_1'] = as_type(f.col_1, Type.date32)
DT['col_2'] = as_type(f.col_2, Type.date32)
DT
| col_1 col_2 col_3
| date32 date32 str32
-- + ---------- ---------- ------
0 | 2021-12-01 2021-12-02 foobar
[1 row x 3 columns]
The update
function works as well; I like the feature, especially for SQL window like operations, where I do not want the the order of the columns to change (datatable sorts when performing a groupby):
DT = DT0.copy()
DT[:, update(col_1 = dt.as_type(f.col_1, Type.date32),
col_2 = dt.as_type(f.col_2, Type.date32))]
DT
| col_1 col_2 col_3
| date32 date32 str32
-- + ---------- ---------- ------
0 | 2021-12-01 2021-12-02 foobar
[1 row x 3 columns]
Note that update
is in-place; no re-assignment is required.
For multiple columns, a dictionary can help to automate the process:
columns = {col : as_type(f[col], Type.date32) for col in cols}
print(columns)
{'col_1': FExpr<as_type(f['col_1'], date32)>,
'col_2': FExpr<as_type(f['col_2'], date32)>}
# unpack the dictionary within the datatable brackets
DT = DT0.copy()
DT[:, update(**columns)]
DT
| col_1 col_2 col_3
| date32 date32 str32
-- + ---------- ---------- ------
0 | 2021-12-01 2021-12-02 foobar
[1 row x 3 columns]