I have a Dask DataFrame ddf with a matrix column ddf['X'] and a list of indices indices. I want to select the features (columns) of ddf['X'] at those indices. My current implementation is:
def subselect_variables(df):
    subset = df.iloc[:, indices]
    return subset

ddf_X = (
    ddf['X']
    .map_partitions(subselect_variables, meta={col: 'f4' for col in range(len(indices))})
)
ddf_X.to_parquet(
    my_path,
    engine='pyarrow',
    schema=my_schema,
    write_metadata_file=True,
    row_group_size=my_row_group_size,
)
But it results in the error pandas.errors.IndexingError: Too many indexers. Can someone help?
I also tried to select the features directly:

ddf_X = (
    ddf['X']
    .map_partitions(lambda df: df.iloc[:, indices], meta={col: 'f4' for col in range(len(indices))})
)
This resulted in the same error. I also tried replacing : with slice(None), which again resulted in the same error.
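For reference, the same error shows up in plain pandas with a toy example (a minimal sketch, data made up):

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
s.iloc[:, [0]]  # raises pandas.errors.IndexingError: Too many indexers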
Thanks for your suggestions! They led me in the right direction. Indeed, if ddf['X'] is a Series, it must be treated as 1-dimensional, so 2-D indexing like .iloc[:, indices] cannot work on its partitions. You also need to pay attention to the meta assignment and to the function you map, as you might otherwise run out of memory. Here is the solution that worked:
def subselect_variables(s):
    # Each partition is a pandas Series whose values are array-like rows;
    # pick the requested feature indices from every row.
    subset = s.map(lambda x: [x[i] for i in indices])
    return subset

ddf_X = (
    ddf['X']
    .map_partitions(subselect_variables, meta=('X', 'f4'))
)
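To illustrate what the mapped function does on a single partition, here is a quick check with made-up data (part and its values are hypothetical; each row of 'X' is assumed to hold an array-like of features):

import pandas as pd

indices = [0, 2]
part = pd.Series([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]], name='X')

print(subselect_variables(part))
# 0    [0.1, 0.3]
# 1    [0.5, 0.7]
# Name: X, dtype: object

The result is again a Series named 'X', which is why meta uses the Series form ('X', 'f4').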
To write it to a parquet file, you also need to cast it to a Dask DataFrame, e.g. like:

import dask.dataframe as dd

if isinstance(ddf_X, dd.Series):
    ddf_X = ddf_X.to_frame(name='X')
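With the cast in place, the parquet write from the question can then be reused unchanged (same my_path, my_schema, and my_row_group_size placeholders):

ddf_X.to_parquet(
    my_path,
    engine='pyarrow',
    schema=my_schema,
    write_metadata_file=True,
    row_group_size=my_row_group_size,
)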