pandas, dataframe, dask, dask-dataframe

Subselect features in Dask Dataframe


I have a dask dataframe ddf with a matrix ddf['X'] and a list of indices indices. I want to select the features (columns) of ddf['X'] at the indices. My current implementation is

def subselect_variables(df):
    subset = df.iloc[:, indices]
    return subset

ddf_X = (
    ddf['X']
    .map_partitions(subselect_variables, meta={col: 'f4' for col in range(len(indices))})
)
ddf_X.to_parquet(
    my_path,
    engine='pyarrow',
    schema=my_schema,
    write_metadata_file=True,
    row_group_size=my_row_group_size,
)

But it results in the error pandas.errors.IndexingError: Too many indexers. Can someone help?

I also tried to directly select the features

ddf_X = (
    ddf['X']
    .map_partitions(lambda df: df.iloc[:, indices], meta={col: 'f4' for col in range(len(indices))})
)

This resulted in the same error. I also tried replacing : with slice(None), which also raised the same error.
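For reference, the root cause can be reproduced in plain pandas: a Series is one-dimensional, so .iloc accepts only a single (row) indexer, and supplying a second one raises the same exception. A minimal sketch with made-up values:

```python
import pandas as pd

# A pandas Series is 1-D: .iloc accepts only a single row indexer.
s = pd.Series([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

try:
    s.iloc[:, [0, 2]]  # two indexers on a 1-D object
except pd.errors.IndexingError as err:
    print(err)  # pandas reports: Too many indexers
```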


Solution

  • Thanks for your suggestions! They led me in the right direction. Indeed, if ddf['X'] is a Series, it must be treated as one-dimensional, so a two-axis .iloc lookup cannot work. You also need to pay attention to the meta argument and to defining the mapped function outside the call, as you might otherwise run out of memory. Here is the solution that worked:

    def subselect_variables(s):
        subset = s.map(lambda x: [x[i] for i in indices])
        return subset

    ddf_X = (
        ddf['X']
        .map_partitions(subselect_variables, meta=('X', 'f4'))
    )
    

    To write it to a parquet file, you also need to convert it to a dask DataFrame, e.g. like

    import dask.dataframe as dd

    if isinstance(ddf_X, dd.Series):
        ddf_X = ddf_X.to_frame(name='X')