python, pandas, dask, dask-dataframe

Dask DataFrame - split multiple string columns based on delimiter into multiple rows simultaneously


I have a large Dask DataFrame that looks something like this:

CHROM  POS    ID           REF  ALT  Consequence        Ensembl_geneid   Ensembl_proteinid  Ensembl_transcriptid  etc
1      10000  1-10000-A-C  A    C    con11,con12,con13  gene11,.,gene13  prot11,.,prot13    tra11,.,tra13
1      11000  1-11000-A-G  A    G    con21              gene21           prot21             tra21
2      20000  2-20000-T-C  T    C    .,.,.,.,.          .,.,.,.,.        .,.,.,.,.          .,.,.,.,.
etc

All columns are deliberately read in as strings by passing a dtype mapping to Dask's read_csv. The actual data is variant annotation output generated by the VEP tool, with a few million rows & about 90-100 columns in total, if that helps.
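For context, here's a simplified sketch of that read step; the file name variants.tsv, the tab separator, and the short column list are placeholders for the real file's ~90-100 columns:

from dask import dataframe as ddf

# Placeholder column list; the real file has ~90-100 columns
str_cols = [
    "CHROM", "POS", "ID", "REF", "ALT", "Consequence",
    "Ensembl_geneid", "Ensembl_proteinid", "Ensembl_transcriptid",
]
dtypes_mapping = {col: str for col in str_cols}  # force every column to str

df = ddf.read_csv("variants.tsv", sep="\t", dtype=dtypes_mapping)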

Each variant (i.e., CHROM, POS, ID, REF, ALT) has one or more comma-separated values corresponding to different effects. The "dots" are strings the tool uses to represent missing values, which I'll later replace with proper NaN values.
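That later replacement step itself is straightforward; a sketch on a toy Series, just to show what I mean:

import pandas as pd

# The tool writes "." for missing values; swap them for real missing markers
s = pd.Series(["gene11", ".", "gene13"])
print(s.replace(".", pd.NA))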

There may be partial or complete missingness, but for each variant there will always be the same number of positional values in each of the comma-separated string columns (they're meant to match up); e.g., the third variant above has five values in every one of those columns.

I wanted to apply the Dask Series.str.split(pat=delim, n=-1, expand=False) & Dask DataFrame.explode(column=col_list) methods to convert the data to long form while maintaining the positional value match-up, something like this:

CHROM  POS    ID           REF  ALT  Consequence  Ensembl_geneid  Ensembl_proteinid  Ensembl_transcriptid  etc
1      10000  1-10000-A-C  A    C    con11        gene11          prot11             tra11
1      10000  1-10000-A-C  A    C    con12        .               .                  .
1      10000  1-10000-A-C  A    C    con13        gene13          prot13             tra13
1      11000  1-11000-A-G  A    G    con21        gene21          prot21             tra21
2      20000  2-20000-T-C  T    C    .            .               .                  .
2      20000  2-20000-T-C  T    C    .            .               .                  .
2      20000  2-20000-T-C  T    C    .            .               .                  .
2      20000  2-20000-T-C  T    C    .            .               .                  .
2      20000  2-20000-T-C  T    C    .            .               .                  .
etc

When I try the whole thing in Pandas on a test sample, I can see (using .iloc[] & type()) that after Pandas Series.str.split(), the DataFrame holds each individual value as a list[str]. This then allows Pandas DataFrame.explode() to work as intended.

However, I can't get the same to work in Dask. After Dask Series.str.split(), the Dask DataFrame seems to hold each individual value as a str rather than a list[str] ("['con11', 'con12', 'con13']" instead of ['con11', 'con12', 'con13']). This results in Dask DataFrame.explode() doing essentially nothing.
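A quick way to see the difference, assuming df is the frame after the split loop (sketch):

# Materialise the first split value and inspect it
first_val = df["Consequence"].head(1).iloc[0]
print(type(first_val), repr(first_val))
# Pandas: <class 'list'> ['con11', 'con12', 'con13']
# Dask:   <class 'str'>  "['con11', 'con12', 'con13']"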

This is the existing question I've been referring to, but I'm not sure its accepted solution would work in my case: Dask dataframe - split column into multiple rows based on delimiter

I can't tell if I'm missing something really obvious, or if it's a Dask design restriction. Any help debugging this would be greatly appreciated!


EDIT: Apologies for the delay @GuillaumeEB, I haven't been able to get back to this particular problem in the last week or so. Here's a small reproducer based on the above example:

### python = 3.12.1 (conda-forge)
### numpy = 1.26.3 (conda-forge)
### pandas = 2.1.4 (conda-forge)
### dask = 2023.12.1 (conda-forge)

import pandas as pd
from dask import dataframe as ddf

reqd_cols = ["Consequence", "Ensembl_geneid", "Ensembl_proteinid", "Ensembl_transcriptid"]

### Pandas implementation start, works as intended
df = pd.DataFrame({
    "CHROM": [1, 1, 2],
    "POS": [10000, 11000, 20000],
    "ID": ["1-10000-A-C", "1-11000-A-G", "2-20000-T-C"],
    "REF": ["A", "A", "T"],
    "ALT": ["C", "G", "C"],
    "Consequence": ["con11,con12,con13", "con21", ".,.,.,.,."],
    "Ensembl_geneid": ["gene11,.,gene13", "gene21", ".,.,.,.,."],
    "Ensembl_proteinid": ["prot11,.,prot13", "prot21", ".,.,.,.,."],
    "Ensembl_transcriptid": ["tra11,.,tra13", "tra21", ".,.,.,.,."]
})

for col in reqd_cols:
    df[col] = df[col].str.split(pat=",", expand=False)

df = df.explode(column=reqd_cols, ignore_index=True)

df.info(verbose=True)  # .info() prints directly & returns None, so no print() needed
print(df.head())
### Pandas implementation end

### Dask implementation start, does not work
df = ddf.from_pandas(
    data=pd.DataFrame({
        "CHROM": [1, 1, 2],
        "POS": [10000, 11000, 20000],
        "ID": ["1-10000-A-C", "1-11000-A-G", "2-20000-T-C"],
        "REF": ["A", "A", "T"],
        "ALT": ["C", "G", "C"],
        "Consequence": ["con11,con12,con13", "con21", ".,.,.,.,."],
        "Ensembl_geneid": ["gene11,.,gene13", "gene21", ".,.,.,.,."],
        "Ensembl_proteinid": ["prot11,.,prot13", "prot21", ".,.,.,.,."],
        "Ensembl_transcriptid": ["tra11,.,tra13", "tra21", ".,.,.,.,."]
    }),
    npartitions=1
)

for col in reqd_cols:
    df[col] = df[col].str.split(pat=",", n=-1, expand=False)

df = df.explode(column=reqd_cols)

df.info(verbose=True)  # as above, .info() prints directly
print(df.head())
### Dask implementation end

Solution

  • Since Dask 2023.7.1, Dask DataFrame automatically converts text data using object data types to string[pyarrow] if pandas>=2 and pyarrow>=12 are installed. With that conversion active, the lists produced by str.split() end up stored as pyarrow strings, which is why each value shows up as the stringified list and explode() then has nothing to work on.

    To avoid this, deactivate the conversion before creating your Dask DataFrame, just add:

    import dask
    dask.config.set({"dataframe.convert-string": False})

    before you create yours, and it fixes the problem.
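    For completeness, here's the reproducer from the question with the fix applied; same toy data and reqd_cols as above, only the config line is new:

    import dask
    import pandas as pd
    from dask import dataframe as ddf

    # Must be set *before* the Dask DataFrame is created
    dask.config.set({"dataframe.convert-string": False})

    reqd_cols = ["Consequence", "Ensembl_geneid", "Ensembl_proteinid", "Ensembl_transcriptid"]

    df = ddf.from_pandas(
        data=pd.DataFrame({
            "CHROM": [1, 1, 2],
            "POS": [10000, 11000, 20000],
            "ID": ["1-10000-A-C", "1-11000-A-G", "2-20000-T-C"],
            "REF": ["A", "A", "T"],
            "ALT": ["C", "G", "C"],
            "Consequence": ["con11,con12,con13", "con21", ".,.,.,.,."],
            "Ensembl_geneid": ["gene11,.,gene13", "gene21", ".,.,.,.,."],
            "Ensembl_proteinid": ["prot11,.,prot13", "prot21", ".,.,.,.,."],
            "Ensembl_transcriptid": ["tra11,.,tra13", "tra21", ".,.,.,.,."]
        }),
        npartitions=1
    )

    # str.split() now yields real list values, so explode() works as in pandas
    for col in reqd_cols:
        df[col] = df[col].str.split(pat=",", n=-1, expand=False)

    df = df.explode(column=reqd_cols)

    print(df.compute())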