python, dask, masked-array

Drop masked rows from Dask dataframe when saving to file


I have several large CSV files that I'm processing with Dask. I need to mask these files according to some condition and save the result without the masked rows. What I currently do is:

from dask import dataframe as dd
df = dd.read_csv(filename)
msk = ... # some condition
df = df.mask(msk).compute()
df.to_csv("{}_sample.csv".format(filename), index=False)

The masking works, but the resulting file still contains the masked rows as empty rows, e.g.:

...
18.702003,0.005,79.428,9.999001250124936,0.5203728231202968,0.2673634806190893,-0.58664254749603
19.102915,0.069,77.81,9.999238070973211,-0.6233755821087494,0.3886258651317274,-3.88229321744741
,,,,,,,,,,,,
,,,,,,,,,,,,
,,,,,,,,,,,,
,,,,,,,,,,,,
20.388945,0.08199999999999999,77.50999999999998,9.999336227970336,0.35936464745549523,1.23090232
,,,,,,,,,,,,
...

I've looked at the to_csv function, but I don't see an option to drop these empty/masked rows.


Solution

  • There is no need to call .compute() (which pulls the entire masked dataframe into memory). The problem is that .mask() replaces the values in masked rows with NaN but keeps the rows themselves; to drop the rows entirely, use standard pandas-style boolean indexing, which Dask supports lazily:

    df = df[~msk]  # boolean indexing drops the masked rows; no compute() needed
    df.to_csv("{}_sample_*.csv".format(filename), index=False)  # the * expands to one file per partition
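The difference between .mask() and boolean indexing can be seen in plain pandas, whose semantics Dask's dataframe mirrors. A minimal sketch with an illustrative condition (column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
msk = df["a"] > 1.5          # example condition marking rows to remove

masked = df.mask(msk)        # rows where msk is True become all-NaN but remain
subset = df[~msk]            # rows where msk is True are dropped entirely

print(len(masked))           # still 3 rows -- this is why the CSV had empty lines
print(len(subset))           # 1 row -- the masked rows are gone

# Equivalently, an already-masked frame can be cleaned up afterwards:
recovered = masked.dropna(how="all")
print(len(recovered))        # 1 row, same as boolean indexing
```

If the masked frame has already been written out, dropna(how="all") is the post-hoc fix; otherwise, indexing with df[~msk] avoids creating the empty rows in the first place.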