python dask

Sampling n=2000 from a Dask DataFrame of length 18000 raises "Cannot take a larger sample than population when 'replace=False'"


I have a Dask dataframe created from a CSV file, and len(daskdf) returns 18000. But when I run ddSample = daskdf.sample(2000), I get the error:

ValueError: Cannot take a larger sample than population when 'replace=False'

Can I sample without replacement if the dataframe is larger than the sample size?


Solution

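  • The sample method only supports the frac= keyword argument, so the 2000 passed in daskdf.sample(2000) was interpreted as frac=2000, a fraction of the rows rather than a row count. See the API documentation.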

    The error that you're getting is from Pandas, not Dask.

    In [1]: import pandas as pd
    In [2]: df = pd.DataFrame({'x': [1]})
    In [3]: df.sample(frac=2000, replace=False)
    ValueError: Cannot take a larger sample than population when 'replace=False'
    

    Solution 1

    As the Pandas error suggests, consider sampling with replacement:

    In [4]: df.sample(frac=2, replace=True)
    Out[4]: 
       x
    0  1
    0  1
    
    In [5]: import dask.dataframe as dd
    In [6]: ddf = dd.from_pandas(df, npartitions=1)
    In [7]: ddf.sample(frac=2, replace=True).compute()
    Out[7]: 
       x
    0  1
    0  1
    

    Solution 2

    This may help someone.

    I found this somewhere and cannot remember where.

    It returns a sample per group without raising the error. (This is for pandas; I have not tried it with Dask.)

    import pandas as pd
    
    df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                       'b': [1,1,1,2,2,3,3]})
    
    # Fixed sample size: raises an error if any group has fewer rows than n.
    df.groupby('b').apply(pd.DataFrame.sample, n=1)
    
    # Capped sample size: takes min(3, group size) rows from each group,
    # so it never raises and returns 3 rows or fewer per group.
    df.groupby(['b'], as_index=False, group_keys=False).apply(
        lambda x: x.sample(min(3, len(x)))
    )
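
An alternative way to get the same capped behaviour, offered as a sketch rather than as part of the original answer, is to shuffle all rows once and then keep the first few rows of each group. The max_per_group name and the random_state value are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 2, 2, 3, 3]})

max_per_group = 3  # illustrative cap, matching the 3 used above

# Shuffle every row once, then keep the first max_per_group rows per group.
shuffled = df.sample(frac=1, random_state=0)
capped = shuffled.groupby('b').head(max_per_group)

# Groups smaller than the cap are returned whole instead of raising an error.
counts = capped.groupby('b').size()
print(counts)
```

Because head() simply stops at the end of a short group, no min() guard is needed here.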