Tags: python-3.x, pandas, dataframe, numpy-ndarray, uniform-distribution

How to sample rows of a dataframe uniformly while maintaining the relative position of the rows?


I want to use pandas.DataFrame.sample to sample a given number of rows from a pandas dataframe uniformly. However, I want to make sure that the order of the selected rows does not contradict the order of those same rows in the original dataframe. I am not sure how to do that. The order of the rows has a physical meaning, and I want to preserve it. Maybe it is better to call this decimating the dataframe along its row axis rather than sampling it. What are your suggestions?

Note:

The original dataframe has 83 rows. I need to create two samples, one of 25 rows and one of 24 rows.


Solution

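  • df1 = original_df.sample(25)
    # keep only the rows not already in df1 (83 - 25 = 58 rows remain)
    df2 = original_df[~original_df.index.isin(df1.index)]
    # draw the second sample of 24 rows from those remaining rows
    df2 = df2.sample(24)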
    

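    The sampled dataframes (df1 and df2) keep the index values from the original dataframe. To restore the order the rows had in the original dataframe, sort each sample by its index. This works as long as the original index is itself in increasing order, as with the default integer index.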

    df1 = df1.sort_index()
    df2 = df2.sort_index()
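
A minimal end-to-end sketch of the approach above, in case it helps to see it run. The 83-row `original_df`, its `value` column, and the `random_state` seeds are hypothetical stand-ins, not part of the question. The last few lines show an alternative that samples row positions instead of index labels, which may be useful if the index is not sorted; that variant is an addition, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 83-row dataframe from the question.
original_df = pd.DataFrame({"value": np.arange(83) * 10})

# Sample 25 rows uniformly without replacement, then restore row order.
df1 = original_df.sample(n=25, random_state=0).sort_index()

# Sample 24 rows from the 58 rows not in df1, then restore row order.
remaining = original_df.drop(df1.index)
df2 = remaining.sample(n=24, random_state=1).sort_index()

# Alternative (assumption: the index may be unsorted or non-unique):
# sample row positions, sort them, and select with iloc.
rng = np.random.default_rng(0)
positions = np.sort(rng.choice(len(original_df), size=25, replace=False))
df1_by_position = original_df.iloc[positions]
```

Both samples are disjoint and keep the relative order of the original rows. The seeds are only there to make the draw reproducible; drop them for a fresh sample on each run.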