python pandas sampling power-law

Pandas: How to split a DataFrame column that follows a power law into two groups based on frequency distribution?


I have a DataFrame with 1 million records and 5 columns:

unique_index,name,company_name,city_id,state_id

The company_name column has 100k unique values and follows a power law: the top 5,000 company names cover 70% of the records.

[Image: Power law]

I want to take an equal number of samples from the records belonging to the top 5,000 companies and from the records belonging to the remaining companies.

I tried pd.qcut(df['company_name'], [0.25, 1]), which raised this error: TypeError: unorderable types: str() <= float(). Can qcut not be applied to strings?


Solution

  • You could take the top companies from value_counts() and then add a True/False column marking whether each row's company is in that top set. I think it would look something like this:

    top5000 = df['company_name'].value_counts().index[0:5000].tolist()
    df['InTop'] = df['company_name'].isin(top5000)
    

    You can then sample from the group where df['InTop'] is True and from the group where df['InTop'] is False.
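
As a rough sketch of that sampling step, the snippet below builds a small synthetic frame (the toy data, top_n, and n_per_group are assumptions for illustration, not values from the question) and draws the same number of rows from each InTop group. It relies on groupby(...).sample, which should be available in pandas 1.1 and later; on older versions, calling .sample on each filtered subset separately would be the equivalent.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 200 companies with Zipf-like frequencies,
# standing in for the 1M-row frame described in the question.
rng = np.random.default_rng(0)
n_companies = 200
weights = 1.0 / np.arange(1, n_companies + 1)
weights /= weights.sum()
names = np.array([f"company_{i}" for i in range(n_companies)])
df = pd.DataFrame({"company_name": rng.choice(names, size=10_000, p=weights)})

# Flag rows whose company is among the most frequent ones.
top_n = 20  # assumed for the toy data; the question uses 5000
top_companies = df["company_name"].value_counts().index[:top_n]
df["InTop"] = df["company_name"].isin(top_companies)

# Draw the same number of rows from each group (without replacement).
n_per_group = 500  # assumed sample size; must not exceed the smaller group
sample = df.groupby("InTop").sample(n=n_per_group, random_state=0)

print(sample["InTop"].value_counts())
```

Note that this samples rows, so within each group the more frequent companies are still more likely to appear. If the goal were an equal number of distinct companies instead, one would sample from the unique company names in each group.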