I have a data frame of 1 million records with 5 columns:
unique_index,name,company_name,city_id,state_id
The column company_name has 100k unique values. It follows a power law: the top 5000 company_names cover 70% of the records.
I want to take an equal number of samples from the rows belonging to the top 5000 companies and from the rows belonging to the remaining companies.
I tried pd.qcut(df['company_name'], [0.25, 1]), which gave me the error below:
TypeError: unorderable types: str() <= float()
Can qcut not be applied to strings?
You could try grabbing the top companies with value_counts() and then creating a new column that is True/False depending on whether the row's company is in or out of the top companies. I think it would look something like this:
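# value_counts() sorts by frequency, most common first, so the first 5000 index entries are the top companies
top5000 = df['company_name'].value_counts().index[0:5000].tolist()
# Flag each row: True if its company is one of the top 5000, False otherwise
df['InTop'] = df['company_name'].isin(top5000)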
This would allow you to sample from the group where df['InTop'] == True and from the group where df['InTop'] == False.
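If it helps, here is a minimal sketch of the sampling step itself, run on made-up data. The toy data frame, top_n (standing in for 5000), n_per_group, and the random_state value are all assumptions for illustration and are not from your data. It flags the rows as above and then draws the same number of rows from each group with DataFrame.sample:

```python
import random

import pandas as pd

# Hypothetical toy data: 200 companies with skewed (roughly power-law) frequencies.
random.seed(0)
names = [f"company_{i}" for i in range(200)]
weights = [1.0 / (i + 1) for i in range(200)]
df = pd.DataFrame(
    {"company_name": random.choices(names, weights=weights, k=10_000)}
)

top_n = 10         # assumption: stands in for the 5000 in the question
n_per_group = 500  # assumption: how many rows to draw from each group

# Flag rows whose company is among the top_n most frequent ones.
top = df["company_name"].value_counts().index[:top_n]
df["InTop"] = df["company_name"].isin(top)

# Never ask for more rows than the smaller group contains.
n = int(min(n_per_group, df["InTop"].sum(), (~df["InTop"]).sum()))

# Draw n rows from each group and stack the two samples.
sample = pd.concat(
    [
        df[df["InTop"]].sample(n=n, random_state=0),
        df[~df["InTop"]].sample(n=n, random_state=0),
    ]
)
```

Note that this samples rows, not companies, so frequent companies will still appear more often inside each group. If you want one draw per company instead, you would need to sample from the unique company names first.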