I have a Pandas DataFrame / Polars DataFrame / PyArrow Table with a string key column. You can assume the strings are random. I want to partition that dataframe into N smaller dataframes based on this key column.
With an integer column, I can just use df1 = df[df.key % N == 1], df2 = df[df.key % N == 2], etc.
My best guess at how to do this with a string column is to apply a hash function (e.g. summing the ASCII values of the string) to convert it to an integer column, and then take the modulus.
What's the most efficient way to do this in Pandas, Polars, or PyArrow, ideally with pure columnar operations within the API? Doing a df.apply is likely too slow for my use case.
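To make that concrete, here is a rough sketch of the naive version I have in mind (assuming pandas; the column and variable names are just for illustration, and the .map call is exactly the row-wise work I'd like to avoid):

import pandas as pd

df = pd.DataFrame({"key": ["alice", "bob", "carol", "dave"], "value": [1, 2, 3, 4]})

N = 4
# Sum the ordinal values of each character, then bucket with the modulus.
bucket = df["key"].map(lambda s: sum(map(ord, s)) % N)

# One smaller DataFrame per bucket.
partitions = [df[bucket == i] for i in range(N)]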
I would try using hash_rows to see how it performs on your dataset and computing platform. (Note that in the calculation below, I'm effectively selecting only the key field and running hash_rows on that.)
import polars as pl

N = 50
df = df.with_columns(
    # Hash only the 'key' column, then bucket the hashes into N groups.
    pl.lit(df.select('key').hash_rows() % N).alias('hash')
)
I just ran this on a dataset with almost 49 million records on a 32-core system, and it completed within seconds. (The 'key' field in my dataset was last names of people.)
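With that hash column in place, the N partitions can be pulled out with plain filters, for example (a minimal sketch reusing the df and N from above):

# One filtered DataFrame per bucket value 0..N-1.
partitions = [df.filter(pl.col('hash') == i) for i in range(N)]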
I should also note that there's a partition_by method that may be of help with the partitioning.
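For example (a sketch, continuing from the hash column computed above):

# One DataFrame per distinct hash bucket, i.e. up to N smaller frames.
partitions = df.partition_by('hash')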