I have a huge dataframe. Following a group_by operation, I have a column of lists of strings, one list per value of the first column. What I need is to quickly find the common strings between some particular i'th row and all the other rows. I could do this in Pandas by saving the grouped dataframe as a pickle file, but that solution was suboptimal because loading takes a very long time.
I then found polars to be promising, except that I cannot store a dataframe with a column of sets in any format it supports for quick loading. That leaves the alternative of storing the column as a list and quickly converting it to sets after loading from Parquet. (I faced the same problem with datatable and vaex too.)
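For reference, the list route does round-trip cleanly: list[str] is a native polars dtype, so Parquet can store it; it is only a column of Python sets that cannot be serialized. A minimal sketch of what I mean (the file name is just a placeholder):

import polars as pl

df = pl.DataFrame({'ColA': ['apple', 'orange'],
                   'ColB': [['boy', 'bamboo'], ['ball', 'bull']]})

# list[str] survives the Parquet round-trip losslessly and loads fast;
# the remaining cost is the list -> set conversion after loading.
df.write_parquet('grouped.parquet')   # placeholder file name
df = pl.read_parquet('grouped.parquet')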
The solution I found with polars was to use .map_elements, but it runs in a single thread and is very slow. The code I used was as follows:
import numpy as np
import polars as pl

df = pl.from_repr("""
┌────────┬────────┐
│ ColA   ┆ ColB   │
│ ---    ┆ ---    │
│ str    ┆ str    │
╞════════╪════════╡
│ apple  ┆ boy    │
│ orange ┆ ball   │
│ apple  ┆ bamboo │
│ orange ┆ bull   │
└────────┴────────┘
""")
df = df.lazy().group_by('ColA').agg('ColB').collect()
shape: (2, 2)
┌────────┬───────────────────┐
│ ColA   ┆ ColB              │
│ ---    ┆ ---               │
│ str    ┆ list[str]         │
╞════════╪═══════════════════╡
│ apple  ┆ ["boy", "bamboo"] │
│ orange ┆ ["ball", "bull"]  │
└────────┴───────────────────┘
df.with_columns(
    pl.col('ColB').map_elements(set)
)
shape: (2, 2)
┌────────┬───────────────────┐
│ ColA   ┆ ColB              │
│ ---    ┆ ---               │
│ str    ┆ object            │
╞════════╪═══════════════════╡
│ apple  ┆ {'boy', 'bamboo'} │
│ orange ┆ {'ball', 'bull'}  │
└────────┴───────────────────┘
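For context, once the column holds Python sets, the downstream computation I want is plain set intersection. A sketch against the toy frame above (idx is just an example index):

# Assuming the with_columns result above is assigned back to df:
sets = df.get_column('ColB').to_list()   # Object column -> the Python sets as-is
idx = 0                                  # example: the 'apple' row
common = [sets[idx] & s for s in sets]   # [{'bamboo', 'boy'}, set()]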
I found a discussion on using map_batches, but it works on whole Series only. Unlike that example, which worked on a per-element basis, when I used np.asarray to convert the lists to numpy arrays (to intersect them later), it also gave me an object column.
df.select(pl.all().map_batches(np.asarray))
shape: (2, 2)
┌────────┬──────────────────┐
│ ColA   ┆ ColB             │
│ ---    ┆ ---              │
│ str    ┆ object           │
╞════════╪══════════════════╡
│ apple  ┆ ['boy' 'bamboo'] │
│ orange ┆ ['ball' 'bull']  │
└────────┴──────────────────┘
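I suspect the object dtype here is numpy's doing rather than polars': ragged rows cannot form a rectangular array, so numpy falls back to dtype=object, and polars has no choice but to store that as an Object column. A standalone sketch of the effect (the data is made up):

import numpy as np

# Ragged sub-lists cannot form a rectangular 2-D array, so numpy
# produces a 1-D array of Python list objects instead.
ragged = np.asarray([['boy', 'bamboo'], ['ball', 'bull', 'bin']], dtype=object)
print(ragged.dtype)   # object
print(ragged[0])      # ['boy', 'bamboo']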
I would like to know where I went wrong, and how to use multiple threads (as with map_batches) to convert a column of lists into a column of numpy arrays (or, preferably, sets).
Perhaps not the best approach, but the following worked reasonably well.
>>> my_dict = dict(df.to_numpy().tolist())
>>> my_dict
{'apple': array(['boy', 'bamboo'], dtype=object), 'orange': array(['ball', 'bull'], dtype=object)}
>>> for i in my_dict:
...     my_dict[i] = set(my_dict[i])
...
>>> my_dict
{'apple': {'bamboo', 'boy'}, 'orange': {'ball', 'bull'}}
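If the end goal is only the pairwise intersections, newer polars versions also expose a native list.set_intersection expression, which stays in Rust and runs across threads without creating Python sets at all. A sketch of that route (row_i, ColB_i and common are just illustrative names):

import polars as pl

df = pl.DataFrame({'ColA': ['apple', 'orange'],
                   'ColB': [['boy', 'bamboo'], ['ball', 'bull']]})

# Broadcast the i'th row's list against every row via a cross join,
# then intersect natively; no Python objects involved.
row_i = df.filter(pl.col('ColA') == 'apple').select(pl.col('ColB').alias('ColB_i'))
out = df.join(row_i, how='cross').with_columns(
    common=pl.col('ColB').list.set_intersection('ColB_i')
)

The result keeps the list[str] dtype, so it can be written straight back to Parquet.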