I am working in Polars and I have data set where one column is lists of strings. To see what it's like:
import pandas as pd
list_of_lists = [['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
, ['base', 'base.current base', 'base.current base.inventories - total','ABCD']
, ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
, ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']]
pd_df = pd.DataFrame({'lol': list_of_lists})
Gives:
lol
0 ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
1 ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
2 ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
3 ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
I want to hash each list. I thought to cast each list as a string and then hash it. I can do that with Pandas
pd_df = pd.DataFrame({'lol': list_of_lists}).astype({'lol':str})
pl_df_1 = pl.DataFrame(pd_df)
pl_df_1.with_columns(pl.col('lol')
.hash(seed=140)
.name.suffix('_hashed')
)
Gives:
lol lol_hashed
str u64
"['base', 'base.current base', … 14283628883798345624
"['base', 'base.current base', … 14283628883798345624
"['base', 'base.current base', … 14283628883798345624
"['base', 'base.current base', … 14283628883798345624
But if I try to do similar in Polars I get an error:
pl_df_2 = pl.DataFrame({'lol': list_of_lists})
pl_df_2.with_columns(pl.col('lol') # <== can insert .cast(pl.String) here still get error
.hash(seed=140)
.name.suffix('_hashed')
)
Gives:
# PanicException: Hashing a list with a non-numeric inner type not supported.
# Got dtype: List(String)
I would prefer to work just with the Polars library so is it possible to cast the column of lists as strings or is there a better way in Polars of achieving the same result?
UPDATE:
Based on the accepted answer I experimented further.
list_of_lists = [
['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'],
['base', 'base.current base', 'base.current base.inventories - total', 'DEFG'],
['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'],
['base', 'base.current base', 'base.current base.inventories - total', 'HIJK'],
'(bobbyJoe460)',
'bobby, Joe (xx866e)',
137642039575
]
pl_df_1 = pl.DataFrame({'lol': list_of_lists}, strict=False) # <==== allow mixed types in column
pl_df_1.with_columns(pl.col('lol')
.cast(pl.Categorical) # <==== cast to Categorical
.hash(seed=140)
.name.suffix('_hashed')
)
Gives:
lol. lol_hashed
str u64
"["base", "base.current base", … 11231070086490249882
"["base", "base.current base", … 6519339301964281776
"["base", "base.current base", … 11231070086490249882
"["base", "base.current base", … 14549859594875138034
"(bobbyJoe460)" 1954884316252525743
"bobby, Joe (xx866e)" 4241414284122449899
"137642039575" 6383308039250228053
The PanicException is a bug and could be reported.
.list.join()
can be used to create a "single string" which you can hash.
df = pl.DataFrame({"lol": list_of_lists})
df.with_columns(
pl.col("lol").list.join("").hash()
)
shape: (4, 1)
┌─────────────────────┐
│ lol │
│ --- │
│ u64 │
╞═════════════════════╡
│ 8244365561843513530 │
│ 8244365561843513530 │
│ 8244365561843513530 │
│ 8244365561843513530 │
└─────────────────────┘
You can also cast to Categoricals which can be hashed.
df.with_columns(
pl.col("lol").cast(pl.List(pl.Categorical)).hash()
)
shape: (4, 1)
┌────────────────────┐
│ lol │
│ --- │
│ u64 │
╞════════════════════╡
│ 599332512135826737 │
│ 599332512135826737 │
│ 599332512135826737 │
│ 599332512135826737 │
└────────────────────┘