I am using polars to hash some columns in a data set. One column contains lists of strings and the other contains strings. My approach is to cast each column to a string-like type and then hash the columns. The problem I am having is with the type casting.
I am using the with_columns method as follows:
import polars as pl

list_of_lists = [
    ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'],
    ['base', 'base.current base', 'base.current base.inventories - total', 'DEFG'],
    ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'],
    ['base', 'base.current base', 'base.current base.inventories - total', 'HIJK'],
]
list_of_strings = [
    '(bobbyJoe460)',
    'bobby, Joe (xx866e)',
    '137642039575',
    'mamamia',
]
pl_df_1 = pl.DataFrame({'lists': list_of_lists, 'stris': list_of_strings}, strict=False)
# The same cast is applied to both columns, whatever their dtype
pl_df_1.with_columns(
    pl.col(['lists', 'stris'])
    .cast(pl.List(pl.Categorical))
    .hash(seed=140)
    .name.suffix('_hashed')
)
Note that the cast is pl.List(pl.Categorical). If I omit the pl.List, the cast fails with an error. With pl.List included, the code gives:
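┌─────────────────────────────────────────┬───────────────────────┬──────────────────────┬──────────────┐
│ lists                                   ┆ stris                 ┆ lists_hashed         ┆ stris_hashed │
│ ---                                     ┆ ---                   ┆ ---                  ┆ ---          │
│ list[str]                               ┆ str                   ┆ u64                  ┆ u64          │
╞═════════════════════════════════════════╪═══════════════════════╪══════════════════════╪══════════════╡
│ ["base", "base.current base", … "ABCD"] ┆ "(bobbyJoe460)"       ┆ 11845069150176100519 ┆ 594396677107 │
│ ["base", "base.current base", … "DEFG"] ┆ "bobby, Joe (xx866e)" ┆ 6761150988783483050  ┆ 594396677107 │
│ ["base", "base.current base", … "ABCD"] ┆ "137642039575"        ┆ 11845069150176100519 ┆ 594396677107 │
│ ["base", "base.current base", … "HIJK"] ┆ "mamamia"             ┆ 8290133271651710679  ┆ 594396677107 │
└─────────────────────────────────────────┴───────────────────────┴──────────────────────┴──────────────┘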
Note that every row of the string column has the same hash value. Ideally I would like a boolean expression in the with_columns that detects the column type and, if it is a List, uses pl.List(pl.Categorical), and if it is a String, uses just pl.Categorical. Is that possible?
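This looks like a bug to me - I think the .cast() should raise?
Instead, casting the string column to pl.List(pl.Categorical) gives a list holding a single null in every row, which is why the resulting .hash() is the same for each row.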
df.select(pl.col("strings").cast(pl.List(pl.Categorical)))
shape: (4, 1)
┌───────────┐
│ strings │
│ --- │
│ list[u32] │
╞═══════════╡
│ [null] │
│ [null] │
│ [null] │
│ [null] │
└───────────┘
You can select columns by type.
df.with_columns(
    pl.col(pl.List(pl.String)).cast(pl.List(pl.Categorical)).hash(seed=140).name.suffix("_hashed"),
    pl.col(pl.String).cast(pl.Categorical).hash(seed=140).name.suffix("_hashed"),
)
You could also generate the expressions in a loop, which may be neater.
df.with_columns(
    pl.col(old).cast(new).hash(seed=140).name.suffix("_hashed")
    for old, new in {
        pl.String: pl.Categorical,
        pl.List(pl.String): pl.List(pl.Categorical)
    }.items()
)
shape: (4, 4)
┌─────────────────────────────────┬─────────────────────┬──────────────────────┬──────────────────────┐
│ lists ┆ strings ┆ strings_hashed ┆ lists_hashed │
│ --- ┆ --- ┆ --- ┆ --- │
│ list[str] ┆ str ┆ u64 ┆ u64 │
╞═════════════════════════════════╪═════════════════════╪══════════════════════╪══════════════════════╡
│ ["base", "base.current base", … ┆ (bobbyJoe460) ┆ 11231070086490249882 ┆ 11845069150176100519 │
│ ["base", "base.current base", … ┆ bobby, Joe (xx866e) ┆ 6519339301964281776 ┆ 6761150988783483050 │
│ ["base", "base.current base", … ┆ 137642039575 ┆ 14549859594875138034 ┆ 11845069150176100519 │
│ ["base", "base.current base", … ┆ mamamia ┆ 1954884316252525743 ┆ 8290133271651710679 │
└─────────────────────────────────┴─────────────────────┴──────────────────────┴──────────────────────┘