listcastingpython-polarspolars

polars casting a list to string


I am working in Polars and I have data set where one column is lists of strings. To see what it's like:

import pandas as pd

list_of_lists = [['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
                  , ['base', 'base.current base', 'base.current base.inventories - total','ABCD']
                  , ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
                  , ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']]

pd_df = pd.DataFrame({'lol': list_of_lists})

Gives:

    lol
0   ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
1   ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
2   ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
3   ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']

I want to hash each list. I thought to cast each list as a string and then hash it. I can do that with Pandas

pd_df = pd.DataFrame({'lol': list_of_lists}).astype({'lol':str})
pl_df_1 = pl.DataFrame(pd_df)
pl_df_1.with_columns(pl.col('lol')
                    .hash(seed=140)
                    .name.suffix('_hashed')
                    )

Gives:

               lol                  lol_hashed
               str                   u64
"['base', 'base.current base', …    14283628883798345624
"['base', 'base.current base', …    14283628883798345624
"['base', 'base.current base', …    14283628883798345624
"['base', 'base.current base', …    14283628883798345624

But if I try to do similar in Polars I get an error:

pl_df_2 = pl.DataFrame({'lol': list_of_lists})
pl_df_2.with_columns(pl.col('lol') # <== can insert .cast(pl.String) here still get error
                    .hash(seed=140)
                    .name.suffix('_hashed')
                    )

Gives:

# PanicException: Hashing a list with a non-numeric inner type not supported. 
#   Got dtype: List(String)

I would prefer to work just with the Polars library so is it possible to cast the column of lists as strings or is there a better way in Polars of achieving the same result?

UPDATE:

Based on the accepted answer I experimented further.

list_of_lists = [
                 ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'], 
                 ['base', 'base.current base', 'base.current base.inventories - total', 'DEFG'], 
                 ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'], 
                 ['base', 'base.current base', 'base.current base.inventories - total', 'HIJK'], 
                 '(bobbyJoe460)',
                 'bobby, Joe (xx866e)',
                 137642039575
                 ]

pl_df_1 = pl.DataFrame({'lol': list_of_lists}, strict=False)  # <==== allow mixed types in column
pl_df_1.with_columns(pl.col('lol')
                     .cast(pl.Categorical)  # <==== cast to Categorical
                     .hash(seed=140)
                     .name.suffix('_hashed')
                     )

Gives:

                 lol.               lol_hashed
                 str                    u64
"["base", "base.current base", …    11231070086490249882
"["base", "base.current base", …    6519339301964281776
"["base", "base.current base", …    11231070086490249882
"["base", "base.current base", …    14549859594875138034
"(bobbyJoe460)"                     1954884316252525743
"bobby, Joe (xx866e)"               4241414284122449899
"137642039575"                      6383308039250228053

Solution

  • The PanicException is a bug and could be reported.

    .list.join() can be used to create a "single string" which you can hash.

    df = pl.DataFrame({"lol": list_of_lists})
    
    df.with_columns(
        pl.col("lol").list.join("").hash()
    )
    
    shape: (4, 1)
    ┌─────────────────────┐
    │ lol                 │
    │ ---                 │
    │ u64                 │
    ╞═════════════════════╡
    │ 8244365561843513530 │
    │ 8244365561843513530 │
    │ 8244365561843513530 │
    │ 8244365561843513530 │
    └─────────────────────┘
    

    You can also cast to Categoricals which can be hashed.

    df.with_columns(
        pl.col("lol").cast(pl.List(pl.Categorical)).hash()
    )
    
    shape: (4, 1)
    ┌────────────────────┐
    │ lol                │
    │ ---                │
    │ u64                │
    ╞════════════════════╡
    │ 599332512135826737 │
    │ 599332512135826737 │
    │ 599332512135826737 │
    │ 599332512135826737 │
    └────────────────────┘