I have not found a way to group_by() a column and have a struct field for the new column.Here is a snippet of the dataframe:
Census_Tract | GEOID_Census | table_B24012 | value |
---|---|---|---|
Census Tract 9… | 04001942600 | B24012_001E | 32179.0 |
Census Tract 9… | 04001942700 | B24012_001E | 33483.0 |
Census Tract 9… | 04001944000 | B24012_001E | 33114.0 |
Census Tract 9… | 04001944100 | B24012_001E | 26993.0 |
Census Tract 9… | 04001944201 | B24012_001E | 33798.0 |
... | ... | ... | ... |
Census Tract 2… | 56043000200 | B24012_073E | -6.6667e8 |
Census Tract 3… | 56043000301 | B24012_073E | -6.6667e8 |
Census Tract 3… | 56043000302 | B24012_073E | -6.6667e8 |
Census Tract 9… | 56045951100 | B24012_073E | 41042.0 |
Census Tract 9… | 56045951300 | B24012_073E | -6.6667e8 |
Essentially what I am trying to do is to group by Census_Tract
and GEOID_Census
. The columns table_B24012 and value should be combined into one column.
In other words I would like the data to be a struct object such that for each GEOID_Census I have one row value with potential a dictionary structure built into the new column. It would look something like this
Census_Tract | GEOID_Census | table_B24012 |
---|---|---|
Census Tract 9… | 04001942600 | {"B24012_001E": 32179.0, "B24012_002E": ..., ...} |
Census Tract 9… | 04001942700 | {"B24012_001E": 33483.0, "B24012_002E": ..., ...} |
Census Tract 9… | 04001944000 | {"B24012_001E": 33114.0, "B24012_002E": ..., ...} |
Census Tract 9… | 04001944100 | {"B24012_001E": 26993.0, "B24012_002E": ..., ...} |
Census Tract 9… | 04001944201 | {"B24012_001E": 33798.0, "B24012_002E": ..., ...} |
If there is a more efficient way to store it in the column, I am open to other ways. I am just trying to condense the information.
I have tried casting the two columns to a list and then struct, as well as just to a dictionary. Nothing has helped. Any ideas? Thanks in advance!
It sounds like you want something similar to a .pivot()
# increase defaults
pl.Config(fmt_table_cell_list_len=10, fmt_str_lengths=100)
df = pl.from_repr("""
┌────────────────┬──────────────┬──────────────┬─────────┐
│ Census_Tract ┆ GEOID_Census ┆ table_B24012 ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 │
╞════════════════╪══════════════╪══════════════╪═════════╡
│ Census Tract 9 ┆ 04001942600 ┆ B24012_001E ┆ 11111.0 │
│ Census Tract 9 ┆ 04001942700 ┆ B24012_002E ┆ 22222.0 │
│ Census Tract 9 ┆ 04001942600 ┆ B24012_002E ┆ 33333.0 │
│ Census Tract 9 ┆ 04001942700 ┆ B24012_001E ┆ 44444.0 │
│ Census Tract 9 ┆ 04001944100 ┆ B24012_001E ┆ 55555.0 │
└────────────────┴──────────────┴──────────────┴─────────┘
""")
df.pivot("table_B24012", index=["Census_Tract", "GEOID_Census"])
shape: (3, 4)
┌────────────────┬──────────────┬─────────────┬─────────────┐
│ Census_Tract ┆ GEOID_Census ┆ B24012_001E ┆ B24012_002E │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 │
╞════════════════╪══════════════╪═════════════╪═════════════╡
│ Census Tract 9 ┆ 04001942600 ┆ 11111.0 ┆ 33333.0 │
│ Census Tract 9 ┆ 04001942700 ┆ 44444.0 ┆ 22222.0 │
│ Census Tract 9 ┆ 04001944100 ┆ 55555.0 ┆ null │
└────────────────┴──────────────┴─────────────┴─────────────┘
It's easy to create a struct from there if desired:
(df.pivot("table_B24012", index=["Census_Tract", "GEOID_Census"])
.select(
"Census_Tract",
"GEOID_Census",
table_B24012 = pl.struct(pl.exclude("Census_Tract", "GEOID_Census"))
)
)
shape: (3, 3)
┌────────────────┬──────────────┬───────────────────┐
│ Census_Tract ┆ GEOID_Census ┆ table_B24012 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ struct[2] │
╞════════════════╪══════════════╪═══════════════════╡
│ Census Tract 9 ┆ 04001942600 ┆ {11111.0,33333.0} │
│ Census Tract 9 ┆ 04001942700 ┆ {44444.0,22222.0} │
│ Census Tract 9 ┆ 04001944100 ┆ {55555.0,null} │
└────────────────┴──────────────┴───────────────────┘