Polars: compute a new column by cross-referencing two dataframes

First of all I have to say this is my first time using any DataFrame module.

I need to process FEM simulation results, which I have loaded into two polars DataFrames:

import polars as pl

nodes_df = pl.DataFrame(
    {"node": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     "x": [4.5, -4.6, 3.8, -3.8, 2.1, -9.3, 1.5, -6.7, 0.0, 3.6],
     "y": [9.8, -9.9, 8.2, -8.3, 4.6, -2.0, 1.4, -6.1, 0.0, 1.0],
     "z": [0.0, 0.0, 8.8, 8.8, 1.2, 1.2, 5.5, 5.5, 5.5, 0.0]})

elements = [1, 2, 3]
elements_df = pl.DataFrame(
    {
        "element": elements,
        "N_1": [1, 1, 1],
        "N_2": [2, 2, 2],
        "N_3": [3, 3, 3],
        "N_4": [4, 4, 4],
        "N_5": [5, 5, 5],
        "N_6": [6, 6, 6],
        "N_7": [7, 7, 7],
        "N_8": [8, 8, 8],
        "N_9": [9, 9, 9],
        "N_10": [10, 10, 10],
    }
)

This data represents 10-node C3D10 elements.

elements_df contains the definition of all elements, where the columns N_{x} are the ids of its nodes (the node order is important), so each element always has 10 different nodes. nodes_df can have more columns with other simulation results that I will process later. The idea is to pass the results stored in nodes_df to the element centroid.

The example has 10 (fake) nodes and 3 (identical, fake) elements. This is just sample data; in the real data every row will be different. nodes_df has a "node" column for the node id, and x, y, z are the node coordinates. elements_df has an "element" column for the id, and N_<int[1:n]> columns for its node ids. First, I don't know whether this schema is better or worse than keeping the node ids in a single list column, in terms of performance and ease of use of the dataframe. I have no problem changing the schema.

But I need to cross-reference both dataframes to make several computations on my results. To start, I need to compute the element centroid.

As I'm new to dataframes, I wrote this to clarify my point, but I know that this is the worst DataFrame code that anyone can write:

def coords(nodes: list):
    # get nodes coordinates for a node list
    return (
        nodes_df.filter(pl.col("node").is_in(nodes))
        .select(pl.col("x", "y", "z"))
        .mean()
    )

def centroid(el_id: int):
    # get the first three node ids (N_1..N_3) for the element with id = el_id;
    # column 0 of the row is the element id itself, so skip it
    nodes = elements_df.filter(pl.col("element") == el_id).to_numpy()[0][1:4]
    return coords(nodes)

# loop over all elements and create new dataframe
# or maybe use this schema:
# _tmp = {"element": [], "x": [],"y": [],"z": []}
_tmp = {"element": [], "centroid": []}
for el in elements:
    c = centroid(el).to_numpy()[0]
    _tmp["element"].append(el)
    _tmp["centroid"].append(c)
    # _tmp["x"].append(c[0])
    # _tmp["y"].append(c[1])
    # _tmp["z"].append(c[2])

# concat dataframes to add the new "centroid" column
elements_df = pl.concat((elements_df, pl.DataFrame(_tmp)), how="align")

The result will be something like:

shape: (3, 12)
┌─────────┬─────┬─────┬─────┬───┬─────┬─────┬──────┬───────────────────────────┐
│ element ┆ N_1 ┆ N_2 ┆ N_3 ┆ … ┆ N_8 ┆ N_9 ┆ N_10 ┆ centroid                  │
│ ---     ┆ --- ┆ --- ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ ---                       │
│ i64     ┆ i64 ┆ i64 ┆ i64 ┆   ┆ i64 ┆ i64 ┆ i64  ┆ list[f64]                 │
╞═════════╪═════╪═════╪═════╪═══╪═════╪═════╪══════╪═══════════════════════════╡
│ 1       ┆ 1   ┆ 2   ┆ 3   ┆ … ┆ 8   ┆ 9   ┆ 10   ┆ [1.233333, 2.7, 2.933333] │
│ 2       ┆ 1   ┆ 2   ┆ 3   ┆ … ┆ 8   ┆ 9   ┆ 10   ┆ [1.233333, 2.7, 2.933333] │
│ 3       ┆ 1   ┆ 2   ┆ 3   ┆ … ┆ 8   ┆ 9   ┆ 10   ┆ [1.233333, 2.7, 2.933333] │
└─────────┴─────┴─────┴─────┴───┴─────┴─────┴──────┴───────────────────────────┘

or, with the commented schema:

┌─────────┬─────┬─────┬─────┬───┬──────┬──────────┬─────┬──────────┐
│ element ┆ N_1 ┆ N_2 ┆ N_3 ┆ … ┆ N_10 ┆ x        ┆ y   ┆ z        │
│ ---     ┆ --- ┆ --- ┆ --- ┆   ┆ ---  ┆ ---      ┆ --- ┆ ---      │
│ i64     ┆ i64 ┆ i64 ┆ i64 ┆   ┆ i64  ┆ f64      ┆ f64 ┆ f64      │
╞═════════╪═════╪═════╪═════╪═══╪══════╪══════════╪═════╪══════════╡
│ 1       ┆ 1   ┆ 2   ┆ 3   ┆ … ┆ 10   ┆ 1.233333 ┆ 2.7 ┆ 2.933333 │
│ 2       ┆ 1   ┆ 2   ┆ 3   ┆ … ┆ 10   ┆ 1.233333 ┆ 2.7 ┆ 2.933333 │
│ 3       ┆ 1   ┆ 2   ┆ 3   ┆ … ┆ 10   ┆ 1.233333 ┆ 2.7 ┆ 2.933333 │
└─────────┴─────┴─────┴─────┴───┴──────┴──────────┴─────┴──────────┘

I'm sure that what I did here can be done directly in Polars, but I don't know where to start. This is my simplest case, and I will have to compute other things by cross-referencing both dataframes, so I want to learn how to do it. I'm also looking for advice on my schemas.

Solution

Optional Intermediate Step

I would start off with an unpivot/join like this:

(
    elements_df
    .unpivot(on=['N_1', 'N_2','N_3'], index='element', value_name='node')
    .join(nodes_df, on='node')
    .group_by('element', maintain_order=True)
    .agg(pl.col('x','y','z').mean())
    .with_columns(centroid=pl.concat_list('x','y','z')).drop('x','y','z')
)
shape: (3, 2)
┌─────────┬───────────────────────────┐
│ element ┆ centroid                  │
│ ---     ┆ ---                       │
│ i64     ┆ list[f64]                 │
╞═════════╪═══════════════════════════╡
│ 1       ┆ [1.233333, 2.7, 2.933333] │
│ 2       ┆ [1.233333, 2.7, 2.933333] │
│ 3       ┆ [1.233333, 2.7, 2.933333] │
└─────────┴───────────────────────────┘

If you want the x/y/z in their own columns then remove the last .with_columns.

If you want the N_{x} columns back then you have to do another join back to the original:

Final Answer

(
    elements_df
    .unpivot(on=['N_1', 'N_2','N_3'], index='element', value_name='node')
    .join(nodes_df, on='node')
    .group_by('element', maintain_order=True)
    .agg(pl.col('x','y','z').mean())
    .join(elements_df, on='element')
    .select(pl.exclude('x','y','z'), pl.col('x','y','z'))
)
shape: (3, 14)
┌─────────┬─────┬─────┬─────┬───┬──────┬──────────┬─────┬──────────┐
│ element ┆ N_1 ┆ N_2 ┆ N_3 ┆ … ┆ N_10 ┆ x        ┆ y   ┆ z        │
│ ---     ┆ --- ┆ --- ┆ --- ┆   ┆ ---  ┆ ---      ┆ --- ┆ ---      │
│ i64     ┆ i64 ┆ i64 ┆ i64 ┆   ┆ i64  ┆ f64      ┆ f64 ┆ f64      │
╞═════════╪═════╪═════╪═════╪═══╪══════╪══════════╪═════╪══════════╡
│ 1       ┆ 1   ┆ 2   ┆ 3   ┆ … ┆ 10   ┆ 1.233333 ┆ 2.7 ┆ 2.933333 │
│ 2       ┆ 1   ┆ 2   ┆ 3   ┆ … ┆ 10   ┆ 1.233333 ┆ 2.7 ┆ 2.933333 │
│ 3       ┆ 1   ┆ 2   ┆ 3   ┆ … ┆ 10   ┆ 1.233333 ┆ 2.7 ┆ 2.933333 │
└─────────┴─────┴─────┴─────┴───┴──────┴──────────┴─────┴──────────┘

Note that the last select is only necessary to put the columns in order.

Alternate for fun

If you want to avoid the last join, you can tweak the initial unpivot to carry all the superfluous columns along as part of the index, and then use other tricks to rebuild the unpivoted columns at the end:

(
    elements_df
    .unpivot(
        index=(index := ['element',*(other_cols:=[f"N_{x}" for x in range(4,11)])]),
        on=(on := [x for x in elements_df.columns if x not in index]),
        value_name='node',
        )
    .join(nodes_df, on='node')
    .group_by('element', maintain_order=True)
    .agg(pl.col('node'), pl.col(*other_cols).first(), pl.col('x','y','z').mean())
    .with_columns(pl.col('node').list.to_struct(fields=on))
    .unnest('node')
)