
Reference polars.DataFrame.height in with_columns


Take this example:

import datetime

import numpy
import polars

df = (polars
  .DataFrame(dict(
    j=polars.datetime_range(datetime.date(2023, 1, 1), datetime.date(2023, 1, 3), '8h', closed='left', eager=True),
    ))
  .with_columns(
    k=polars.lit(numpy.random.randint(10, 99, 6)),
    )
  )

 j                    k
 2023-01-01 00:00:00  47
 2023-01-01 08:00:00  22
 2023-01-01 16:00:00  82
 2023-01-02 00:00:00  19
 2023-01-02 08:00:00  85
 2023-01-02 16:00:00  15
shape: (6, 2)

Here, numpy.random.randint(10, 99, 6) hard-codes 6 as the height of the DataFrame, so it breaks if I change, e.g., the interval from 8h to 4h (which would require changing 6 to 12).

I know I can do it by breaking the chain:

df = polars.DataFrame(dict(
  j=polars.datetime_range(datetime.date(2023, 1, 1), datetime.date(2023, 1, 3), '4h', closed='left', eager=True),
  ))

df = df.with_columns(
  k=polars.lit(numpy.random.randint(10, 99, df.height)),
  )

 j                    k
 2023-01-01 00:00:00  47
 2023-01-01 04:00:00  22
 2023-01-01 08:00:00  82
 2023-01-01 12:00:00  19
 2023-01-01 16:00:00  85
 2023-01-01 20:00:00  15
 2023-01-02 00:00:00  89
 2023-01-02 04:00:00  74
 2023-01-02 08:00:00  26
 2023-01-02 12:00:00  11
 2023-01-02 16:00:00  86
 2023-01-02 20:00:00  81
shape: (12, 2)

Is there a way to do it (i.e. reference df.height or an equivalent) in one chained expression though?


Solution

  • You can use .pipe(), which passes the DataFrame to a function, so df.height is available inside it:

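    import datetime

    import numpy as np
    import polars as pl

    df = (
       pl.datetime_range(
          datetime.date(2023, 1, 1),
          datetime.date(2023, 1, 3),
          "4h",
          closed="left",
          eager=True
       )
       .alias("date")
       .to_frame()
    )

    # Inside the lambda, df is the frame handed over by .pipe(),
    # so its height can be read without a hard-coded number.
    df.pipe(lambda df:
        df.with_columns(pl.lit(np.random.randint(10, 99, df.height)).alias("rand"))
    )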
    
    shape: (12, 2)
    ┌─────────────────────┬──────┐
    │ date                ┆ rand │
    │ ---                 ┆ ---  │
    │ datetime[μs]        ┆ i64  │
    ╞═════════════════════╪══════╡
    │ 2023-01-01 00:00:00 ┆ 39   │
    │ 2023-01-01 04:00:00 ┆ 45   │
    │ 2023-01-01 08:00:00 ┆ 95   │
    │ 2023-01-01 12:00:00 ┆ 72   │
    │ …                   ┆ …    │
    │ 2023-01-02 08:00:00 ┆ 34   │
    │ 2023-01-02 12:00:00 ┆ 42   │
    │ 2023-01-02 16:00:00 ┆ 30   │
    │ 2023-01-02 20:00:00 ┆ 83   │
    └─────────────────────┴──────┘
    

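    As for the example task itself, perhaps .sample() could be used instead of NumPy, with pl.len() supplying the row count inside the expression. Note that pl.int_range(10, 100) covers 10 to 99 inclusive, whereas numpy.random.randint(10, 99, ...) excludes 99.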

    df.with_columns(
       pl.int_range(10, 100).sample(pl.len(), with_replacement=True).alias("rand")
    )
    
    shape: (12, 2)
    ┌─────────────────────┬──────┐
    │ date                ┆ rand │
    │ ---                 ┆ ---  │
    │ datetime[μs]        ┆ i64  │
    ╞═════════════════════╪══════╡
    │ 2023-01-01 00:00:00 ┆ 25   │
    │ 2023-01-01 04:00:00 ┆ 27   │
    │ 2023-01-01 08:00:00 ┆ 68   │
    │ 2023-01-01 12:00:00 ┆ 95   │
    │ 2023-01-01 16:00:00 ┆ 96   │
    │ …                   ┆ …    │
    │ 2023-01-02 04:00:00 ┆ 36   │
    │ 2023-01-02 08:00:00 ┆ 25   │
    │ 2023-01-02 12:00:00 ┆ 90   │
    │ 2023-01-02 16:00:00 ┆ 92   │
    │ 2023-01-02 20:00:00 ┆ 92   │
    └─────────────────────┴──────┘