python-polars

How do I do a train and test split in a polars dataframe?


I am trying to find a simple way of randomly splitting a polars dataframe into train and test sets. This is how I am doing it right now:

import numpy as np
import polars as pl
train, test = (
    df.with_columns(pl.lit(np.random.rand(df.height) > 0.8).alias('split'))
    .partition_by('split')
)

However, this leaves an extra split column in both dataframes, which I then have to drop.


Solution

  • There is an open feature request for allowing .partition_by to drop keys.

    As discussed in the comments, it is possible to shuffle a dataframe using .sample():

    import polars as pl

    df = pl.DataFrame({"val": range(100)})

    # fraction=1 with shuffle=True returns every row, in random order
    df = df.sample(fraction=1, shuffle=True)

    >>> df
    shape: (100, 1)
    ┌─────┐
    │ val │
    │ --- │
    │ i64 │
    ╞═════╡
    │ 64  │
    │ 40  │
    │ 39  │
    │ 98  │
    │ …   │
    │ 21  │
    │ 29  │
    │ 87  │
    │ 99  │
    └─────┘
    

    The shuffled dataframe can then be split into parts, e.g. using .head and .tail (a negative argument to .tail returns all rows except the first n):

    test_size = 20
    test, train = df.head(test_size), df.tail(-test_size)
    
    >>> test
    shape: (20, 1)
    ┌─────┐
    │ val │
    │ --- │
    │ i64 │
    ╞═════╡
    │ 60  │
    │ 24  │
    │ 96  │
    │ 94  │
    │ …   │
    │ 50  │
    │ 54  │
    │ 56  │
    │ 33  │
    └─────┘
    
    >>> train
    shape: (80, 1)
    ┌─────┐
    │ val │
    │ --- │
    │ i64 │
    ╞═════╡
    │ 87  │
    │ 38  │
    │ 6   │
    │ 37  │
    │ …   │
    │ 93  │
    │ 77  │
    │ 8   │
    │ 23  │
    └─────┘