[SOLVED] How do I do a train and test split in a polars dataframe?

I am trying to find a simple way to randomly split a polars dataframe into train and test sets. This is how I am doing it right now:

train, test = (
    df.with_columns(pl.lit(np.random.rand(df.height) > 0.8).alias('split'))
    .partition_by('split')
)

However, this leaves an extra split column in both dataframes that I need to drop afterwards.

Solution

There is an open feature request for allowing .partition_by to drop keys.

As discussed in the comments, it is possible to shuffle a dataframe using .sample() with fraction=1 and shuffle=True:

import polars as pl

df = pl.DataFrame({"val": range(100)})

# fraction=1 keeps every row; shuffle=True randomizes their order
df = df.sample(fraction=1, shuffle=True)

shape: (100, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 64  │
│ 40  │
│ 39  │
│ 98  │
│ …   │
│ 21  │
│ 29  │
│ 87  │
│ 99  │
└─────┘

The shuffled dataframe can then be split into parts, e.g. using .head and .tail (a negative argument to .tail returns all rows except the first n):

test_size = 20
test, train = df.head(test_size), df.tail(-test_size)

>>> test
shape: (20, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 60  │
│ 24  │
│ 96  │
│ 94  │
│ …   │
│ 50  │
│ 54  │
│ 56  │
│ 33  │
└─────┘

>>> train
shape: (80, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 87  │
│ 38  │
│ 6   │
│ 37  │
│ …   │
│ 93  │
│ 77  │
│ 8   │
│ 23  │
└─────┘