I am trying to find a simple way to randomly split a polars dataframe into train and test sets. This is how I am doing it right now:
train, test = (
    df
    .with_columns(pl.lit(np.random.rand(df.height) > 0.8).alias('split'))
    .partition_by('split')
)
However, this leaves an extra split column in both dataframes that I need to drop afterwards.
There is an open feature request for allowing .partition_by to drop keys.
As discussed in the comments, it is possible to shuffle a dataframe using .sample():
import polars as pl

df = pl.DataFrame({"val": range(100)})
df = df.sample(fraction=1, shuffle=True)  # fraction=1 keeps every row, shuffle=True randomizes the order
shape: (100, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 64 │
│ 40 │
│ 39 │
│ 98 │
│ … │
│ 21 │
│ 29 │
│ 87 │
│ 99 │
└─────┘
The shuffled dataframe can then be split into parts, e.g. using .head and .tail:
test_size = 20
# head(n) takes the first n rows; tail(-n) takes all rows except the first n
test, train = df.head(test_size), df.tail(-test_size)
>>> test
shape: (20, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 60 │
│ 24 │
│ 96 │
│ 94 │
│ … │
│ 50 │
│ 54 │
│ 56 │
│ 33 │
└─────┘
>>> train
shape: (80, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 87 │
│ 38 │
│ 6 │
│ 37 │
│ … │
│ 93 │
│ 77 │
│ 8 │
│ 23 │
└─────┘