I have created the following pandas dataframe:
import pandas as pd
ds = {'col1':[1.0,2.1,2.2,3.1,41,5.2,5.0,6.1,7.1,10]}
df = pd.DataFrame(data=ds)
The dataframe looks like this:
print(df)
col1
0 1.0
1 2.1
2 2.2
3 3.1
4 41.0
5 5.2
6 5.0
7 6.1
8 7.1
9 10.0
I need to create a random 80% / 20% partition of the dataset, and I also need to create a field (called buildFlag) that shows whether a record belongs to the 80% partition (buildFlag = 1) or to the 20% partition (buildFlag = 0).
For example, the resulting dataframe would look like this:
col1 buildFlag
0 1.0 1
1 2.1 1
2 2.2 1
3 3.1 0
4 41.0 1
5 5.2 0
6 5.0 1
7 6.1 1
8 7.1 1
9 10.0 1
The buildFlag values are assigned randomly.
Can anyone help me, please?
SOLUTION (PANDAS + NUMPY)
A possible solution, which works as follows:
First, np.random.choice randomly chooses 80% of the df indices without replacement.
The df.index.isin method then checks each row's index to see if it was selected.
Finally, np.where assigns a 1 to the Flag column for selected indices and a 0 for the others.
import numpy as np
# Flag is 1 for the randomly chosen 80% of rows and 0 for the rest
df.assign(Flag=np.where(
    df.index.isin(np.random.choice(
        df.index, size=int(0.8 * len(df)),
        replace=False)),
    1, 0))
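The snippet above returns a different partition on every run, because np.random.choice is not seeded. If a repeatable split is wanted, the steps can be sketched as follows. This is only an illustration: the use of np.random.default_rng, the seed value 0, and the names rng, build_idx and result are assumptions that do not appear in the answer, and the column is named buildFlag as in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.1, 2.2, 3.1, 41, 5.2, 5.0, 6.1, 7.1, 10]})

# Seeded generator so the same rows are picked on every run (the seed is arbitrary).
rng = np.random.default_rng(seed=0)

# Choose 80% of the index labels without replacement.
build_idx = rng.choice(df.index.to_numpy(), size=int(0.8 * len(df)), replace=False)

# 1 for rows in the 80% partition, 0 for the remaining rows.
result = df.assign(buildFlag=df.index.isin(build_idx).astype(int))
print(result)
```

Note that int() truncates, so when 0.8 * len(df) is not a whole number the build partition is rounded down.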
SOLUTION (PANDAS + SKLEARN)
Alternatively, we can use scikit-learn's train_test_split function:
First, it randomly splits the df's indices into two groups: 80% for training and 20% for testing, as specified by test_size=0.2.
The training indices are extracted using [0]. The df.index.isin method then checks which indices belong to the training set, producing a boolean array.
Finally, this boolean array is converted to integers (1 for True and 0 for False) using .astype(int).
from sklearn.model_selection import train_test_split
df.assign(Flag = df.index.isin(
train_test_split(df.index, test_size=0.2, random_state=42)[0]).astype(int))
Output:
col1 Flag
0 1.0 0
1 2.1 1
2 2.2 1
3 3.1 1
4 41.0 1
5 5.2 1
6 5.0 1
7 6.1 1
8 7.1 1
9 10.0 0
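Whichever approach is used, it can be worth checking that the flag really splits the rows 80/20. The sketch below does this with a pandas-only variant that is not part of the answers above: it assumes DataFrame.sample with frac=0.8 to pick the build rows, random_state=42 as an arbitrary seed, and the hypothetical names build_rows, flagged and counts.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.1, 2.2, 3.1, 41, 5.2, 5.0, 6.1, 7.1, 10]})

# Draw 80% of the rows at random; random_state makes the draw repeatable.
build_rows = df.sample(frac=0.8, random_state=42)

# Flag the sampled rows with 1 and all other rows with 0, keeping the original row order.
flagged = df.assign(buildFlag=df.index.isin(build_rows.index).astype(int))

# Sanity check: number of rows per flag value.
counts = flagged["buildFlag"].value_counts()
print(counts)
```

For the ten-row dataframe in the question, the count should be 8 rows with buildFlag = 1 and 2 rows with buildFlag = 0.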