python pandas machine-learning scikit-learn data-augmentation

How to avoid data leakage when using data augmentation?


I am working on a classification problem that uses data augmentation. I have already extracted features from the augmented copies, which were created by adding noise and applying other changes. However, I want to avoid data leakage, which can happen when, for example, a copy is in the training set and its original is in the test set.

I started testing some solutions, and I arrived at the code below. However, I do not know if the current solution can prevent this problem.

Basically, I have the original data (df) and the data with the features of the copies (df2). After I split df into training and test sets, I look up the copies in df2 by id so that each copy ends up in the same set as its original, whether that is the training set or the test set.

Can someone help me?

Here is the code:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_excel("/content/drive/MyDrive/data/audio.xlsx")      # original samples
df2 = pd.read_excel("/content/drive/MyDrive/data/audioAUG.xlsx")  # augmented copies, linked to originals by 'id'
X = df.drop('emotion', axis=1)
y = df['emotion']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
# Put each augmented copy on the same side of the split as its original
X_train_AUG = df2[df2['id'].isin(X_train['id'])]
X_test_AUG = df2[df2['id'].isin(X_test['id'])]
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
y_train = pd.concat([y_train, X_train_AUG['emotion']])
y_test = pd.concat([y_test, X_test_AUG['emotion']])
X_train = pd.concat([X_train, X_train_AUG.drop(columns='emotion')])
X_test = pd.concat([X_test, X_test_AUG.drop(columns='emotion')])
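
One way to check whether a split like this leaks is to confirm that no id appears on both sides after the copies have been merged in. The snippet below is a minimal sketch on toy data, not the real audio features; the helper name shared_ids and the toy frames are made up for illustration, and it assumes every copy carries the id of its original.

```python
import pandas as pd


def shared_ids(train: pd.DataFrame, test: pd.DataFrame, key: str = "id") -> set:
    """Return the key values that occur in both frames (empty set = no leakage by id)."""
    return set(train[key]) & set(test[key])


# Toy stand-ins for df (originals) and df2 (augmented copies).
originals = pd.DataFrame({"id": [1, 2, 3, 4], "feat": [0.10, 0.20, 0.30, 0.40]})
copies = pd.DataFrame({"id": [1, 1, 2, 3, 4, 4],
                       "feat": [0.11, 0.09, 0.21, 0.31, 0.41, 0.39]})

# Pretend the split put ids 1-3 in training and id 4 in testing.
train_ids, test_ids = [1, 2, 3], [4]

train = pd.concat([originals[originals["id"].isin(train_ids)],
                   copies[copies["id"].isin(train_ids)]])
test = pd.concat([originals[originals["id"].isin(test_ids)],
                  copies[copies["id"].isin(test_ids)]])

# Clean split: nothing is shared between the two sets.
clean_overlap = shared_ids(train, test)

# Leaky split for comparison: all copies are dumped into the test set.
leaky_test = pd.concat([originals[originals["id"].isin(test_ids)], copies])
leaky_overlap = shared_ids(train, leaky_test)
```

An empty result only shows that no id is shared. It does not catch copies whose id differs from that of their original.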

Solution

  • Short answer: your splitting procedure is fine. However, I would personally split both df and df2 at 75% of their length (if both have the same size), because I don't know how df2 was generated as an augmented version of df. I think this is fine as long as the ['id'] values are in the same order in both data frames (for example, if both are sorted by id in ascending order), e.g.

    train_len = int(0.75 * len(df))
    train_data = df[:train_len]  # something like this
    data_AUG = df2[:train_len]
    

    Then apply the same steps you described to whatever is in df2 for your data augmentation. This prevents data leakage as long as the rows of df and df2 correspond one to one (as far as I understand, they do).

    Or, maybe a better way: generate the augmented data after splitting, from the 75% of the data that will be used to train the model.
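
The last suggestion, augmenting only after the split, could look roughly like the sketch below. It is an illustration under assumptions rather than the asker's pipeline: the toy frame, the feature column names feat_a and feat_b, and the add_noise helper are all hypothetical, and Gaussian noise on the feature columns stands in for whatever augmentation is actually applied to the audio.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the original data; only 'id' and 'emotion' mirror the question.
df = pd.DataFrame({
    "id": range(8),
    "feat_a": np.linspace(0.0, 1.0, 8),
    "feat_b": np.linspace(1.0, 2.0, 8),
    "emotion": ["happy", "sad"] * 4,
})
feature_cols = ["feat_a", "feat_b"]

# 1. Split the originals first, before any augmentation exists.
train, test = train_test_split(
    df, test_size=0.25, random_state=42, stratify=df["emotion"]
)


def add_noise(frame, cols, scale=0.01, seed=0):
    """Hypothetical augmentation: return a copy of frame with Gaussian noise added to cols."""
    rng = np.random.default_rng(seed)
    noisy = frame.copy()
    noisy[cols] = noisy[cols] + rng.normal(0.0, scale, size=(len(frame), len(cols)))
    return noisy


# 2. Augment the training rows only; the test rows stay untouched originals.
train_aug = pd.concat([train, add_noise(train, feature_cols)], ignore_index=True)

X_train, y_train = train_aug[feature_cols], train_aug["emotion"]
X_test, y_test = test[feature_cols], test["emotion"]
```

Because the copies are derived from the training rows after the split, no augmented version of a test sample can reach the training set, and the test set contains only unmodified originals.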