python, pandas, dataframe, shuffle, sentence-similarity

Shuffle pandas column while avoiding a condition


I have a dataframe of sentence pairs, where each row holds 2 sentences that are similar. A 3rd column, Relationship, contains a short code that encodes what the two texts have in common. For instance:
P for Plant, V for Vegetables and F for Fruits. Also,
A for Animal, I for Insects and M for Mammals.

import pandas as pd

data = {'Text1': ["All Vegetables are Plants",
                   "Cows are happy",
                   "Butterflies are really beautiful",
                   "I enjoy Mangoes",
                   "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables',
                  'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me',
                  'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}

df = pd.DataFrame(data)

print(df)

>>>
   Text1                             Text2                                          Relationship
0  All Vegetables are Plants         Some Plants are good Vegetables                PV123
1  Cows are happy                    Cows are enjoying                              AM4355
2  Butterflies are really beautiful  Beautiful butterflies are delightful to watch  AI784
3  I enjoy Mangoes                   Mango pleases me                               PF897
4  Vegetables are green              Spinach is green                               PV776

I want to train a BERT model on this data. However, I also need examples of dissimilar sentences. My plan is to give the dataset as it is a label of 1, then shuffle Text2 and give the shuffled pairs a label of 0. The problem is that random shuffling alone does not reliably produce dissimilar pairs; I need to make use of the "Relationship" column.

How can I shuffle my data so that texts like "All Vegetables are Plants" and "Spinach is green" (both from PV rows) do not end up on the same row in Text1 and Text2 respectively?
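To make the problem concrete, here is a minimal sketch of the naive shuffle. It assumes the first two letters of Relationship identify the category; the column name Text2Relationship and the variable names are illustrative only and not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "Text1": ["All Vegetables are Plants", "Cows are happy",
              "Butterflies are really beautiful", "I enjoy Mangoes",
              "Vegetables are green"],
    "Text2": ["Some Plants are good Vegetables", "Cows are enjoying",
              "Beautiful butterflies are delightful to watch",
              "Mango pleases me", "Spinach is green"],
    "Relationship": ["PV123", "AM4355", "AI784", "PF897", "PV776"],
})

# Positive examples: the dataset as it is, labelled 1.
positives = df.assign(Label=1)

# Naive negatives: permute Text2 without looking at Relationship.
shuffled = df["Text2"].sample(frac=1, random_state=0)
negatives = df.assign(
    Text2=shuffled.to_numpy(),
    # Hypothetical helper column: the Relationship code of the row
    # that each shuffled Text2 originally came from.
    Text2Relationship=df.loc[shuffled.index, "Relationship"].to_numpy(),
    Label=0,
)

# Flag rows where the swapped-in text comes from the same category
# (assumption: the first two letters of the code are the category).
same_group = (negatives["Relationship"].str[:2]
              == negatives["Text2Relationship"].str[:2])
print(negatives[same_group])
```

Any row printed here is a pair labelled 0 whose two texts still share a category, which is exactly the kind of example to avoid.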


Solution

  • I resolved this by:

    1. Creating a new column holding the first 2 letters of the Relationship column.
    2. Using this new column to create a multi-index. A groupby on this new column should work here as well.
    3. Populating Text2, for each group, with texts drawn from the other groups.
    4. Concatenating all the modified groups back together.

    With this, I was able to create semantically dissimilar pairs.
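
The steps above could be sketched roughly as follows, using the groupby variant rather than a multi-index. This is one possible reading of the approach, not the exact code that was used: the function name make_negative_pairs, the Group and Label columns, and the seed parameter are all assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Text1": ["All Vegetables are Plants", "Cows are happy",
              "Butterflies are really beautiful", "I enjoy Mangoes",
              "Vegetables are green"],
    "Text2": ["Some Plants are good Vegetables", "Cows are enjoying",
              "Beautiful butterflies are delightful to watch",
              "Mango pleases me", "Spinach is green"],
    "Relationship": ["PV123", "AM4355", "AI784", "PF897", "PV776"],
})


def make_negative_pairs(frame, seed=0):
    """Return a copy of frame where Text2 comes from a different group."""
    # Step 1: group key = first two letters of the Relationship code.
    keyed = frame.assign(Group=frame["Relationship"].str[:2])
    parts = []
    # Step 2: iterate over the groups.
    for group, rows in keyed.groupby("Group"):
        # Step 3: candidate Text2 values come only from other groups.
        others = keyed.loc[keyed["Group"] != group, "Text2"]
        if others.empty:
            continue  # only one group present, nothing to swap in
        sampled = others.sample(
            n=len(rows),
            replace=len(rows) > len(others),  # reuse texts only if needed
            random_state=seed,
        )
        parts.append(rows.assign(Text2=sampled.to_numpy(), Label=0))
    # Step 4: concatenate the modified groups and restore row order.
    return pd.concat(parts).sort_index().drop(columns="Group")


negatives = make_negative_pairs(df)
positives = df.assign(Label=1)
training = pd.concat([positives, negatives], ignore_index=True)
print(training)
```

Note that grouping on two letters treats PV and PF as different groups, so a Fruit sentence can still be paired with a Vegetable sentence. If those should also count as related, grouping on the first letter alone (`str[:1]`) would be a stricter variant.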