I have a dataframe in which each row holds two similar sentences. A third column, Relationship, contains a string code that describes how the two texts are related. For instance, P stands for Plant, V for Vegetables, and F for Fruits. Likewise, A stands for Animal, I for Insects, and M for Mammals.
import pandas as pd

data = {'Text1': ["All Vegetables are Plants",
                  "Cows are happy",
                  "Butterflies are really beautiful",
                  "I enjoy Mangoes",
                  "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables',
                  'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me',
                  'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}

df = pd.DataFrame(data)
print(df)
>>>
| | Text1 | Text2 | Relationship |
|---|---|---|---|
| 0 | All Vegetables are Plants | Some Plants are good Vegetables | PV123 |
| 1 | Cows are happy | Cows are enjoying | AM4355 |
| 2 | Butterflies are really beautiful | Beautiful butterflies are delightful to watch | AI784 |
| 3 | I enjoy Mangoes | Mango pleases me | PF897 |
| 4 | Vegetables are green | Spinach is green | PV776 |
I want to train a BERT model on this data. However, I also need examples of dissimilar sentences. My plan is to give the dataset as it is a label of 1, then shuffle Text2 and give the shuffled pairs a label of 0. The problem is that random shuffling alone does not produce good dissimilar examples; I need to make use of the "Relationship" column.
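To make the problem concrete, here is a minimal sketch of the naive shuffle described above. It assumes the first two letters of Relationship identify the category pair; the `same_prefix` column is a hypothetical helper added only to flag negatives that are not really dissimilar.

```python
import pandas as pd

df = pd.DataFrame({
    "Text1": ["All Vegetables are Plants", "Cows are happy",
              "Butterflies are really beautiful", "I enjoy Mangoes",
              "Vegetables are green"],
    "Text2": ["Some Plants are good Vegetables", "Cows are enjoying",
              "Beautiful butterflies are delightful to watch",
              "Mango pleases me", "Spinach is green"],
    "Relationship": ["PV123", "AM4355", "AI784", "PF897", "PV776"],
})

# Positive pairs: the dataset as it is.
positives = df[["Text1", "Text2"]].assign(label=1)

# Naive negatives: permute Text2 (carrying its Relationship code along).
shuffled = (df[["Text2", "Relationship"]]
            .sample(frac=1, random_state=0)
            .reset_index(drop=True))
negatives = pd.DataFrame({
    "Text1": df["Text1"],
    "Text2": shuffled["Text2"],
    "label": 0,
})

# Assumption: the two leading letters of Relationship name the category pair.
# A True here means the "negative" pair comes from the same category after all.
negatives["same_prefix"] = (
    df["Relationship"].str[:2].eq(shuffled["Relationship"].str[:2])
)
print(negatives)
```

Any row where `same_prefix` is True is a poor negative example, and a random permutation gives no guarantee that such rows are absent.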
How can I shuffle my data so that texts like "All Vegetables are Plants" and "Spinach is green" do not end up on the same row in Text1 and Text2 respectively? Both come from rows with the PV prefix, so that pair would not be truly dissimilar.
I resolved this by:
With this, I was able to really create semantically dissimilar pairs.
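One possible way to build such pairs is sketched below. This is an illustration under stated assumptions, not necessarily the code used above: it assumes the leading letters of Relationship identify the category, and the function name `make_negative_pairs` and its `prefix_len` parameter are hypothetical.

```python
import random
import pandas as pd

def make_negative_pairs(df, prefix_len=2, seed=0):
    """Pair each Text1 with a Text2 from a row whose Relationship prefix differs.

    Hypothetical helper; assumes the first `prefix_len` letters of
    Relationship encode the category.
    """
    rng = random.Random(seed)
    prefixes = df["Relationship"].str[:prefix_len]
    rows = []
    for idx, text1 in df["Text1"].items():
        own = prefixes.loc[idx]
        # Candidate partners: rows whose prefix differs from this row's prefix.
        candidates = prefixes[prefixes != own].index.tolist()
        if not candidates:
            continue  # no dissimilar partner exists for this row
        partner = rng.choice(candidates)
        rows.append({"Text1": text1,
                     "Text2": df.at[partner, "Text2"],
                     "label": 0})
    return pd.DataFrame(rows)

df = pd.DataFrame({
    "Text1": ["All Vegetables are Plants", "Cows are happy",
              "Butterflies are really beautiful", "I enjoy Mangoes",
              "Vegetables are green"],
    "Text2": ["Some Plants are good Vegetables", "Cows are enjoying",
              "Beautiful butterflies are delightful to watch",
              "Mango pleases me", "Spinach is green"],
    "Relationship": ["PV123", "AM4355", "AI784", "PF897", "PV776"],
})

negatives = make_negative_pairs(df)
positives = df[["Text1", "Text2"]].assign(label=1)
train = pd.concat([positives, negatives], ignore_index=True)
print(train)
```

With `prefix_len=2`, the two PV rows are never paired with each other, but an AM row may still be paired with an AI row. Setting `prefix_len=1` is stricter: it only pairs plant rows (P) with animal rows (A).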