I am training a model from scratch to predict food items from text. I have tagged around 500 sentences to train the model and the accuracy is pretty good. However, I am a bit worried about unseen real-world data, so I have come up with an idea and would like to hear what someone more experienced thinks of it.
Original training Sentences:
Food List:
New generated training sentences:
Is it a good idea to generate training sentences like this? Benefits as I see them:
Issues could be:
Thanks! Please let me know your thoughts on this approach.
This approach is called augmenting training data with synthetic data.
It can definitely be a very useful technique when your training data is limited. However, in my experience it should be used carefully and in moderation, otherwise you run the risk of overfitting your model to the training data. In other words, your model might have difficulty generalizing beyond the entities in your food list, because it has seen them so many times during training that it comes to expect them. Also, as you mentioned, overfitting may arise through repeated sentence structures.
The synthetic permutation data should be generated as randomly as possible. One can use the sample() function from Python's random library.
For each sentence in the initial training data set, draw a small sample of foods from your list, and substitute each sampled food into the sentence to produce a new sentence, as in the sketch below.
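A minimal sketch of that substitution step, assuming the tagged food in each original sentence is known; the example sentences, food list, and the augment helper are hypothetical, not from the original post:

```python
import random

# Hypothetical examples; replace with your own tagged sentences and food list.
training_sentences = [
    "I had a bowl of oatmeal for breakfast.",
    "She ordered a pizza with extra cheese.",
]
original_mentions = ["oatmeal", "pizza"]  # food tagged in each sentence
food_list = ["oatmeal", "pizza", "sushi", "falafel", "lasagna", "ramen"]

def augment(sentences, mentions, foods, n=3, seed=42):
    """For each sentence, sample n foods and substitute each one for the
    tagged food, producing up to n new sentences per original sentence."""
    rng = random.Random(seed)
    synthetic = []
    for sentence, tagged_food in zip(sentences, mentions):
        for replacement in rng.sample(foods, n):
            if replacement != tagged_food:  # skip no-op substitutions
                synthetic.append(sentence.replace(tagged_food, replacement))
    return synthetic

new_sentences = augment(training_sentences, original_mentions, food_list)
print(len(new_sentences), new_sentences[0])
```

Applied to all 500 tagged sentences with n = 3, this yields roughly 1,500 synthetic sentences on top of the originals.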
A slightly different approach (which can perhaps generalize better to unseen food entities) is, instead of using the food list from your 500 training sentences, to download a more comprehensive list of foods and use that.
Lists of foods can be found on GitHub, for example: here or here
or extracted from Wikipedia (here)
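If you go this route, the external list can be fed straight into the same substitution routine. A sketch, assuming the downloaded list is saved as a plain-text file with one food per line (the foods.txt filename is hypothetical):

```python
# Hypothetical filename: a downloaded food list, one food per line.
with open("foods.txt", encoding="utf-8") as f:
    external_food_list = [line.strip() for line in f if line.strip()]

# Pass it to the same substitution routine as before, e.g.:
# new_sentences = augment(training_sentences, original_mentions, external_food_list)
```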
In both cases, using a sample size of n produces an n-fold increase in the training data.