nlp  entity  spacy  named-entity-extraction

spaCy: Generate generic sentences and then train the model on top of them. Is it a good idea?


I am training a model from scratch to predict food items in text. I have tagged around 500 sentences to train my model, and the accuracy is pretty good. However, I am a bit worried about unseen real-world data, so I have come up with an idea and would like to hear what someone more experienced thinks of it.

The idea is to convert the 500 sentences into maybe 10,000 sentences. To do this, I first replaced the actual entity in each sentence with a placeholder tag and then filled that placeholder with every possible entity. An example follows:

Original training Sentences:

  1. "Tesco sold fifty thousand pizza last year. " --- Food = pizza
  2. "He loves to eat pudding when he is alone." --- Food = pudding Generic Sentences:
  3. "Tesco sold fifty thousand last year. "
  4. "He loves to eat when he is alone."

Food List:

  1. pizza
  2. pudding

New generated training sentences:

  1. "Tesco sold fifty thousand pizza last year. " --- Food = pizza
  2. "Tesco sold fifty thousand pudding last year. " --- Food = pudding
  3. "He loves to eat pizza when he is alone." --- Food = pizza
  4. "He loves to eat pudding when he is alone." --- Food = pudding

So is it a good idea to generate training sentences like this? Benefits I can see:

  1. More training sentences.
  2. Each entity appears in many examples instead of just one or two.
  3. Possibly higher accuracy.

What issues could there be? Please let me know your thoughts on this approach. Thanks.


Solution

  • This approach is called augmenting training data with synthetic data.

    It can definitely be a very useful technique when your training data is limited. However, in my experience it should be used carefully and in moderation, otherwise you run the risk of overfitting your model to the training data. In other words, your model might have difficulty generalizing beyond the entities in your food list because it has seen them so many times during training that it comes to expect them. As you mentioned, overfitting may also arise through repeated sentence structures.

    The synthetic data should be generated as randomly as possible; the sample() function in Python's random module can help here.
    For each sentence in the initial training set, draw a small random sample of foods from your list, and substitute each sampled food into the sentence to produce a new sentence.
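
    As a minimal sketch of that sampling step, assuming the templated sentences use a {food} placeholder as in the question (function and variable names here are illustrative, not a fixed recipe):

    ```python
    import random

    def augment_with_samples(templates, foods, sample_size=3, label="FOOD", seed=42):
        """For each templated sentence, substitute a small random sample of foods."""
        rng = random.Random(seed)  # seeded so the augmentation is reproducible
        examples = []
        for template in templates:
            # sample() draws without replacement, so each template gets
            # sample_size distinct foods rather than the whole list
            for food in rng.sample(foods, k=min(sample_size, len(foods))):
                text = template.format(food=food)
                start = text.index(food)
                examples.append((text, {"entities": [(start, start + len(food), label)]}))
        return examples
    ```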

    A slightly different approach (which may generalize better to unseen food entities) is, instead of using only the food list from your 500 training sentences, to download a larger list of foods and use that.

    Lists of foods can be found on GitHub (for example, here or here) or extracted from Wikipedia (here).

    In both cases, using a sample size of n produces an n-fold increase in training data.
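
    As a hypothetical usage example, reusing the augment_with_samples sketch above (the file name and sample size are placeholders): read a downloaded food list, one item per line, and generate roughly n new sentences per template.

    ```python
    # Hypothetical file of downloaded food names, one per line.
    with open("foods.txt", encoding="utf-8") as f:
        downloaded_foods = [line.strip() for line in f if line.strip()]

    # With 500 generic sentences and a sample size of n = 5, this yields
    # roughly 500 * 5 = 2500 synthetic training sentences.
    synthetic_data = augment_with_samples(generic_sentences, downloaded_foods, sample_size=5)
    ```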