
From "descriptions with typos" to "labels"


I have an image dataset (similar to ImageNet) in which each image comes with a "description with typos". I would like to train a deep convolutional neural network on it, but I need to generate the "labels" first. So, here's the question:


How can I generate category "labels" from "descriptions with typos"?

Technical information

The dataset has around 13M images, each with a corresponding (valid) "description" and optional "typos". Some example "descriptions" follow:

[Images: first example, second example]


I was thinking of approaching the problem in the following way.

  1. Fix typos:
    • Run a spell check to identify spelling errors;
    • Find the best replacement word for each error, by
      • looking at other descriptions in the dataset, or
      • checking the image and correcting the typo manually;
  2. Generate the final labels:
    • run a clustering algorithm (k-means, for example) on a sentence embedding (a function that maps sentences into ℝᴺ), or
    • use the most frequent words.
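
The steps above could be sketched roughly as follows, using only the Python standard library. This is a minimal illustration, not a tested pipeline: it takes the "look at other descriptions in the dataset" route for typo fixing and the "most frequent words" route for labels. The function names, the `min_count` and `cutoff` thresholds, the stopword list, and the toy descriptions are all assumptions made up for the example.

```python
import re
from collections import Counter
from difflib import get_close_matches

# Assumed stopword list; a real one would be much longer.
STOPWORDS = {"a", "in", "on", "the"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def correct_token(token, counts, trusted, cutoff=0.8):
    """Map a rare token to the closest trusted word, if one is similar enough."""
    if token in counts and token in trusted:
        return token
    candidates = get_close_matches(token, trusted, n=3, cutoff=cutoff)
    if not candidates:
        return token  # nothing close enough: leave it for manual review
    # Among the close candidates, prefer the word seen most often in the dataset.
    return max(candidates, key=lambda word: counts[word])


def frequent_word_labels(descriptions, num_labels=2, min_count=2):
    """Return (corrected token lists, one label or None per description)."""
    counts = Counter(tok for d in descriptions for tok in tokenize(d))
    # Words seen at least min_count times are assumed to be spelled correctly.
    trusted = sorted(w for w, c in counts.items() if c >= min_count)

    corrected = [
        [correct_token(tok, counts, trusted) for tok in tokenize(d)]
        for d in descriptions
    ]

    # The label vocabulary is the num_labels most frequent non-stopwords.
    content_counts = Counter(
        tok for toks in corrected for tok in toks if tok not in STOPWORDS
    )
    label_words = [w for w, _ in content_counts.most_common(num_labels)]

    # Each description gets the highest-ranked label word it contains.
    labels = [
        next((w for w in label_words if w in toks), None) for toks in corrected
    ]
    return corrected, labels


# Toy data, invented for illustration only.
descriptions = [
    "red car on the road",
    "blue car in a garage",
    "small dog on the grass",
    "brown dog on the grass",
    "black doog on the gras",
    "white caar on the road",
    "big dog in a park",
    "old car in a garage",
]

corrected, labels = frequent_word_labels(descriptions)
for tokens, label in zip(corrected, labels):
    print(label, "<-", " ".join(tokens))
```

Comparing every unknown token against the whole vocabulary with `difflib` is slow, so with 13M descriptions one would probably want a dedicated spell-correction index instead. Descriptions that contain none of the chosen label words get `None` here, and a typo that is itself frequent would be wrongly treated as trusted, so both cases would still need manual review or the clustering route.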


  • Here are some ideas:

    1. You should definitely run a spell check; otherwise your labels will be even noisier. Options:
    2. Regarding labeling (I guess you want it to be automatic; otherwise there are semi-automatic methods):