Tags: label, spell-checking, text-processing, text-parsing, word-cloud

From "descriptions with typos" to "labels"


Background

I have an image dataset (similar to ImageNet) that comes with a "description with typos" for each image. I would like to run a deep convolutional neural network on it, but I need to generate the "labels" first. So, here's the question:

Question

How can I generate category "labels" from "descriptions with typos"?

Technical information

The dataset has around 13M images, each with a corresponding (valid) "description" and optional "typos". Some examples of "descriptions" follow below:

[Images: first example, second example]

Ideas

I was thinking of approaching the problem in the following way.

  1. Fix typos:
    • run a spell check to identify spelling errors;
    • find the best replacement for each misspelled word, by
      • looking at other descriptions in the dataset, or
      • checking the image and correcting the typo manually;
  2. Generate the final labels:
    • run a clustering algorithm (k-means, for example) on a sentence embedding (a function that maps sentences into ℝᴺ), or
    • use the most frequent words.
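
To make the two steps above concrete, here is a minimal sketch of the "look at other descriptions" and "most frequent words" variants, using only the Python standard library. Everything in it is an assumption for illustration: the toy `descriptions`, the tiny `STOPWORDS` set, the `min_count` and `cutoff` thresholds, and the helper names are all hypothetical. Rare words are treated as likely typos and replaced by the closest frequent word from the dataset itself (via `difflib.get_close_matches`); each description is then labeled with its most frequent content word.

```python
from collections import Counter
import difflib
import re

# Hypothetical toy data standing in for the real 13M descriptions.
descriptions = [
    "red apple on a table",
    "a red aple",
    "green apple",
    "banana on a plate",
    "yellow bananna",
    "ripe banana",
]

# Assumed, deliberately tiny stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "on", "of", "in"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(texts, min_count=2):
    """Treat words seen at least `min_count` times as correctly spelled."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocabulary = sorted(w for w, c in counts.items() if c >= min_count)
    return vocabulary, counts


def correct(token, vocabulary, counts, cutoff=0.8):
    """Map a rare token to the most frequent similar vocabulary word, if any."""
    if token in vocabulary:
        return token
    candidates = difflib.get_close_matches(token, vocabulary, n=3, cutoff=cutoff)
    if not candidates:
        return token  # nothing similar enough: leave the word unchanged
    return max(candidates, key=lambda w: counts[w])


def label_descriptions(texts, min_count=2):
    """Step 1: fix typos. Step 2: label with the most frequent content word."""
    vocabulary, counts = build_vocabulary(texts, min_count)
    corrected = [
        [correct(tok, vocabulary, counts) for tok in tokenize(text)]
        for text in texts
    ]
    # Recount after correction so fixed typos contribute to the right word.
    corrected_counts = Counter(
        tok for toks in corrected for tok in toks if tok not in STOPWORDS
    )
    labels = []
    for toks in corrected:
        content = [t for t in toks if t not in STOPWORDS]
        if content:
            labels.append(max(content, key=lambda w: corrected_counts[w]))
        else:
            labels.append(None)  # description had no usable words
    return corrected, labels


corrected, labels = label_descriptions(descriptions)
for text, label in zip(descriptions, labels):
    print(f"{text!r} -> {label}")
```

Note that `difflib` compares each rare word against the whole vocabulary, so at the scale of 13M descriptions this would probably need a faster candidate lookup. The thresholds would also have to be tuned on real data, since a rare but valid word that resembles a frequent one would be "corrected" by mistake.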

Solution

  • Here are some ideas:

    1. You should definitely run a spell check, otherwise your labels will be even noisier. Options:
    2. Regarding labeling (I guess you want it done automatically; otherwise there are semi-automatic methods):