Tags: python, machine-learning, nlp, word2vec, gensim

Using word2vec to classify words in categories


BACKGROUND

I have vectors with some sample data, and each vector has a category name (Places, Colors, Names).

['john','jay','dan','nathan','bob']  -> 'Names'
['yellow', 'red','green'] -> 'Colors'
['tokyo','bejing','washington','mumbai'] -> 'Places'

My objective is to train a model that takes a new input string and predicts which category it belongs to. For example, if a new input is "purple", then I should be able to predict 'Colors' as the correct category. If the new input is "Calgary", it should predict 'Places' as the correct category.

APPROACH

I did some research and came across word2vec. The gensim implementation has "similarity" and "most_similar" functions which I can use. So one brute-force approach I thought of is the following:

  1. Take the new input.
  2. Calculate its similarity with each word in each vector and take the average.

So, for instance, for the input "pink" I can calculate its similarity with the words in the "Names" vector, take an average, and then do the same for the other two vectors. The vector that gives me the highest average similarity would be the correct one for the input to belong to.
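
Here is a rough sketch of that brute-force idea, assuming gensim and a pre-trained model in word2vec format (the model file name is just a placeholder, and a query word missing from the model's vocabulary would still raise a KeyError):

    from gensim.models import KeyedVectors

    # Placeholder path: any pre-trained model in word2vec format should work here.
    model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

    data = {
        'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
        'Colors': ['yellow', 'red', 'green'],
        'Places': ['tokyo', 'bejing', 'washington', 'mumbai'],
    }

    def predict(word):
        # Average the similarity of the input word to every known word in each
        # category, then return the category with the highest average.
        averages = {}
        for name, words in data.items():
            known = [w for w in words if w in model]  # skip out-of-vocabulary words
            averages[name] = sum(model.similarity(word, w) for w in known) / len(known)
        return max(averages, key=averages.get)

    print(predict('pink'))     # should favour 'Colors'
    print(predict('calgary'))  # should favour 'Places'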

ISSUE

Given my limited knowledge of NLP and machine learning, I am not sure if that is the best approach, so I am looking for help and suggestions on better ways to solve my problem. I am open to all suggestions, and please also point out any mistakes I may have made, as I am new to the machine learning and NLP world.


Solution

  • If you're looking for the simplest / fastest solution then I'd suggest you take pre-trained word embeddings (Word2Vec or GloVe) and just build a simple query system on top of them. The vectors have been trained on a huge corpus and are likely to contain a good enough approximation of your domain data.

    Here's my solution below:

    import numpy as np
    
    # Category -> words
    data = {
      'Names': ['john','jay','dan','nathan','bob'],
      'Colors': ['yellow', 'red','green'],
      'Places': ['tokyo','bejing','washington','mumbai'],
    }
    # Words -> category
    categories = {word: key for key, words in data.items() for word in words}
    
    # Load the whole embedding matrix
    embeddings_index = {}
    with open('glove.6B.100d.txt') as f:
      for line in f:
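        # Each line holds a word followed by its 100-dimensional vector.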
        values = line.split()
        word = values[0]
        embed = np.array(values[1:], dtype=np.float32)
        embeddings_index[word] = embed
    print('Loaded %s word vectors.' % len(embeddings_index))
    # Embeddings for available words
    data_embeddings = {key: value for key, value in embeddings_index.items() if key in categories.keys()}
    
    # Processing the query
    def process(query):
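      # Assumes the query word is in the GloVe vocabulary; a missing word raises KeyError.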
      query_embed = embeddings_index[query]
      scores = {}
      for word, embed in data_embeddings.items():
        category = categories[word]
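        # Raw dot product as the similarity score, averaged over the category
        # size so larger categories don't win just by having more words.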
        dist = query_embed.dot(embed)
        dist /= len(data[category])
        scores[category] = scores.get(category, 0) + dist
      return scores
    
    # Testing
    print(process('pink'))
    print(process('frank'))
    print(process('moscow'))
    

    In order to run it, you'll have to download and unpack the pre-trained GloVe data from here (careful, ~800 MB!). Upon running, it should produce something like this:

    {'Colors': 24.655489603678387, 'Names': 5.058711671829224, 'Places': 0.90213905274868011}
    {'Colors': 6.8597321510314941, 'Names': 15.570847320556641, 'Places': 3.5302454829216003}
    {'Colors': 8.2919375101725254, 'Names': 4.58830726146698, 'Places': 14.7840416431427}
    

    ... which looks pretty reasonable. And that's it! If you don't need such a big model, you can filter the words in GloVe according to their tf-idf score. Remember that the model size only depends on the data you have and the words you might want to be able to query.
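
    For example, here is a minimal sketch of that trimming idea. It reuses the embeddings_index and data variables from the snippet above and a hand-picked query vocabulary instead of an actual tf-idf cutoff:

    # Keep only the vectors you expect to query; everything else can be dropped.
    allowed = {w for words in data.values() for w in words} | {'pink', 'frank', 'moscow'}
    small_index = {word: vec for word, vec in embeddings_index.items() if word in allowed}
    print('Kept %d of %d vectors.' % (len(small_index), len(embeddings_index)))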