
Calculating similarity score in contexto.me clone


I am trying to clone the popular browser game contexto.me, and I am having trouble calculating the similarity score between two words (the target word and the guess the user enters). I can get the cosine similarity between the two words, but I don't know how to turn that value into a clean integer like the one the game shows.

For example, if the target word is 'helicopter' and I guess 'plane', contexto returns a score of something like 13, but if I guess a word like 'king', it returns something like 2000.

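import torch
import torchtext
from flask import Flask, request, render_template
from pattern.en import singularize  # assumption: the question does not show where singularize comes from

app = Flask(__name__)

target_word = "helicopter"
glove = torchtext.vocab.GloVe(name="6B", dim=100)


@app.route('/', methods=["GET", "POST"])
def getSimScore():
    if request.method == "POST":
        text = request.form.get("word")
        new_text = singularize(text)
        # Cosine similarity between the target and the guess (closer to 1 means more similar)
        sim_score = torch.cosine_similarity(glove[target_word].unsqueeze(0), glove[new_text].unsqueeze(0)).item()
        print(sim_score)
    return render_template('homepage.html', messageText='sample text', gameNum=1, guessNum=1, wordAccuracy=999)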

This is my code so far. sim_score prints ~0.77 for the input 'truck' and ~0.29 for the input 'king' (the closer to 1, the more similar the word is to the target word).


Solution

  • For example, if the target word is 'helicopter' and I guess the word plane, contexto will return something like a similarity score of 13, but if I guess a word like 'king' contexto will return a score of '2000' for instance.

    This metric is typically called "rank," and you can calculate it with the following algorithm.

    1. Compute the similarity score between the target word and every word the user can enter.
    2. Sort this list from most similar to least similar.
    3. Given a specific guess, find the position at which its score appears in the list. If the score appears at index 0, then it is rank 1. If it appears at index 4, then it is rank 5, and so on.

    Steps 1 and 2 depend only on the target word, so for speed you can compute them once ahead of time. Each guess then becomes a simple lookup.
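
The three steps above can be sketched roughly as follows. This is a minimal illustration, not contexto's actual implementation: `build_rank_table` and `toy_vectors` are hypothetical names, and the tiny two-dimensional vectors stand in for the real GloVe embeddings so that the example runs without downloading anything. The sketch also assumes the target word itself stays in the vocabulary, so it ends up at rank 1.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_rank_table(target_word, vectors):
    """Map every word in `vectors` to its 1-based rank relative to `target_word`."""
    target_vec = vectors[target_word]
    # Step 1: score every word the user could enter against the target.
    scored = [(word, cosine_similarity(target_vec, vec)) for word, vec in vectors.items()]
    # Step 2: sort from most similar to least similar.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Step 3: index 0 becomes rank 1, index 4 becomes rank 5, and so on.
    return {word: index + 1 for index, (word, _) in enumerate(scored)}


# Toy stand-in for the GloVe vocabulary (assumption: values are made up for illustration).
toy_vectors = {
    "helicopter": [1.0, 0.0],
    "plane": [0.9, 0.1],
    "truck": [0.6, 0.4],
    "king": [0.0, 1.0],
}

# Computed once per target word; each guess is then a dictionary lookup.
ranks = build_rank_table("helicopter", toy_vectors)
print(ranks["plane"], ranks["king"])
```

With real embeddings, the same table could be built once when the target word is chosen, and the Flask route would then only need something like `ranks.get(new_text)` to handle guesses that are not in the vocabulary.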