Tags: nlp, stanford-nlp, bert-language-model, cosine-similarity, nlp-question-answering

How can I find the cosine similarity between two song lyrics represented as strings?


My friends and I are doing an NLP project on song recommendation.

Context: We originally planned to have the model produce a recommended playlist of songs whose lyrics are most similar to a random input corpus (from literature, etc.), but we didn't have a concrete idea of how to implement it.

Currently our task is to find lyrics similar to a random lyric fed in as a string input. We are using the Sentence-BERT (SBERT) model with cosine similarity to measure similarity between songs, and the output scores seem meaningful enough to identify the most similar song lyrics.

Is there any other way that we can improve this approach?

We'd like to use a BERT model and are open to suggestions that can be used on top of BERT if possible, but if there are any other models that should be used instead of BERT, we'd be happy to learn about them. Thanks.


Solution

  • Computing cosine similarity

    You can use util.cos_sim(embeddings1, embeddings2) from the sentence-transformers package to compute the cosine similarity of two embeddings.

    Alternatively, you can use sklearn.metrics.pairwise.cosine_similarity(X, Y, dense_output=True) from the scikit-learn package.
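
    As a minimal sketch of the sentence-transformers route (the model name and lyric strings below are placeholder choices, not recommendations):

    from sentence_transformers import SentenceTransformer, util

    # Encode two lyrics and compare them with cosine similarity.
    model = SentenceTransformer('all-MiniLM-L6-v2')  # example model choice

    lyrics_a = "first song lyrics as a string"
    lyrics_b = "second song lyrics as a string"

    embeddings = model.encode([lyrics_a, lyrics_b], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1])
    print(score.item())  # value in [-1, 1]; higher means more similar lyrics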

  • Improvements for representation and models

    Since you want suggestions on top of BERT, you can also consider RoBERTa, which uses a Byte-Pair Encoding tokenizer instead of BERT's WordPiece tokenizer. Consider using the roberta-base model as a feature extractor from the HuggingFace transformers package.

    from transformers import RobertaTokenizer, RobertaModel

    # Load the pretrained RoBERTa tokenizer and model as a feature extractor.
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    model = RobertaModel.from_pretrained('roberta-base')

    text = "song lyrics in text."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)  # output.last_hidden_state holds one vector per token
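
    One way to turn those token-level outputs into a single lyric embedding and compare two lyrics with cosine similarity is sketched below; it assumes simple mean pooling, which is only one of several pooling strategies:

    import torch
    from transformers import RobertaTokenizer, RobertaModel

    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    model = RobertaModel.from_pretrained('roberta-base')

    def embed(text):
        # Tokenize, run the model, and mean-pool the token embeddings
        # into a single vector representing the whole lyric.
        encoded = tokenizer(text, return_tensors='pt', truncation=True)
        with torch.no_grad():
            output = model(**encoded)
        return output.last_hidden_state.mean(dim=1)

    similarity = torch.nn.functional.cosine_similarity(
        embed("first song lyrics"), embed("second song lyrics")
    )
    print(similarity.item())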
    

    Tokenizers work at various levels of text granularity, spanning syntax and semantics, and they help generate quality vectors/embeddings. Each can yield different, and potentially better, results if fine-tuned for the right task and model.

    Some other tokenizers you can consider are: character-level BPE, byte-level BPE, WordPiece (BERT uses this), SentencePiece, and the Unigram language model tokenizer. A quick comparison is sketched below.
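
    To see how tokenizer choice changes the representation, this small sketch (assuming two off-the-shelf checkpoints: WordPiece via bert-base-uncased and byte-level BPE via roberta-base) tokenizes the same lyric with both:

    from transformers import AutoTokenizer

    text = "song lyrics in text."
    for name in ('bert-base-uncased', 'roberta-base'):
        # WordPiece (BERT) and byte-level BPE (RoBERTa) split the same text differently.
        tokenizer = AutoTokenizer.from_pretrained(name)
        print(name, tokenizer.tokenize(text))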

    Also consider exploring the official HuggingFace Tokenizers library documentation.