Tags: python, dataframe, vector, word-embedding, fasttext

What is the best way to save fastText word vectors in a dataframe as numeric values?


How can I store fastText word vectors in a DataFrame so that I can use them for further calculations?

Hello everyone!

I have a question about fastText word vectors: how can I save them in my DataFrame as vectors rather than as objects? I want the column with the word vectors to be numeric, because my next step is to calculate the average across different word forms.

Right now I use the following line to save word vectors into my dataframe:

full_forms["word_vec"] = full_forms.apply(lambda row: ft.get_word_vector(row["word_sg"]), axis=1)

After getting the word vectors, I try to calculate the average, but it does not work:

full_forms["average"] = full_forms.apply(lambda row: row["word_vec":"word_vec_pl"].mean(axis=0))

One idea is to save the word vectors to a list, which would then become a numpy.ndarray, but I am not sure whether that is a good choice. I expect this array to have 300 dimensions, as fastText word vectors have 300 dimensions, but when I check the number of dimensions with the arr.ndim attribute, I get 1. Shouldn't it be 300?

This is my first time asking for help here, so sorry if it is too messy. Thank you in advance for your help! Have a nice day! Ana


Solution

  • For further calculations, the best approach is usually not to move the vectors into a DataFrame at all. Doing so brings up these sorts of type/size issues and adds indirection and data-structure overhead from the DataFrame's table/cells model.

    Rather, leave them as the numpy.ndarray objects they are – either the individual 300-dimension arrays, or in some cases the giant (number_of_words, vector_size) matrix used by the FastText model itself to store all the words.

    Using numpy functions directly on those will generally lead to the most concise & efficient code, with the least memory overhead.

    For example, if word_list is a Python list of the words whose vectors you want to average:

    import numpy as np
    average_vector = np.mean([ft.get_word_vector(word) for word in word_list], axis=0)
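On the ndim question: ndim counts the axes of an array, not the number of components, so a single 300-dimension word vector has ndim 1 and shape (300,). The sketch below illustrates that, along with the stacked-matrix approach for averaging singular/plural pairs. It is only a sketch under stated assumptions: get_word_vector here is a hypothetical stand-in that returns pseudo-random vectors so the example runs without a model file, and the word lists are made up. With a real model you would call ft.get_word_vector instead.

```python
import numpy as np

VECTOR_SIZE = 300  # vector length mentioned in the question


def get_word_vector(word):
    """Hypothetical stand-in for ft.get_word_vector (NOT the fastText API):
    returns a deterministic pseudo-random vector so the sketch runs offline."""
    seed = sum(ord(ch) for ch in word)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(VECTOR_SIZE).astype(np.float32)


# Made-up word forms, aligned by position (singular[i] pairs with plural[i]).
singular = ["cat", "dog", "house"]
plural = ["cats", "dogs", "houses"]

# Stack the per-word vectors into 2-D matrices of shape (number_of_words, VECTOR_SIZE).
sg_matrix = np.stack([get_word_vector(w) for w in singular])
pl_matrix = np.stack([get_word_vector(w) for w in plural])

one_vector = sg_matrix[0]
# ndim is the number of axes; shape holds the length along each axis.
print(one_vector.ndim, one_vector.shape)  # 1 (300,)
print(sg_matrix.ndim, sg_matrix.shape)    # 2 (3, 300)

# Average each singular/plural pair in one vectorized step: result is (3, 300).
pair_average = (sg_matrix + pl_matrix) / 2
```

Each row of pair_average is the mean of one singular/plural pair, the same result np.mean would give for that pair with axis=0. If the vectors already sit in a DataFrame column of arrays, something like np.stack(full_forms["word_vec"].to_numpy()) should recover such a matrix, assuming every cell holds an array of the same length.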