How can I better store fastText word vectors in a DataFrame in order to use them for further calculations?
Hello everyone!
I have a question about fastText word vectors: I'd like to know how to save them in my dataframe as vectors rather than as objects. I want the column with word vectors to be numeric, as my next step is to calculate the average between different word forms.
Right now I use the following line to save word vectors into my dataframe:
full_forms["word_vec"] = full_forms.apply(lambda row: ft.get_word_vector(row["word_sg"]), axis=1)
After getting word vectors I try to calculate the average but it does not work:
full_forms["average"] = full_forms.apply(lambda row: row["word_vec":"word_vec_pl"].mean(axis=0))
One idea is to save the word vectors to a list, which would then become a numpy.ndarray. But I am not sure whether that is a good choice. I expect this array to have 300 dimensions, as fastText word vectors have 300 dimensions, but when I check the number of dimensions with the arr.ndim attribute, I get 1. Shouldn't it be 300?
This is my first time asking for help here, so sorry if it is too messy. Thank you in advance for your help! Have a nice day! Ana
For further calculations, usually the best approach is to not move the vectors into a DataFrame at all. Doing so brings up these sorts of type/size issues, and adds more indirection & data-structure overhead from the DataFrame's table/cells model.
Rather, leave them as the numpy.ndarray objects they are: either the individual 300-dimension arrays, or in some cases the giant (number_of_words, vector_size) matrix used by the FastText model itself to store all the words.
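On the arr.ndim question: ndim counts the axes of an array, not its length, so a single 300-dimension word vector is a 1-D array whose shape is (300,). A minimal sketch, using a zero array as a stand-in for a real vector (the name vec and the size 300 are assumptions taken from the question, not output from an actual model):

```python
import numpy as np

# Stand-in for one fastText vector; a real one would come from
# ft.get_word_vector(word). The size 300 is assumed from the question.
vec = np.zeros(300, dtype=np.float32)

print(vec.ndim)   # number of axes
print(vec.shape)  # length along each axis

# Stacking several such vectors gives a 2-D array laid out as
# (number_of_words, vector_size).
matrix = np.stack([vec, vec, vec])
print(matrix.ndim)
print(matrix.shape)
```

So the value to check is arr.shape, not arr.ndim: the 300 shows up as the length of the last axis.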
Using numpy functions directly on those will generally lead to the most concise & efficient code, with the least memory overhead.
For example, if word_list is a Python list of the words whose vectors you want to average:
average_vector = np.mean([ft.get_word_vector(word) for word in word_list], axis=0)
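If the goal is one average per pair of word forms (say, singular and plural), the same idea extends to whole matrices. The following is only a sketch: fake_word_vector is a hypothetical stand-in for ft.get_word_vector so the snippet runs without a model file, and the word lists and VECTOR_SIZE are made up for illustration.

```python
import numpy as np

VECTOR_SIZE = 300  # assumed, matching the size mentioned in the question


def fake_word_vector(word):
    """Hypothetical stand-in for ft.get_word_vector(word).

    Returns a deterministic pseudo-random float32 vector per word.
    """
    seed = sum(ord(ch) for ch in word)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(VECTOR_SIZE).astype(np.float32)


# Illustrative word forms; row i of each list belongs to the same lemma.
singular = ["cat", "dog", "house"]
plural = ["cats", "dogs", "houses"]

# One row per word: shape (number_of_words, VECTOR_SIZE).
sg_matrix = np.stack([fake_word_vector(w) for w in singular])
pl_matrix = np.stack([fake_word_vector(w) for w in plural])

# Element-wise average of each singular/plural pair, row by row.
pair_average = (sg_matrix + pl_matrix) / 2

print(pair_average.shape)
```

With a real model, replacing fake_word_vector with ft.get_word_vector should be the only change needed, and each row of pair_average is then the average vector for one pair of forms.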