python, pandas, numpy, countvectorizer

DataFrame .join creates NaN-valued column from actual values


What I want to do is create a bag of words for 11410 strings and then append the 'result' column, which I have stored in another dataframe, after the word columns. However, when I join it onto my bag-of-words dataframe, the new column ends up full of NaN values.

My bag-of-words dataframe is 11410 x 111, and I want to add the 'result' column as the last column. My code is as follows:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(df_train['text'])    # fit the vectorizer and build the bag of words
bow_df = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out())    # turn the result into a dataframe
res = df_train['result']      # column of the other dataframe that I want to insert
bow_df = bow_df.join(res)     # this SHOULD (? but doesn't) do what I want

Therefore I end up with an 11410 x 112 dataframe, but the last column is full of NaNs.
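A small self-contained example that reproduces the same behaviour (made-up data, not my actual dataset):

import pandas as pd

# bag-of-words style frame with the default index 0..4
bow_small = pd.DataFrame({'vaccine': [0, 1, 0, 0, 1],
                          'world':   [1, 0, 0, 1, 0]})

# result column whose index still carries the original row labels of df_train
res_small = pd.Series(['POS', 'NEU', 'NEG', 'NEU', 'POS'],
                      index=[226115, 191228, 198033, 100300, 208472],
                      name='result')

print(bow_small.join(res_small))   # the joined 'result' column comes out all NaN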

My res structure:

226115    POS
191228    NEU
198033    NEG
100300    NEU
208472    POS
         ... 
119879    POS
103694    NEU
131932    NEU
146867    NEU
121958    NEU

My bow_df structure:

 age ages also amp apollo approval approved arm astrazeneca aug  ...  \
0       0    0    0   0      0        0        0   0           0   0  ...   
1       0    0    0   0      0        0        0   0           0   0  ...   
2       0    0    0   0      0        0        0   0           0   0  ...   
3       0    0    0   0      0        0        0   0           0   0  ...   
4       0    0    0   0      0        0        1   0           0   0  ...   
...    ..  ...  ...  ..    ...      ...      ...  ..         ...  ..  ...   
11405   0    0    0   0      0        1        0   0           0   0  ...   
11406   0    0    0   0      0        0        0   0           0   0  ...   
11407   0    0    0   0      0        0        0   0           0   0  ...   
11408   1    0    0   0      0        0        0   0           0   1  ...   
11409   1    0    0   0      0        0        0   0           0   0  ...   

      urban us use vaccinated vaccination vaccine vaccines world would year  
0         0  0   0          0           0       0        0     0     0    0  
1         0  0   0          0           0       0        0     0     0    0  
2         0  0   0          0           0       0        0     0     0    0  
3         0  0   0          0           0       0        1     0     0    0  
4         0  0   0          0           0       1        0     0     0    0  
...     ... ..  ..        ...         ...     ...      ...   ...   ...  ...  
11405     0  0   1          0           0       0        0     0     0    0  
11406     0  0   0          0           0       0        0     0     0    0  
11407     0  0   0          0           0       0        0     0     0    0  
11408     0  0   0          0           0       0        0     0     0    0  
11409     0  0   0          0           0       0        0     0     0    0  

I even tried bow_df = bow_df.astype(str) in case it was a type issue, but that didn't work.

Thanks everyone.


Solution

  • It is because the indexes are not aligned: bow_df has a fresh default index (0 to 11409), while res keeps the original index of df_train, so .join finds no matching labels and fills the column with NaN. Try bow_df['result'] = res.values to drop the right-hand-side index and assign by position.
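A minimal sketch of that fix (small made-up data, not the original 11410-row dataset):

import pandas as pd

bow_df = pd.DataFrame({'vaccine': [0, 1, 0]})                   # default index 0, 1, 2
res = pd.Series(['POS', 'NEU', 'NEG'],
                index=[226115, 191228, 198033], name='result')  # original df_train labels

bow_df['result'] = res.values   # assign the raw values positionally, ignoring the index
print(bow_df)                   # the 'result' column now holds POS / NEU / NEG

# an equivalent alternative: reset the index on the right-hand side before joining
# bow_df = bow_df.join(res.reset_index(drop=True))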