r dataframe spacy part-of-speech

Join multiple values into the same cell in R


I have a data frame with part-of-speech (pos) values for each document, broken down into single tokens. How can I merge the individual pos values for a document into one cell, separated by commas? Right now I have something like this:

  doc_id sentence_id token_id    token  pos entity
1  text1           1        1   xxxxxx PRON       
2  text1           1        2     xxxx  AUX       
3  text1           1        3      xxx  AUX       
4  text1           1        4  xxxxxxx VERB       
5  text2           1        5     xxxx  DET       
6  text2           1        6      xxx NOUN  

How can I turn it into this?

  doc_id                      pos    entity
1  text1  PRON, AUX, AUX, VERB...       
2  text2  AUX, NOUN, PRON, ADJ...       
3  text3  ...
4  text4  ...  
5  text5  ...
6  text6  ...

Do I need to create a new data frame, or is there a spaCy function that can do this directly? Thank you.


Solution

  • You can collapse it like so:

    aggregate(pos ~ doc_id, doc_df, paste, collapse = ", ")
    

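    The result has one row per doc_id, with that document's pos values pasted into a single comma-separated string. You can store it in a separate data frame and merge in any other columns you need from the original, or, if you only need these two columns, use the result directly.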
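
To make the collapse-then-merge step concrete, here is a minimal sketch in base R. The toy `doc_df` is rebuilt from the rows shown in the question. The names `pos_by_doc`, `entity_by_doc`, `result`, and the helper `collapse_nonempty` are made up for illustration, and collapsing `entity` the same way as `pos` is an assumption about what the merged column should contain.

```r
# Toy data shaped like the table in the question (token values are placeholders).
doc_df <- data.frame(
  doc_id      = c("text1", "text1", "text1", "text1", "text2", "text2"),
  sentence_id = 1L,
  token_id    = 1:6,
  token       = c("xxxxxx", "xxxx", "xxx", "xxxxxxx", "xxxx", "xxx"),
  pos         = c("PRON", "AUX", "AUX", "VERB", "DET", "NOUN"),
  entity      = "",
  stringsAsFactors = FALSE
)

# One row per document; the extra argument collapse = ", " is passed on to paste().
pos_by_doc <- aggregate(pos ~ doc_id, data = doc_df, FUN = paste, collapse = ", ")

# Hypothetical helper: join only the non-empty entity labels of a document.
collapse_nonempty <- function(x) paste(x[nzchar(x)], collapse = ", ")
entity_by_doc <- aggregate(entity ~ doc_id, data = doc_df, FUN = collapse_nonempty)

# Bring the collapsed columns back together, matching rows on doc_id.
result <- merge(pos_by_doc, entity_by_doc, by = "doc_id")
print(result)
```

If only `doc_id` and `pos` are needed, `pos_by_doc` is already the final table and the `merge()` step can be skipped.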