Tags: r, loops, text2vec

looping to tokenize using text2vec


Edited to shorten and provide sample data.

I have text data consisting of 8 questions asked of a number of participants twice. I want to use text2vec to compare the similarity of their responses to these questions at the two points in time (duplicate detection). Here is how my initial data is structured (in this example there are just 3 participants, 4 questions instead of 8, and 2 quarters/time periods). I want to do a similarity comparison of each participant's response in the first quarter vs. the second quarter, using the psim2 function from the text2vec package.

df<-read.table(text="ID,Quarter,Question,Answertext
               Joy,1,And another question,adsfjasljsdaf jkldfjkl
               Joy,2,And another question,dsadsj jlijsad jkldf 
               Paul,1,And another question,adsfj aslj sd afs dfj ksdf
               Paul,2,And another question,dsadsj jlijsad
               Greg,1,And another question,adsfjasljsdaf
               Greg,2,And another question, asddsf asdfasd sdfasfsdf
               Joy,1,this is the first question that was asked,this is joys answer to this question
               Joy,2,this is the first question that was asked,this is joys answer to this question
               Paul,1,this is the first question that was asked,this is Pauls answer to this question
               Paul,2,this is the first question that was asked,Pauls answer is different 
               Greg,1,this is the first question that was asked,this is Gregs answer to this question nearly the same
               Greg,2,this is the first question that was asked,this is Gregs answer to this question
               Joy,1,This is the text of another question,more random text
               Joy,2,This is the text of another question, adkjjlj;ds sdafd
               Paul,1,This is the text of another question,more random text
               Paul,2,This is the text of another question, adkjjlj;ds sdafd
               Greg,1,This is the text of another question,more random text
               Greg,2,This is the text of another question,sdaf asdfasd asdff
               Joy,1,this was asked second.,some random text
               Joy,2,this was asked second.,some random text that doesn't quite match joy's response the first time around
               Paul,1,this was asked second.,some random text
               Paul,2,this was asked second.,some random text that doesn't quite match Paul's response the first time around
               Greg,1,this was asked second.,some random text
               Greg,2,this was asked second.,ada dasdffasdf asdf  asdfa fasd sdfadsfasd fsdas asdffasd
", header=TRUE,sep=',')

I've done some more thinking and I believe the right approach is to split the dataframe into a list of dataframes, not separate items.

questlist<-split(df,f=df$Question)

then write a function to create the vocabulary for each question.

library(text2vec)

vocabmkr<-function(x) { itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary() %>% prune_vocabulary(term_count_min = 2) %>% vocab_vectorizer() }

test<-lapply(questlist, vocabmkr)

But then I think I need to split the original dataframe into question-quarter combinations and apply the corresponding vocabulary from the first list to each, and I'm not sure how to go about that.

Ultimately, I want a similarity score telling me if the participants are duplicating some or all of their responses from the first and second quarters.

EDIT: Here is how I would do this for a single question starting with the above dataframe.

library(dplyr) # filter() and full_join() below come from dplyr
quest1 <- filter(df, Question=="this is the first question that was asked")
quest1vocab <- itoken(as.character(quest1$Answertext), ids=quest1$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()

quest1q1<-filter(quest1,Quarter==1)
quest1q1<-itoken(as.character(quest1q1$Answertext),ids=quest1q1$ID) # tokenize question1 quarter 1

quest1q2<-filter(quest1,Quarter==2) 
quest1q2<-itoken(as.character(quest1q2$Answertext),ids=quest1q2$ID) # tokenize question1 quarter 2

#now apply the vocabulary to the two matrices
quest1q1<-create_dtm(quest1q1,quest1vocab)
quest1q2<-create_dtm(quest1q2,quest1vocab)

similarity<-psim2(quest1q1, quest1q2, method="jaccard", norm="none") #row by row similarity.

b<-data.frame(ID=names(similarity),Similarity=similarity,row.names=NULL) #make dataframe of similarity scores
endproduct<-full_join(b,quest1)
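
One thing worth noting about the psim2 call above: it compares row i of the first DTM with row i of the second, matching by position rather than by ID. Both itoken streams here come from the same dataframe filtered by Quarter, so the participant order lines up, but a cheap guard doesn't hurt (create_dtm stores the ids= values as rownames):

```r
# psim2 pairs rows by position, not by ID, so confirm both DTMs list
# the participants in the same order before comparing them.
stopifnot(identical(rownames(quest1q1), rownames(quest1q2)))
```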

Edit: Ok, I have worked with the lapply some more.

df1<-split.data.frame(df,df$Question) #now we have 4 dataframes in the list, 1 for each question

vocabmkr<-function(x) {
  itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
}

vocab<-lapply(df1,vocabmkr) #this gets us another list and in it are the 4 vocabularies.

dfqq<-split.data.frame(df,list(df$Question,df$Quarter)) #and now we have 8 items in the list - each list is a combination of question and quarter (4 questions over 2 quarters)

How do I apply the vocab list (consisting of 4 elements) to the dfqq list (consisting of 8)?
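
For what it's worth, one way to line the two lists up is to match on names: split.data.frame with a two-factor list names each element "Question.Quarter", so stripping the trailing ".quarter" recovers the question name, which can then index the vocab list. A sketch along those lines (the Map/name-matching pairing is my assumption about the intent, continuing from the vocab and dfqq objects above):

```r
# Each element of dfqq is named "<Question>.<Quarter>"; strip the trailing
# ".<digit>" to recover the question, then pull the matching vectorizer
# from vocab by name.
qnames <- sub("\\.\\d+$", "", names(dfqq))

dtms <- Map(function(chunk, vectorizer) {
  it <- itoken(as.character(chunk$Answertext), ids = chunk$ID)
  create_dtm(it, vectorizer)
}, dfqq, vocab[qnames])

# Pair each question's quarter-1 DTM with its quarter-2 DTM and compare.
sims <- Map(function(m1, m2) psim2(m1, m2, method = "jaccard", norm = "none"),
            dtms[grepl("\\.1$", names(dtms))],
            dtms[grepl("\\.2$", names(dtms))])
```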


Solution

  • I'm sorry that's been frustrating. In case you have more questions to process and want a more automatic way to do it, here's one approach that might work for you:

    First, convert your example code for a single dataframe into a function:

    analyze_vocab <- function(df_) {
      quest1vocab =
        itoken(as.character(df_$Answertext), ids = df_$ID) %>%
        create_vocabulary() %>%
        prune_vocabulary(term_count_min = 1) %>%
        vocab_vectorizer()
    
      quarter1 = filter(df_, Quarter == 1)
      quarter1 = itoken(as.character(quarter1$Answertext), 
                        ids = quarter1$ID)
    
      quarter2 = filter(df_, Quarter == 2)
      quarter2 = itoken(as.character(quarter2$Answertext),
                        ids = quarter2$ID)
    
      q1mat = create_dtm(quarter1, quest1vocab)
      q2mat = create_dtm(quarter2, quest1vocab)
    
      similarity = psim2(q1mat, q2mat, method = "jaccard", norm = "none")
    
      b = data.frame(
        ID = names(similarity),
        Similarity = similarity)
    
      output <- full_join(b, df_)
      return(output)
    }
    

    Now, you can split if you want and then use lapply like this: lapply(split(df, df$Question), analyze_vocab). However, you already seem comfortable with piping so you might as well go with that approach:

    similarity_df <- df %>% 
      group_by(Question) %>%
      do(analyze_vocab(.))
    
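    A side note: in recent dplyr versions do() is documented as superseded, and group_modify() is the closer modern equivalent. It hands each group to the function without the grouping column (which is fine here, since analyze_vocab only joins on ID) and re-attaches Question to the result:

```r
# Same result as the do() pipeline above, using the newer group_modify().
# .x is the group's rows minus the Question column; group_modify adds
# Question back onto whatever the function returns.
similarity_df <- df %>%
  group_by(Question) %>%
  group_modify(~ analyze_vocab(.x)) %>%
  ungroup()
```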

    Output:

    > head(similarity_df, 12)
    # A tibble: 12 x 5
    # Groups:   Question [2]
       ID    Similarity Quarter Question                                  Answertext                                           
       <fct>      <dbl>   <int> <fct>                                     <fct>                                                
     1 Joy        0           1 And another question                      adsfjasljsdaf jkldfjkl                               
     2 Joy        0           2 And another question                      "dsadsj jlijsad jkldf "                              
     3 Paul       0           1 And another question                      adsfj aslj sd afs dfj ksdf                           
     4 Paul       0           2 And another question                      dsadsj jlijsad                                       
     5 Greg       0           1 And another question                      adsfjasljsdaf                                        
     6 Greg       0           2 And another question                      " asddsf asdfasd sdfasfsdf"                          
     7 Joy        1           1 this is the first question that was asked this is joys answer to this question                 
     8 Joy        1           2 this is the first question that was asked this is joys answer to this question                 
     9 Paul       0.429       1 this is the first question that was asked this is Pauls answer to this question                
    10 Paul       0.429       2 this is the first question that was asked "Pauls answer is different "                         
    11 Greg       0.667       1 this is the first question that was asked this is Gregs answer to this question nearly the same
    12 Greg       0.667       2 this is the first question that was asked this is Gregs answer to this question 
    

    The values in similarity match the ones shown in your example endproduct (note that values shown are rounded for tibble display), so it seems to be working as intended.