r, nlp, text-mining, data-wrangling, quanteda

Creating a token count by date and co-occurrence term proportion by date using quanteda


I have a fairly large dataset containing reviews of utility services from customers all over the UK. This is a small sample of what the data looks like:

df <- data.frame (text  = c("The investors and their supporters shall invest and do something mostly invest",
         " Shall we tell the investors to invest ?",  "Investors shall invest.",
         "Investors may sometimes invest","spend what Investor Do"),
                  date = c("10/12/2022", "10/12/2022", "10/12/2022","11/12/2022","12/12/2022"))

What I want is to be able to count the frequency of terms/words/tokens by date.

For instance, the word invest appears 6 times in total, and for the date 10/12/2022 its count is 4. I want to use the quanteda library (since it is so powerful) to compute these counts and plot them over date.

I also want to plot the association or co-occurrence of the words investor and invest over date.

For instance, in this example we have 5 reviews, and in 4 of those 5 both the words invest and investor are present; I'd like to plot that percentage over date as well. Is that possible? What options does the quanteda library have that can perform this task? Would it also be possible to find, let's say, a minimum percentage of 0.25 for the most frequent words that appear when "invest" appears?

To achieve the first point I started with the following code:

df %>% 
  corpus(text_field="text") %>% 
  dfm() %>%
  textstat_frequency(10)

which gives:

      feature frequency rank docfreq group
1      invest         6    1       5   all
2   investors         4    2       4   all
3       shall         3    3       3   all
4         the         2    4       2   all
5         and         2    4       1   all
6          do         2    4       2   all
7       their         1    7       1   all
8  supporters         1    7       1   all
9   something         1    7       1   all
10         we         1    7       1   all
Warning message:
'dfm.corpus()' is deprecated. Use 'tokens()' first. 

How would I go about plotting the frequency of these words over the date column? I read in the documentation that one can group, but I have had no luck in doing so.

And for the second question I am not sure which quanteda function to use, but I am trying to mirror tm::findAssocs() from the tm library.


Solution

  • Answer to your first question:

    The dates are stored in the docvars part of your corpus. They can then be used within textstat_frequency via the groups option.

    dat <- data.frame (text  = c("The investors and their supporters shall invest and do something mostly invest",
                                " Shall we tell the investors to invest ?",  "Investors shall invest.",
                                "Investors may sometimes invest","spend what Investor Do"),
                      date = c("10/12/2022", "10/12/2022", "10/12/2022","11/12/2022","12/12/2022"))
    
    
    library(dplyr)
    library(quanteda)
    library(quanteda.textstats)
    
    dat %>% 
      corpus(text_field="text") %>% 
      tokens() %>%
      dfm() %>% 
      textstat_frequency(groups = date)
    
          feature frequency rank docfreq      group
    1      invest         4    1       3 10/12/2022
    2   investors         3    2       3 10/12/2022
    3       shall         3    2       3 10/12/2022
    4         the         2    4       2 10/12/2022
    5         and         2    4       1 10/12/2022
    6       their         1    6       1 10/12/2022
    7  supporters         1    6       1 10/12/2022
    8          do         1    6       1 10/12/2022
    9   something         1    6       1 10/12/2022
    10     mostly         1    6       1 10/12/2022
    11         we         1    6       1 10/12/2022
    12       tell         1    6       1 10/12/2022
    13         to         1    6       1 10/12/2022
    14          ?         1    6       1 10/12/2022
    15          .         1    6       1 10/12/2022
    16  investors         1    1       1 11/12/2022
    17     invest         1    1       1 11/12/2022
    18        may         1    1       1 11/12/2022
    19  sometimes         1    1       1 11/12/2022
    20         do         1    1       1 12/12/2022
    21      spend         1    1       1 12/12/2022
    22       what         1    1       1 12/12/2022
    23   investor         1    1       1 12/12/2022
    

    You now have access to the frequency per day.
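
    To plot these counts over time, here is one possible sketch with ggplot2. The selected terms, the date format `"%d/%m/%Y"`, and the plot styling are illustrative assumptions, not the only way to do it:

    ```r
    library(dplyr)
    library(ggplot2)
    library(quanteda)
    library(quanteda.textstats)

    dat <- data.frame(
      text = c("The investors and their supporters shall invest and do something mostly invest",
               " Shall we tell the investors to invest ?", "Investors shall invest.",
               "Investors may sometimes invest", "spend what Investor Do"),
      date = c("10/12/2022", "10/12/2022", "10/12/2022", "11/12/2022", "12/12/2022")
    )

    # Frequencies grouped by date, as in the pipeline above
    freq <- dat %>%
      corpus(text_field = "text") %>%
      tokens() %>%
      dfm() %>%
      textstat_frequency(groups = date)

    # Plot counts of a few terms of interest per day
    p <- freq %>%
      filter(feature %in% c("invest", "investors")) %>%
      mutate(day = as.Date(group, format = "%d/%m/%Y")) %>%
      ggplot(aes(x = day, y = frequency, colour = feature)) +
      geom_line() +
      geom_point() +
      labs(x = "Date", y = "Frequency", colour = "Term")
    p
    ```

    With only a few dates the lines are short, but the same code scales to the full dataset.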

    As for question 2, I think you can use textstat_simil, something like below. It gives somewhat different answers than tm::findAssocs(), usually returning more features, so I'm not completely sure this is the correct answer. Maybe someone from the quanteda team can confirm or deny this.

    my_dfm <- dat %>% 
      corpus(text_field="text") %>% 
      tokens() %>%
      dfm()
    
    textstat_simil(my_dfm, 
                   my_dfm[, c("investor")], 
                   method = "correlation", 
                   margin = "features",
                   min_simil = 0.7)
    
    textstat_simil object; method = "correlation"
               investor
    the               .
    investors         .
    and               .
    their             .
    supporters        .
    shall             .
    invest            .
    do                .
    something         .
    mostly            .
    we                .
    tell              .
    to                .
    ?                 .
    .                 .
    may               .
    sometimes         .
    spend             1
    what              1
    investor          1
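
    For the "percentage of reviews per date in which both terms appear" part of the question, one possible approach avoids textstat_simil entirely and uses plain dfm logic. The glob pattern "investor*" to cover the singular and plural forms is an assumption; adjust it to your data:

    ```r
    library(dplyr)
    library(quanteda)

    dat <- data.frame(
      text = c("The investors and their supporters shall invest and do something mostly invest",
               " Shall we tell the investors to invest ?", "Investors shall invest.",
               "Investors may sometimes invest", "spend what Investor Do"),
      date = c("10/12/2022", "10/12/2022", "10/12/2022", "11/12/2022", "12/12/2022")
    )

    my_dfm <- dat %>%
      corpus(text_field = "text") %>%
      tokens() %>%
      dfm()

    # Flag reviews that contain each term at least once
    has_invest   <- rowSums(dfm_select(my_dfm, pattern = "invest")) > 0
    has_investor <- rowSums(dfm_select(my_dfm, pattern = "investor*")) > 0

    # Share of reviews per date in which both terms co-occur
    co <- data.frame(date = docvars(my_dfm, "date"),
                     both = as.logical(has_invest & has_investor))
    co %>%
      group_by(date) %>%
      summarise(prop_both = mean(both))
    ```

    On the sample data this gives 4 of 5 reviews containing both terms overall, matching the proportion described in the question, and the per-date shares can be plotted the same way as the frequencies above.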
    

    You can save the outcome of textstat_simil as a data.frame or a list if you want to, with as.data.frame() or as.list().
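
    For example, converting the result above to a long-format data frame (the 0.7 threshold is just the one used earlier):

    ```r
    library(dplyr)
    library(quanteda)
    library(quanteda.textstats)

    dat <- data.frame(
      text = c("The investors and their supporters shall invest and do something mostly invest",
               " Shall we tell the investors to invest ?", "Investors shall invest.",
               "Investors may sometimes invest", "spend what Investor Do"),
      date = c("10/12/2022", "10/12/2022", "10/12/2022", "11/12/2022", "12/12/2022")
    )

    my_dfm <- dat %>%
      corpus(text_field = "text") %>%
      tokens() %>%
      dfm()

    sim <- textstat_simil(my_dfm,
                          my_dfm[, "investor"],
                          method = "correlation",
                          margin = "features",
                          min_simil = 0.7)

    # One row per feature pair that clears the threshold
    sim_df <- as.data.frame(sim)
    sim_df
    ```

    The data frame form is often easier to filter, sort, or join back onto the grouped frequencies than the printed similarity matrix.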