Tags: r, text, nlp, text-mining, corpus

Most frequent phrases from text data in R


Does anyone here have experience with identifying the most common phrases (3 to 7 consecutive words)? I understand that most frequency analysis focuses on the most frequent/common words (along with plotting a word cloud) rather than phrases.

# Assuming a particular column in a data frame (df) with n rows that is all text data.
# I'm not able to provide sample data, as using dput() on a large text file
# won't be feasible here.

library(tm)  # Corpus() and VectorSource() come from the tm package
Text <- df$Text_Column
docs <- Corpus(VectorSource(Text))
...

Thanks in advance!


Solution

  • You have several options to do this in R. Let's grab some data first. I use the books by Jane Austen from the janeaustenr package and do some cleaning so that each paragraph ends up in a separate row:

    library(janeaustenr)
    library(tidyverse)
    books <- austen_books() %>% 
      # number paragraphs: the counter increases at every blank line that follows a non-blank line
      mutate(paragraph = cumsum(text == "" & lag(text) != "")) %>% 
      group_by(paragraph) %>% 
      # collapse the lines of each paragraph into a single string, keeping the book name
      summarise(book = head(book, 1),
                text = trimws(paste(text, collapse = " ")),
                .groups = "drop")
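
    To run the same pipelines on your own data, all you need is a tibble with an id column and a text column. A minimal sketch, assuming your column is df$Text_Column as in the question; the name books is reused so the snippets below work unchanged:

    # hypothetical adaptation: treat each row of your data frame as one document
    books <- tibble(paragraph = seq_len(nrow(df)),
                    text = df$Text_Column)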
    

    With tidytext:

    library(tidytext)
    # using multiple values for n is not directly implemented in tidytext, so map over 3:7 and bind the results
    map_df(3L:7L, ~ unnest_tokens(books, ngram, text, token = "ngrams", n = .x)) %>%
      count(ngram) %>%
      filter(!is.na(ngram)) %>% 
      slice_max(n, n = 10)
    #> # A tibble: 10 × 2
    #>    ngram               n
    #>    <chr>           <int>
    #>  1 i am sure         415
    #>  2 i do not          412
    #>  3 she could not     328
    #>  4 it would be       258
    #>  5 in the world      247
    #>  6 as soon as        236
    #>  7 a great deal      214
    #>  8 would have been   211
    #>  9 she had been      203
    #> 10 it was a          202
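
    Since the question also mentions plotting, this result is easy to visualize with ggplot2 (already loaded via the tidyverse). A minimal sketch, assuming the pipeline above is stored in top_ngrams (a name introduced here for illustration):

    # hypothetical: store the top phrases, then draw a simple bar chart
    top_ngrams <- map_df(3L:7L, ~ unnest_tokens(books, ngram, text, token = "ngrams", n = .x)) %>%
      count(ngram) %>%
      filter(!is.na(ngram)) %>% 
      slice_max(n, n = 10)

    ggplot(top_ngrams, aes(n, reorder(ngram, n))) +
      geom_col() +
      labs(x = "count", y = NULL)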
    
    

    With quanteda:

    library(quanteda)
    books %>% 
      corpus(docid_field = "paragraph",
             text_field = "text") %>% 
      tokens(remove_punct = TRUE,
             remove_symbols = TRUE) %>% 
      tokens_ngrams(n = 3L:7L) %>%
      dfm() %>% 
      topfeatures(n = 10) %>% 
      enframe()
    #> # A tibble: 10 × 2
    #>    name            value
    #>    <chr>           <dbl>
    #>  1 i_am_sure         415
    #>  2 i_do_not          412
    #>  3 she_could_not     328
    #>  4 it_would_be       258
    #>  5 in_the_world      247
    #>  6 as_soon_as        236
    #>  7 a_great_deal      214
    #>  8 would_have_been   211
    #>  9 she_had_been      203
    #> 10 it_was_a          202
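
    The underscores come from quanteda's default n-gram concatenator. If you prefer space-separated phrases like the other outputs, tokens_ngrams() takes a concatenator argument; a small variation of the pipeline above:

    books %>% 
      corpus(docid_field = "paragraph",
             text_field = "text") %>% 
      tokens(remove_punct = TRUE,
             remove_symbols = TRUE) %>% 
      tokens_ngrams(n = 3L:7L, concatenator = " ") %>% # join words with spaces instead of "_"
      dfm() %>% 
      topfeatures(n = 10) %>% 
      enframe()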
    

    With text2vec:

    library(text2vec)
    itoken(books$text, tolower, word_tokenizer) %>% 
      create_vocabulary(ngram = c(3L, 7L), sep_ngram = " ") %>% 
      filter(str_detect(term, "[[:alpha:]]")) %>% # keep terms with at least one alphabetic character
      slice_max(term_count, n = 10)
    #> Number of docs: 10293 
    #> 0 stopwords:  ... 
    #> ngram_min = 3; ngram_max = 7 
    #> Vocabulary: 
    #>                term term_count doc_count
    #>  1:       i am sure        415       384
    #>  2:        i do not        412       363
    #>  3:   she could not        328       288
    #>  4:     it would be        258       233
    #>  5:    in the world        247       234
    #>  6:      as soon as        236       233
    #>  7:    a great deal        214       209
    #>  8: would have been        211       192
    #>  9:    she had been        203       179
    #> 10:        it was a        202       194
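
    If you want every phrase above a frequency threshold rather than a fixed top 10, the vocabulary can be pruned before inspecting it. A minimal sketch, assuming an arbitrary cut-off of 100 occurrences:

    itoken(books$text, tolower, word_tokenizer) %>% 
      create_vocabulary(ngram = c(3L, 7L), sep_ngram = " ") %>% 
      prune_vocabulary(term_count_min = 100) %>% # drop phrases seen fewer than 100 times
      filter(str_detect(term, "[[:alpha:]]")) %>% 
      arrange(desc(term_count))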
    

    Created on 2022-08-03 by the reprex package (v2.0.1)