Tags: r, frequency, text-mining, grepl, term-document-matrix

Create a frequency table in R using a term-document matrix


I have created the following dataframe consisting of a few e-mail subject lines.

 df <- data.frame(subject=c('Free ! Free! Free ! Clear Cover with New Phone',
                            'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'))

I have created a list of frequent words derived from the above dataframe. I have added these keywords to the dataframe as columns and initialised each of them to 0.

 most_freq_words <- c('Free', 'New', 'Limited', 'Offer')



 Subject                                                           Free New Limited Offer
 'Free ! Free! Free ! Clear Cover with New Phone'                     0   0       0     0
 'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'     0   0       0     0

I want to obtain a frequency count of each of these words in every e-mail subject. The output should be as follows:

 Subject                                                           Free New Limited Offer
 'Free ! Free! Free ! Clear Cover with New Phone'                     3   1       0     0
 'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'     0   1       1     2

I have tried the following code

 for (i in seq_along(most_freq_words)) {
   df[[most_freq_words[i]]] <- as.numeric(grepl(tolower(most_freq_words[i]),
                                                tolower(df$subject)))
 }

This, however, only tells me whether the word is present in the subject, not how many times it occurs. I need the counts shown in the output above. Could someone help me with this?


Solution

  • Here is an option with the tidyverse. We use map() to loop over most_freq_words, count the occurrences of each word in the subject column of df with str_count(), convert each result to a tibble, name its column after the word with set_names(), and bind the resulting columns to the original dataset df. Note that str_count() is case-sensitive by default, so 'New' does not match 'new'.

    library(tidyverse)
    most_freq_words %>% 
          map(~ str_count(df$subject, .x) %>%
                        as_tibble %>% 
                        set_names(.x)) %>% 
          bind_cols(df, .)
    #                                                         subject Free New Limited Offer
    #1                 Free ! Free! Free ! Clear Cover with New Phone    3   1       0     0
    #2 Offer ! Buy New phone and get earphone at 1000. Limited Offer!    0   1       1     2
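
  • For comparison, the same counts can probably be obtained in base R without extra packages. The sketch below is one possible approach, not the method used above: it swaps grepl() for gregexpr(), which reports every match rather than only whether one exists. The helper name count_word is hypothetical, and the matching is made case-insensitive on the assumption that this is what the tolower() calls in the question were aiming for.

```r
# Sample data from the question
df <- data.frame(
  subject = c('Free ! Free! Free ! Clear Cover with New Phone',
              'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'),
  stringsAsFactors = FALSE
)
most_freq_words <- c('Free', 'New', 'Limited', 'Offer')

# count_word() is a hypothetical helper, not part of any package.
# gregexpr() locates every match of `word` in each element of `text`,
# regmatches() extracts the matched substrings as a list,
# and lengths() counts the matches per element (0 when there are none).
count_word <- function(word, text) {
  lengths(regmatches(text, gregexpr(word, text, ignore.case = TRUE)))
}

# Add one count column per keyword
for (w in most_freq_words) {
  df[[w]] <- count_word(w, df$subject)
}

df
```

    Each keyword is treated as a regular expression and matched as a substring, so a word such as 'New' would also be counted inside 'Newer'. Wrapping the pattern in word boundaries, for example paste0('\\b', word, '\\b'), is one way to restrict the count to whole words if that matters for the data.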