I have created the following dataframe consisting of a few e-mail subject lines.
df <- data.frame(subject = c('Free ! Free! Free ! Clear Cover with New Phone',
                             'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'))
I have created a list of frequent words derived from the above dataframe. I have added these keywords to the dataframe as columns and dummy coded them as 0.
most_freq_words <- c('Free', 'New', 'Limited', 'Offer')
Subject                                                           Free  New  Limited  Offer
'Free Free Free! Clear Cover with New Phone'                      0     0    0        0
'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'  0     0    0        0
I want to obtain a frequency count of each of these words in every e-mail subject. The output should be as follows:
Subject                                                           Free  New  Limited  Offer
'Free Free Free! Clear Cover with New Phone'                      3     1    0        0
'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'  0     1    1        2
I have tried the following code
for (i in seq_along(most_freq_words)) {
  df[[most_freq_words[i]]] <- as.numeric(grepl(tolower(most_freq_words[i]),
                                               tolower(df$subject)))
}
However, this only tells me whether the word is present in the sentence, not how many times it occurs. I need the output given above. Could someone please help me?
Here is another option with tidyverse. We use map to loop over 'most_freq_words', get the count of each word in the 'subject' column of 'df' with str_count, convert it to a tibble, set the column name from 'most_freq_words', and bind the columns with the original dataset 'df'.
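library(tidyverse)
most_freq_words %>%
  # for each word, count its (case-sensitive) occurrences in every subject line
  map(~ str_count(df$subject, .x) %>%
        as_tibble %>%      # one-column tibble holding the counts
        set_names(.x)) %>% # name that column after the word
  bind_cols(df, .)         # attach the count columns to the original data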
# subject Free New Limited Offer
#1 Free ! Free! Free ! Clear Cover with New Phone 3 1 0 0
#2 Offer ! Buy New phone and get earphone at 1000. Limited Offer! 0 1 1 2
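If loading tidyverse is not an option, the same idea can be sketched in base R. This is only a minimal sketch, not the approach used above: count_words is a hypothetical helper name, and the ignore_case argument is an assumption added so the counting can mirror the tolower() comparison from the question. Each word is treated as a regular expression, so it also matches inside longer words.

```r
# Base R sketch, no packages required.
# count_words is a hypothetical helper, not part of any library.
df <- data.frame(subject = c('Free ! Free! Free ! Clear Cover with New Phone',
                             'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'),
                 stringsAsFactors = FALSE)
most_freq_words <- c('Free', 'New', 'Limited', 'Offer')

count_words <- function(text, words, ignore_case = FALSE) {
  counts <- lapply(words, function(w) {
    # gregexpr finds every match of w in each element of text;
    # regmatches extracts them, and lengths gives the number per element
    hits <- gregexpr(w, text, ignore.case = ignore_case)
    lengths(regmatches(text, hits))
  })
  names(counts) <- words
  as.data.frame(counts)  # one integer column per word
}

# Attach the count columns to the original data
result <- cbind(df, count_words(df$subject, most_freq_words))
print(result)
```

Passing ignore_case = TRUE counts 'free' and 'FREE' as well, which may be closer to what the loop in the question intended.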