r, text, tidyverse, tidytext

How to Keep Group Associated when Separating Free Text Fields in R


I'm trying to separate free text fields into individual words/phrases while keeping their association with a group, so I can stratify by that group when graphing later on.

My original code is below. I'm trying to add a "year" variable so I can stratify the research interests by the year each student is in. I'd like a total n for each word, as well as an n for each year.

Example of my data set:

Please.list.your.research.interests    Year
Vaccines, TB, HIV                      1st year
TB, Chronic Diseases                   2nd year

library(tidyverse)
library(tidytext)

data_research_words <- unlist(strsplit(data_research$Please.list.your.research.interests, ", "))

text_df <- tibble(line=1:97, data_research_words)

text_count <- text_df %>% 
  count(data_research_words, sort=TRUE)

Solution

  • Something like this?

    library(tidyverse)
    
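    # split on ", " to create a separate row for each list element
    # (df stands for the question's data frame, e.g. data_research;
    # separate_longer_delim() requires tidyr >= 1.3.0)
    df <- df |>
      separate_longer_delim("Please.list.your.research.interests", ", ")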
    
    # then get the count for each research interest
    df |> count(Please.list.your.research.interests)
    
    # ...and the same, but separated also by years
    df |> count(Year, Please.list.your.research.interests)
    

    Output:

      Please.list.your.research.interests n
    1                    Chronic Diseases 1
    2                                 HIV 1
    3                                  TB 2
    4                            Vaccines 1
    
          Year Please.list.your.research.interests n
    1 1st year                                 HIV 1
    2 1st year                                  TB 1
    3 1st year                            Vaccines 1
    4 2nd year                    Chronic Diseases 1
    5 2nd year                                  TB 1
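
Since the question asks for a total n per word alongside the n for each year, the two counts above can also be combined into one table. The following is only a minimal sketch under some assumptions: it rebuilds the two sample rows from the question, needs tidyr >= 1.3.0 and R >= 4.1, and the names `interest`, `total`, and `interest_counts` are illustrative choices, not part of the original data.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question (only the two rows shown there)
data_research <- tibble(
  Please.list.your.research.interests = c("Vaccines, TB, HIV",
                                          "TB, Chronic Diseases"),
  Year = c("1st year", "2nd year")
)

interest_counts <- data_research |>
  # one row per research interest; Year is carried along on every new row
  separate_longer_delim(Please.list.your.research.interests, delim = ", ") |>
  # count each interest within each year (renamed to `interest` for brevity)
  count(interest = Please.list.your.research.interests, Year) |>
  # one column per year, with 0 where an interest never appears in that year
  pivot_wider(names_from = Year, values_from = n, values_fill = 0) |>
  # total across all year columns (every column except `interest`)
  mutate(total = rowSums(across(-interest))) |>
  arrange(desc(total))

print(interest_counts)
```

For plotting, the long output of `count(Year, Please.list.your.research.interests)` is usually the easier shape to work with; the wide table here is mainly convenient for reading the totals side by side.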