r dataframe frequency multiple-choice

How to compute the frequency of each choice in a multiple-choice question when the choices are written in different ways?


I have this example dataframe (my real dataframe is larger, but this one covers all the cases I encounter in it):

df = data.frame(ingridents = c('bread', 'BREAD', 'Bread orange juice',
                               'orange juice', 'Apple', 'apple bread, orange juice',
                               'bread Apple ORANGE JUICE'),
                Frequency = c(10,3,5,4,2,3,1) )

In this df dataframe we can see that:

the ingredient bread is written as bread, BREAD and Bread (alone or together with other answers). The same goes for the ingredient apple.

the ingredient orange juice is written in several forms: in one group of responses it is preceded by a comma, and in another there is no comma. Also, I want R to treat orange juice as a single expression, not as orange and juice separately.

The objective is to create another dataframe with each of these 3 ingredients and their total frequencies, as follows:

     ingridents Frequency
1        BREAD        22
2 ORANGE JUICE        13
3        APPLE         6

How can I write code in R that recognises each response and computes its total frequency (whether it is written in capital or small letters, and whether it is a two-word expression such as orange juice)?


Solution

  • Here is one way to do it. First, we do some string preprocessing (convert all strings to upper case, remove commas, and join ORANGE JUICE into a single token with an underscore), then split on spaces and sum the frequencies. Note that the native pipe |> requires R 4.1 or later:

    library(tidyr)
    library(dplyr)
    library(stringr)
    
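    df |>
      # Normalise the text: upper-case, drop commas, and protect the
      # two-word ingredient so that splitting on spaces keeps it whole
      mutate(ingridents = ingridents |>
                          toupper() |>
                          str_remove_all(",") |>
                          str_replace_all("ORANGE JUICE", "ORANGE_JUICE")) |>
      # One row per ingredient
      separate_rows(ingridents, sep = " ") |>
      # Sum Frequency per ingredient
      count(ingridents, wt = Frequency) |>
      arrange(desc(n)) |>
      # Restore the space in the two-word ingredient
      mutate(ingridents = str_replace_all(ingridents, "ORANGE_JUICE", "ORANGE JUICE"))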
    

    Output:

    # A tibble: 3 × 2
      ingridents       n
      <chr>        <dbl>
    1 BREAD           22
    2 ORANGE JUICE    13
    3 APPLE            6
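
  • If the full list of valid choices is known in advance, a base R variant that needs no placeholder token might look like the sketch below. The `choices` vector and the `totals` and `result` names are assumptions introduced here for illustration; they are not part of the question or of the answer above. Each choice is matched as a literal, case-insensitive substring, so it would also count a longer word that contains it (for example APPLE inside PINEAPPLE), which does not occur in the example data.

```r
# Example data from the question (column name spelled as in the original)
df <- data.frame(
  ingridents = c("bread", "BREAD", "Bread orange juice",
                 "orange juice", "Apple", "apple bread, orange juice",
                 "bread Apple ORANGE JUICE"),
  Frequency = c(10, 3, 5, 4, 2, 3, 1)
)

# Assumption: the complete set of valid choices is known beforehand
choices <- c("BREAD", "ORANGE JUICE", "APPLE")

# For each choice, sum the Frequency of every row whose upper-cased text
# contains that choice as a literal substring
totals <- sapply(choices, function(ch) {
  hit <- grepl(ch, toupper(df$ingridents), fixed = TRUE)
  sum(df$Frequency[hit])
})

# Assemble the result and sort by descending total frequency
result <- data.frame(ingridents = names(totals),
                     Frequency = unname(totals))
result <- result[order(-result$Frequency), ]
rownames(result) <- NULL
result
```

    On the example data this should give the totals requested in the question (BREAD 22, ORANGE JUICE 13, APPLE 6). Because multi-word choices are listed explicitly, adding another one only means extending `choices`.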