rstringduplicatescoding-efficiency

Identifying and counting "correct" letters in a comma delimited string in R


In this problem, I am trying to "score" a task from a computer administered questionnaire, where each person is asked to type 5, 3-letter words into a text box. The result in the data.frame looks like this:

dput(head(WRRsp, 10))

structure(list(Subject = c(100101, 100108, 100110, 100114, 100119, 
100123, 100133, 100148, 100155, 100159), WRRsp = c("Elk,Mop,Bat,Sop,Car", 
"Ely,Mop,Bat,Toe,Spy", "Bat,Mop,Spy,Elk,Top", "Spy,Bad,Toe", 
"Elk,Mop,Toe", "Eelk,Spy", "Elk,Toe,Mop,Box,Car", "Mope,Eik", 
"Ee,Elk,Mop,Bat,Fox", "E,L,K,Mop,Spy")), row.names = c(NA, -10L
), class = c("tbl_df", "tbl", "data.frame"))

The correct responses are the following

  1. elk
  2. mop
  3. bat
  4. toe
  5. spy

The idea is to create a 0-15 score taking into account points per letter for targets. For example, "Elk, Mop, Bat, Sky" would score as 11 out of 15 because "Sky" is not a correct word, but contains two of three appropriate letters. The major problem I'm encountering is that the order of the words doesn't matter, they could be entered in any order.

My first pass at this was to score based on position, so the targets would be:

col1 col2 col3
e l k
m o p
b a t
t o e
s p y

An example scoring, if doing this by hand:

Subject WRRsp 1st 2nd 3rd Total
100101 Elk,Mop,Bat,Sop,Car 4 4 3 11
100108 Ely,Mop,Bat,Toe,Spy 5 5 4 14
100110 Bat,Mop,Spy,Elk,Top 5 5 4 14
100114 Spy,Bad,Toe 3 3 2 8
100119 Elk,Mop,Toe 3 3 3 9
100123 Eelk,Spy 2 1 1 4
100133 Elk,Toe,Mop,Box,Car 4 4 3 11
100148 Mope,Eik 2 1 2 5
100155 Ee,Elk,Mop,Bat,Fox 3 4 3 10
100159 E,L,K,Mop,Spy 3 2 2 7

Some context: this is only one question in a large questionnaire designed to test for signs of cognitive impairment.

I can get really close with way too much code:

target1 <- c("e", "m", "b", "t", "s")
target2 <- c("l", "o", "a", "o", "p")
target3 <- c("k", "p", "t", "e", "y")

wrrsp_score <- WRRsp %>%
  separate_wider_delim(
    "WRRsp",
    delim = ",",
    names = c("w1", "w2", "w3", "w4", "w5"),
    too_few = "align_start"
  ) %>%
  mutate(
    col1 = paste(
      str_sub(w1, 1, 1),
      str_sub(w2, 1, 1),
      str_sub(w3, 1, 1),
      str_sub(w4, 1, 1),
      str_sub(w5, 1, 1)
    ),
    col1 = str_remove_all(col1, "NA"),
    # this stuff isn't really necessary
    col1 = str_trim(col1),
    col1 = tolower(col1),
    
    col2 = paste(
      str_sub(w1, 2, 2),
      str_sub(w2, 2, 2),
      str_sub(w3, 2, 2),
      str_sub(w4, 2, 2),
      str_sub(w5, 2, 2)
    ),
    col2 = str_remove_all(col2, "NA"),
    col2 = str_trim(col2),
    col2 = tolower(col2),
    
    col3 = paste(
      str_sub(w1, 3, 3),
      str_sub(w2, 3, 3),
      str_sub(w3, 3, 3),
      str_sub(w4, 3, 3),
      str_sub(w5, 3, 3)
    ),
    col3 = str_remove_all(col3, "NA"),
    col3 = str_trim(col3),
    col3 = tolower(col3)
  ) %>%
  mutate(
    col1_e = ifelse(grepl(target1[1], col1), 1, 0),
    col1_m = ifelse(grepl(target1[2], col1), 1, 0),
    col1_b = ifelse(grepl(target1[3], col1), 1, 0),
    col1_t = ifelse(grepl(target1[4], col1), 1, 0),
    col1_s = ifelse(grepl(target1[5], col1), 1, 0),
    col1_score = rowSums(pick(col1_e:col1_s), na.rm = TRUE),
    
    col2_l  = ifelse(grepl(target2[1], col2), 1, 0),
    col2_o  = ifelse(grepl(target2[2], col2), 1, 0),
    col2_a  = ifelse(grepl(target2[3], col2), 1, 0),
    col2_o2 = ifelse(grepl(target2[4], col2), 1, 0),
    col2_s  = ifelse(grepl(target2[5], col2), 1, 0),
    col2_score = rowSums(pick(col2_l:col2_s), na.rm = TRUE),
    
    col3_k = ifelse(grepl(target3[1], col3), 1, 0),
    col3_p = ifelse(grepl(target3[2], col3), 1, 0),
    col3_t = ifelse(grepl(target3[3], col3), 1, 0),
    col3_e = ifelse(grepl(target3[4], col3), 1, 0),
    col3_y = ifelse(grepl(target3[5], col3), 1, 0),
    col3_score = rowSums(pick(col3_k:col3_y), na.rm = TRUE)
  ) %>%
  mutate(WRRspPerCharAC = rowSums(pick(ends_with("_score")), na.rm = TRUE))

Where this fails, is with the second "o" in target2, it counts any "o" twice even if there are isn't anything to count (e.g., only two words entered). Two things: how to fix the duplicate issue, and how to make this more efficient/more streamlined, because what I have above seems overengineered and insane.


Solution

  • Here's one attempt that looks to return the correct result with your sample data. It relies on pmatch() to match the correct vector against the vectors of answers split on letter positions.

    library(stringr)
    library(dplyr)
    library(tidyr)
    library(purrr)
    
    targets <- list(c("e", "m", "b", "t", "s"),
                    c("l", "o", "a", "o", "p"),
                    c("k", "p", "t", "e", "y"))
    
    dat |>
      mutate(score = tolower(WRRsp) |>
               str_split(",") |>
               map(
                 ~ str_split_fixed(.x, "", n = Inf)[, 1:3] |>
                   asplit(2) |>
                   map2_dbl(targets,
                            ~ sum(pmatch(.y, .x, nomatch = 0) > 0))
               )) |>
      unnest_wider(score, names_sep = "_") |>
      mutate(Total = rowSums(pick(starts_with("score"))))
    
    # A tibble: 10 × 6
       Subject WRRsp               score_1 score_2 score_3 Total
         <dbl> <chr>                 <dbl>   <dbl>   <dbl> <dbl>
     1  100101 Elk,Mop,Bat,Sop,Car       4       4       3    11
     2  100108 Ely,Mop,Bat,Toe,Spy       5       5       4    14
     3  100110 Bat,Mop,Spy,Elk,Top       5       5       4    14
     4  100114 Spy,Bad,Toe               3       3       2     8
     5  100119 Elk,Mop,Toe               3       3       3     9
     6  100123 Eelk,Spy                  2       1       1     4
     7  100133 Elk,Toe,Mop,Box,Car       4       4       3    11
     8  100148 Mope,Eik                  2       1       2     5
     9  100155 Ee,Elk,Mop,Bat,Fox        3       4       3    10
    10  100159 E,L,K,Mop,Spy             3       2       2     7