rdplyrrowid

Group By and Create an Column Index in R


I have a data set of users who attempt a set of math problems. They are randomly given a hint(s) to solve the problem (hint.flag). I want to find out how soon a user gets the "correct" answer (Correctness) for each problem after seeing the FIRST hint within a problem. To begin with I would need to create some kind of index to identify which post hint attempt they are on.

data.input <- read_table2("user.id  problem.id  correctness hint.flag
540995  A   FALSE   1
540995  A   FALSE   0
540995  A   FALSE   1
540995  B   FALSE   1
540995  B   FALSE   1
540995  B   TRUE    0
540995  C   FALSE   1
540995  C   FALSE   1
540995  C   TRUE    0
540995  D   TRUE    0
565662  A   TRUE    0
565662  A   FALSE   1
565662  A   FALSE   0
565662  A   FALSE   1
565662  A   FALSE   1
565662  A   FALSE   0
565662  A   FALSE   0
565662  A   FALSE   0")

The column I am trying to impute is "post.hint.attempt" (see desired.output).

desired.output <- read_table2("user.id  problem.id  correctness hint.flag   post.hint.attempt

540995  A   FALSE   1   
540995  A   FALSE   0   first
540995  A   FALSE   1   second
540995  B   FALSE   1   
540995  B   FALSE   1   first
540995  B   TRUE    0   second
540995  C   FALSE   1   
540995  C   FALSE   1   first
540995  C   TRUE    0   second
540995  D   TRUE    0   
565662  A   TRUE    0   
565662  A   FALSE   1   
565662  A   FALSE   0   first
565662  A   FALSE   1   second
565662  A   FALSE   1   third
565662  A   FALSE   0   fourth
565662  A   FALSE   0   fifth
565662  A   FALSE   0   sixth 
")

I have no idea how to even begin finding this... I think row_id might be useful, but the need to lag the attempt number by the occurrence of a hint is confusing to me. Any help would be amazing!!!


Solution

  • Just in case, here is another approach with tidyverse. It involves checking the row_number within a given problem, and add a counter value if it comes after the first hint. A cumulative sum of this counter is then added.

    (Note you will get a warning if a problem does not have any hints.)

    library(tidyverse)
    
    data.input %>%
      group_by(user.id, problem.id) %>%
      mutate(count = ifelse(row_number() > min(which(hint.flag == 1)), 1, 0),
             post.hint.attempt = cumsum(count)) %>%
      select(-count) 
    

    Output

    # A tibble: 18 x 5
    # Groups:   user.id, problem.id [5]
       user.id problem.id correctness hint.flag post.hint.attempt
         <dbl> <chr>      <lgl>           <dbl>             <dbl>
     1  540995 A          FALSE               1                 0
     2  540995 A          FALSE               0                 1
     3  540995 A          FALSE               1                 2
     4  540995 B          FALSE               1                 0
     5  540995 B          FALSE               1                 1
     6  540995 B          TRUE                0                 2
     7  540995 C          FALSE               1                 0
     8  540995 C          FALSE               1                 1
     9  540995 C          TRUE                0                 2
    10  540995 D          TRUE                0                 0
    11  565662 A          TRUE                0                 0
    12  565662 A          FALSE               1                 0
    13  565662 A          FALSE               0                 1
    14  565662 A          FALSE               1                 2
    15  565662 A          FALSE               1                 3
    16  565662 A          FALSE               0                 4
    17  565662 A          FALSE               0                 5
    18  565662 A          FALSE               0                 6