rdifferencepairwiserowwise

group_by row wise distance between multiple columns


Posting an updated question from my last post as I think my requested output was a bit too complicated group_by edit distance between rows over multiple columns. Here, I have simplified the desired output.

I have the following data frame.

Input:

class   id  q1  q2  q3  q4
Ali     12  1   2   3   3
Tom     16  1   2   4   2
Tom     18  1   2   3   4
Ali     24  2   2   4   3
Ali     35  2   2   4   3
Tom     36  1   2   4   2
  1. class indicates the teacher's name,
  2. id indicates the student user ID, and,
  3. q1, q2, q3 and q4 indicate marks on different test questions

Requirement:

I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated. I realise it's a very naive assumption - I have just made up this example so that I can use this code in another context.

For example, looking at the class Tom, the IDs 16 and 36 have the same scores across the columns q1, q2, q3 and q4, which means that their score difference is zero (the difference between corresponding entries is zero).

Therefore, I want to calculate the row-wise sum of the absolute difference between multiple columns where the rows are grouped by class name. Based on this absolute difference, I want to include additional columns such as diff0, diff1, diff2 and diff3:

  1. diff0: For a given ID, this column lists all the other id's that have a pairwise difference of zero with this student, given that they belong to the same class.
  2. diff1: For a given ID, this column lists all the other id's that have a pairwise difference of one with this student, given that they belong to the same class. and lastly,
  3. diff2: For a given ID, this column lists all the other id's that have a pairwise difference of three with this students, given that they belong to the same class etc.

My desired output is as below:

class   id  q1  q2  q3  q4  diff0   diff1   diff2   diff3
Ali     12  1   2   3   3   NA      NA      24,35   NA
Tom     16  1   2   4   2   36      NA      NA      18
Tom     18  1   2   3   4   NA      NA      NA      16,36
Ali     24  2   2   4   3   35      NA      12      NA
Ali     35  2   2   4   3   24      NA      12      NA
Tom     36  1   2   4   2   16      NA      NA      18

Is it possible to achieve this using dplyr?

Related posts:

I have tried to look for related solutions, but none of them addresses the exact problem that I am facing, e.g.,

R - Calculate the differences in the column values between rows/ observations (all combinations), Creating new field that shows stringdist between two columns in R?, R Calculating difference between values in a column, and, R Calculate the difference between values from one to all the other columns

dput()

For your convenience, I am sharing data dput():

structure(list(class = 
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"), 
id = c(12L, 16L, 18L, 24L, 35L, 36L), 
q1 = c(1L, 1L, 1L, 2L, 2L, 1L), 
q2 = c(2L, 2L, 2L, 2L, 2L, 2L), 
q3 = c(3L, 4L, 3L, 4L, 4L, 4L), 
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")

Any help on this would be greatly appreciated!


Solution

  • A possible solution using dist to calculate distances with the manhattan method.

    library(dplyr)
    library(tidyr)
    
    df |>
        group_by(class) |>
        group_modify(~{
            ## prepare the dataframe for pairwise dist calculation using `dist'
            ## add rownames to dataframe so that results can be easily
            ## delt with after
            tibble::column_to_rownames(.x, "id") |>
                select(q1:q4) |>
                dist(method = "manhattan") |>
                as.matrix() |>
                ## transform the distance matrix into a <<long>> dataframe
                ## consisting of pairwise comparisons between 
                ## `id' and `other'
                as_tibble(rownames = "id") |>
                pivot_longer(-id,
                             names_to = "other",
                             values_to = "diff") |>
                ## filter out comparisons of `id' with itself 
                filter(id != other) |>
                ## expand the dataframe as the cartesian product of 
                ## its rows and the wanted `cutoff's (0:3)   
                crossing(cutoff = 0:3) |>
                group_by(cutoff,id) |>
                ## for each cutoff and id combine the values in `other`
                ## that match cutoff and transform to the wanted <<wide>> format              
                summarize(other = paste(other[diff == cutoff], collapse = ",")) |>
                pivot_wider(names_from = cutoff,
                            values_from = other,
                            names_prefix = "diff")            
        }) |>
        ## this step is needed so that `id' has the same type of the 
        ## original dataframe
        mutate(id = as.integer(id)) |>
        right_join(df)
    
    ##> # A tibble: 6 × 10
    ##> # Groups:   class [2]
    ##>   class    id diff0 diff1 diff2   diff3      q1    q2    q3    q4
    ##>   <chr> <int> <chr> <chr> <chr>   <chr>   <int> <int> <int> <int>
    ##> 1 Ali      12 ""    ""    "24,35" ""          1     2     3     3
    ##> 2 Ali      24 "35"  ""    "12"    ""          2     2     4     3
    ##> 3 Ali      35 "24"  ""    "12"    ""          2     2     4     3
    ##> 4 Tom      16 "36"  ""    ""      "18"        1     2     4     2
    ##> 5 Tom      18 ""    ""    ""      "16,36"     1     2     3     4
    ##> 6 Tom      36 "16"  ""    ""      "18"        1     2     4     2