Posting an updated question from my last post as I think my requested output was a bit too complicated group_by edit distance between rows over multiple columns. Here, I have simplified the desired output.
I have the following data frame.
class id q1 q2 q3 q4
Ali 12 1 2 3 3
Tom 16 1 2 4 2
Tom 18 1 2 3 4
Ali 24 2 2 4 3
Ali 35 2 2 4 3
Tom 36 1 2 4 2
I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated. I realise it's a very naive assumption - I have just made up this example so that I can use this code in another context.
For example, looking at the class Tom, the IDs 16 and 36 have the same scores across the columns q1, q2, q3 and q4, which means that their score difference is zero (the difference between corresponding entries is zero).
Therefore, I want to calculate the row-wise sum of the absolute difference between multiple columns where the rows are grouped by class name. Based on this absolute difference, I want to include additional columns such as diff0, diff1, diff2 and diff3:
My desired output is as below:
class id q1 q2 q3 q4 diff0 diff1 diff2 diff3
Ali 12 1 2 3 3 NA NA 24,35 NA
Tom 16 1 2 4 2 36 NA NA 18
Tom 18 1 2 3 4 NA NA NA 16,36
Ali 24 2 2 4 3 35 NA 12 NA
Ali 35 2 2 4 3 24 NA 12 NA
Tom 36 1 2 4 2 16 NA NA 18
Is it possible to achieve this using dplyr?
I have tried to look for related solutions, but none of them addresses the exact problem that I am facing, e.g.,
R - Calculate the differences in the column values between rows/ observations (all combinations), Creating new field that shows stringdist between two columns in R?, R Calculating difference between values in a column, and, R Calculate the difference between values from one to all the other columns
For your convenience, I am sharing data dput():
structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
Any help on this would be greatly appreciated!
A possible solution using dist
to calculate distances with the manhattan method.
library(dplyr)
library(tidyr)
df |>
group_by(class) |>
group_modify(~{
## prepare the dataframe for pairwise dist calculation using `dist'
## add rownames to dataframe so that results can be easily
## delt with after
tibble::column_to_rownames(.x, "id") |>
select(q1:q4) |>
dist(method = "manhattan") |>
as.matrix() |>
## transform the distance matrix into a <<long>> dataframe
## consisting of pairwise comparisons between
## `id' and `other'
as_tibble(rownames = "id") |>
pivot_longer(-id,
names_to = "other",
values_to = "diff") |>
## filter out comparisons of `id' with itself
filter(id != other) |>
## expand the dataframe as the cartesian product of
## its rows and the wanted `cutoff's (0:3)
crossing(cutoff = 0:3) |>
group_by(cutoff,id) |>
## for each cutoff and id combine the values in `other`
## that match cutoff and transform to the wanted <<wide>> format
summarize(other = paste(other[diff == cutoff], collapse = ",")) |>
pivot_wider(names_from = cutoff,
values_from = other,
names_prefix = "diff")
}) |>
## this step is needed so that `id' has the same type of the
## original dataframe
mutate(id = as.integer(id)) |>
right_join(df)
##> # A tibble: 6 × 10
##> # Groups: class [2]
##> class id diff0 diff1 diff2 diff3 q1 q2 q3 q4
##> <chr> <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int>
##> 1 Ali 12 "" "" "24,35" "" 1 2 3 3
##> 2 Ali 24 "35" "" "12" "" 2 2 4 3
##> 3 Ali 35 "24" "" "12" "" 2 2 4 3
##> 4 Tom 16 "36" "" "" "18" 1 2 4 2
##> 5 Tom 18 "" "" "" "16,36" 1 2 3 4
##> 6 Tom 36 "16" "" "" "18" 1 2 4 2