r  ggplot2  tidy  pairwise-distance

Compute differences between all variable pairs in R


I have a dataframe with 4 columns.

set.seed(123)
df <- data.frame(A = round(rnorm(1000, mean = 1)),
                 B = rpois(1000, lambda = 3),
                 C = round(rnorm(1000, mean = -1)),
                 D = round(rnorm(1000, mean = 0)))

I would like to compute the differences for every possible combination of my columns (A-B, A-C, A-D, B-C, B-D, C-D) for every row of my dataframe. This would be the equivalent of doing df$A - df$B for every combination.

Can the dist() function compute this efficiently? I have a very large dataset. I would then like to convert the dist object into a data.frame to plot the results with ggplot2. Alternatively, is there a good tidy way of doing the above?

Many Thanks

The closest I got was the code below, but I am not sure which pair of variables each output column refers to.

d <- apply(as.matrix(df), 1, function(e) as.vector(dist(e)))
t(d)

Solution

  • Using base R (note that dist() returns absolute differences, i.e. |A - B| rather than A - B):

    df_dist <- t(apply(df, 1, dist))
    colnames(df_dist) <- apply(combn(names(df), 2), 2, paste0, collapse = "_")
    

    If you really want to use a tidy approach, you could go with c_across(), but this also removes the names and is much slower if your data is huge.
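
The c_across() route could look something like the following. This is only a sketch, not the answerer's original code: the object names pair_names, df_dist_tidy and df_long are made up for illustration, and spreading the result with tidyr::unnest_wider() is an assumption about how one might get the columns back. The pair names are attached by hand, and the last step reshapes the result to long format so it can be passed to ggplot2.

```r
library(dplyr)
library(tidyr)

set.seed(123)
df <- data.frame(A = round(rnorm(1000, mean = 1)),
                 B = rpois(1000, lambda = 3),
                 C = round(rnorm(1000, mean = -1)),
                 D = round(rnorm(1000, mean = 0)))

# Pair labels in the same order that dist() emits its values:
# A_B, A_C, A_D, B_C, B_D, C_D (hypothetical helper vector)
pair_names <- combn(names(df), 2, paste0, collapse = "_")

df_dist_tidy <- df %>%
  rowwise() %>%
  # dist() on the row's values gives the 6 absolute pairwise differences;
  # store them as a named vector inside a list-column
  mutate(d = list(setNames(as.vector(dist(c_across(everything()))),
                           pair_names))) %>%
  ungroup() %>%
  # spread the named vector into one column per pair
  unnest_wider(d)

# Long format: one row per (original row, pair), convenient for ggplot2
df_long <- df_dist_tidy %>%
  pivot_longer(all_of(pair_names), names_to = "pair", values_to = "diff")
```

As in the base R version, the values are absolute differences because they come from dist(). Since rowwise() evaluates dist() once per row, the apply() version is likely the better choice for large data.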