rdplyrcross-correlationhmisc

Cross sectional correlation across grouped data and summarized in latex table


I have a time-series panel dataset which is structured in the following way:

df <- data.frame(
  year = c(2012L, 2013L, 2014L, 2012L, 2013L, 2014L, 2015L),
  id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  col1 = c(11L, 13L, 13L, 16L, 15L, 15L, 16L),
   col2 = c(10L, 14L, 12L, 13L, 11L, 16L, 17L),
col3 = c(17L, 12L, 12L, 14L, 19L, 21L, 12L),
)
> df
  year id col1 col2 col3
1 2012  1   11   10   17
2 2013  1   13   14   12
3 2014  1   13   12   12
4 2012  2   16   13   14
5 2013  2   15   11   19
6 2014  2   15   16   21
7 2015  2   16   17   12
> 

I would like to generate a cross-sectional lower triangle correlation latex table across each column pair and across all groups but I want the final table to be the average across all groups and also include the p stat. This is what I have so far using dplyr:

library(dplyr)
df %>%
  group_by(id) %>%
  summarize(COR=cor(col1,col2))

But I would like to have this for all column pairs and in my actual dataset, I have many more ids. I would like to use xtable, stargazer, or Hmisc to generate a latex correlation table that has the average corr across groups as the output and also includes the p-value. I would like my final output to look like something like this: imgur.com/a/7Jwmm8f


Solution

  • An option would be to split by 'id' column, then apply the cor on the 'col' columns, get the elementwise + and divide by the length of unique 'id' and replace the upper.tri values to NA

    out <- Reduce(`+`, lapply(split(df[3:5], df$id),
          function(x) cor(x, use = "complete.obs")))/length(unique(df$id))
    out[upper.tri(out)] <- NA
    

    -output

    out
    #           col1      col2 col3
    #col1  1.0000000        NA   NA
    #col2  0.5902554  1.000000   NA
    #col3 -0.9807620 -0.569806    1
    

    Or using tidyverse

    library(dplyr)
    library(purrr)
    library(magrittr)
    df %>% 
      select(-year) %>%
      group_split(id, .keep = FALSE) %>%
      map(cor, use = "complete.obs") %>% 
      reduce(`+`) %>% 
      divide_by(n_distinct(df$id)) %>% 
      replace(., upper.tri(.), NA)
    #           col1      col2 col3
    #col1  1.0000000        NA   NA
    #col2  0.5902554  1.000000   NA
    #col3 -0.9807620 -0.569806    1