rdplyrtidyversedata-wranglingwide-format-data

R dplyr solution to pivot categorical matches wider


I feel like this is such a simple thing, but I'm having so much trouble wrapping my head around it for some reason.

Say I have a long-format data frame of two categorical variables like so:

df <- data.frame(
  table_name = c("tbl1", "tbl1", "tbl1", "tbl2", "tbl2", "tbl2", "tbl3", "tbl3",
                 "tbl3"),
  column_name = c("first name", "address", "city", "first name", "last name",
                  "ice cream", "last name", "potato", "ice cream")
)

which looks like this:

df

table_name  column_name
tbl1        first name          
tbl1        address         
tbl1        city            
tbl2        first name          
tbl2        last name           
tbl2        ice cream           
tbl3        last name           
tbl3        potato          
tbl3        ice cream

I'd like to pivot_wider() to essentially create a matrix of matched column_name between all combinations of table_name. The expected output would be something like this:

     tbl1 tbl2 tbl3
tbl1    3    1    0
tbl2    1    3    2
tbl3    0    2    3

If it helps at all, the use case here is that I want to build a network graph using igraph. I've got a database that I want to visualize all the column_name connections among table_name.


Solution

  • We could self-join, count matches by table_name, and pivot_wider:

    library(tidyverse)
    inner_join(df, df, join_by(column_name)) |>
      count(table_name.x, table_name.y) |>
      pivot_wider(names_from = table_name.y, values_from = n, values_fill = 0)
    
    # A tibble: 3 × 4
      table_name.x  tbl1  tbl2  tbl3
      <chr>        <int> <int> <int>
    1 tbl1             3     1     0
    2 tbl2             1     3     2
    3 tbl3             0     2     3