r dataframe dplyr tidyr

Convert a dataframe of nearest neighbors to one-hot encoding


Let's say we took the mtcars data and ran a PCA, and we want to know which car brands are most similar in PC space, i.e. their nearest neighbors. Someone ran a nearest-neighbors analysis and recorded the results.

I am then given a dataframe that looks like this, with the focal cars in the car column and the first and second nearest neighbors, nn1 and nn2, listed in their own columns.

tibble(car = c("Honda", "Toyota", "Mazda", "Fiat", "Lotus"),
       nn1 = c("Toyota", "Honda", "Toyota", "Lotus", "Mazda"),
       nn2 = c("Mazda", "Mazda", "Honda", "Honda", "Fiat"))
# A tibble: 5 × 3
  car    nn1    nn2  
  <chr>  <chr>  <chr>
1 Honda  Toyota Mazda
2 Toyota Honda  Mazda
3 Mazda  Toyota Honda
4 Fiat   Lotus  Honda
5 Lotus  Mazda  Fiat 

I would like to convert this to a one-hot style dataframe, where the 5 focal car brands are the rows and the columns are the possible neighbors, each encoded 0 or 1 depending on whether or not it was one of the nearest neighbors of the focal car. As a tibble, it would look like this:

# A tibble: 5 × 6
  cars   Honda Toyota Mazda  Fiat Lotus
  <chr>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1 Honda      0      1     1     0     0
2 Toyota     1      0     1     0     0
3 Mazda      1      1     0     0     0
4 Fiat       1      0     0     0     1
5 Lotus      0      0     1     1     0

Or it could be a dataframe with the cars as row names, like this:

       Honda Toyota Mazda Fiat Lotus
Honda      0      1     1    0     0
Toyota     1      0     1    0     0
Mazda      1      1     0    0     0
Fiat       1      0     0    0     1
Lotus      0      0     1    1     0

Solution

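  • What you describe is closer to an adjacency matrix than a one-hot encoding. Calling your data df, reshape it to long format, flag each (car, neighbor) pair with 1, and pivot back to wide, filling the absent pairs with 0: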

    library(tidyr)
    library(dplyr)
    df |>
      pivot_longer(-car) |>
      mutate(fill = 1) |>
      pivot_wider(id_cols = car, names_from = value, values_from = fill, values_fill = 0)
    # # A tibble: 5 × 6
    #   car    Toyota Mazda Honda Lotus  Fiat
    #   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
    # 1 Honda       1     1     0     0     0
    # 2 Toyota      0     1     1     0     0
    # 3 Mazda       1     0     1     0     0
    # 4 Fiat        0     0     1     1     0
    # 5 Lotus       0     1     0     0     1
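
Note that pivot_wider() orders the new columns by first appearance in the long data, so the result above does not match the column order requested in the question. A minimal sketch of one way to get that layout, assuming every focal car also appears as a neighbor at least once (true for this example; all_of() raises an error otherwise). The names adj, adj_df, and neighbor are illustrative, not part of the original answer:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Example data from the question
df <- tibble(car = c("Honda", "Toyota", "Mazda", "Fiat", "Lotus"),
             nn1 = c("Toyota", "Honda", "Toyota", "Lotus", "Mazda"),
             nn2 = c("Mazda", "Mazda", "Honda", "Honda", "Fiat"))

adj <- df |>
  # One row per (car, neighbor) pair
  pivot_longer(-car, values_to = "neighbor") |>
  mutate(fill = 1) |>
  # One column per neighbor; pairs that never occur become 0
  pivot_wider(id_cols = car, names_from = neighbor,
              values_from = fill, values_fill = 0) |>
  # Put the neighbor columns in the same order as the focal cars.
  # Assumption: each focal car occurs in nn1 or nn2 somewhere.
  select(car, all_of(df$car))

# Second layout from the question: a plain data frame with
# the focal cars as row names instead of a car column
adj_df <- column_to_rownames(adj, var = "car")
```

Here adj is the tibble version with columns in the order Honda, Toyota, Mazda, Fiat, Lotus, and adj_df is the row-named data frame version, which can be passed to as.matrix() if a true adjacency matrix is needed.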