Let's say we took the mtcars data and ran a PCA. We then want to know which car brands are most similar in PC space, i.e. their nearest neighbors, so someone ran a nearest-neighbors analysis and recorded the results.
I am then given a dataframe that looks like this, with the focal cars in the car column and the first and second nearest neighbors, nn1 and nn2, listed in their own columns:
tibble(car = c("Honda", "Toyota", "Mazda", "Fiat", "Lotus"),
       nn1 = c("Toyota", "Honda", "Toyota", "Lotus", "Mazda"),
       nn2 = c("Mazda", "Mazda", "Honda", "Honda", "Fiat"))
# A tibble: 5 × 3
car nn1 nn2
<chr> <chr> <chr>
1 Honda Toyota Mazda
2 Toyota Honda Mazda
3 Mazda Toyota Honda
4 Fiat Lotus Honda
5 Lotus Mazda Fiat
I would like to convert this to a one-hot style dataframe, where the 5 focal car brands are the rows and the columns are the possible neighbors, each encoded 0 or 1 depending on whether or not it was one of the nearest neighbors of the focal car. As a tibble, it would look like this:
# A tibble: 5 × 6
cars Honda Toyota Mazda Fiat Lotus
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Honda 0 1 1 0 0
2 Toyota 1 0 1 0 0
3 Mazda 1 1 0 0 0
4 Fiat 1 0 0 0 1
5 Lotus 0 0 1 1 0
or it could be a dataframe like this:
Honda Toyota Mazda Fiat Lotus
Honda 0 1 1 0 0
Toyota 1 0 1 0 0
Mazda 1 1 0 0 0
Fiat 1 0 0 0 1
Lotus 0 0 1 1 0
This is more of an adjacency matrix than a one-hot encoding matrix. Calling your data df:
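library(tidyr)
library(dplyr)

df |>
  # one row per (car, neighbor) pair; the neighbor's name lands in `value`
  pivot_longer(-car) |>
  # flag every observed pair with a 1
  mutate(fill = 1) |>
  # spread neighbors into columns; pairs that never occur are filled with 0
  pivot_wider(id_cols = car, names_from = value, values_from = fill, values_fill = 0)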
# # A tibble: 5 × 6
# car Toyota Mazda Honda Lotus Fiat
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Honda 1 1 0 0 0
# 2 Toyota 0 1 1 0 0
# 3 Mazda 1 0 1 0 0
# 4 Fiat 0 0 1 1 0
# 5 Lotus 0 1 0 0 1
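Note that the columns above come out in order of first appearance (Toyota, Mazda, Honda, ...), not in the row order. If you want the second layout from the question (a square matrix with row names, and columns in the same order as the rows), here is a minimal base R sketch. It assumes every neighbor in nn1 and nn2 also appears in the car column; the names cars and adj are just illustrative, and df is rebuilt here only so the snippet runs on its own.

```r
# Example data from the question, as a plain data.frame
df <- data.frame(
  car = c("Honda", "Toyota", "Mazda", "Fiat", "Lotus"),
  nn1 = c("Toyota", "Honda", "Toyota", "Lotus", "Mazda"),
  nn2 = c("Mazda", "Mazda", "Honda", "Honda", "Fiat"),
  stringsAsFactors = FALSE
)

cars <- df$car

# Square matrix of zeros, with the focal cars as both row and column names
adj <- matrix(0, nrow = length(cars), ncol = length(cars),
              dimnames = list(cars, cars))

# Index with a two-column character matrix of (row name, column name) pairs
# and set those cells to 1 (assumes all neighbors are also focal cars)
adj[cbind(df$car, df$nn1)] <- 1
adj[cbind(df$car, df$nn2)] <- 1

adj
```

With this data, adj should have ones in exactly the cells shown in the question's second table, and as.data.frame(adj) gives the dataframe version with the cars as row names.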