r string grouping data-wrangling fuzzyjoin

How to group similar spelled character strings together?

I have a table of 10,000 unique names. Using the package(fuzzyjoin) I would like to match these unique names to names that are only spelled with one different letter. I would like to group the different spelling matches so that they can be assessed for misspelling and corrected in the dataset.

An example using baby names is below. I try to match the original name column to a new column, name_match, based on a single letter difference (max_dist = 1).

So for example, Aadam and Adam are matched because they only differ by a single letter, the extra 'a'. I want to group these two together in the dataset so that a unique ID refers to both names since they are the same, but one is misspelled.

baby_data <- 
  data.frame(babynames) %>% 
  select(name) %>% 
  unique() %>% 
  filter(str_starts(name, 'A'))

name_match <- 
  baby_data %>% 
  stringdist_left_join(.,
                       baby_data %>% 
                         rename(name_match = name),
                       max_dist = 1,
                       by = c('name' = 'name_match')) %>% 
  # remove rows where the names are the same
  filter(!(name == name_match))

A few complications arise; first, Aadam and Adam both show up as matches for each other in both columns (i.e., Aadam = Adam, Adam = Aadam). I only need one record to show the match.

Second, how do I group them? Ideally, the final table would look something like:

ID	Name
1	Aadam
1	Adam
2	Anne
2	Ann

I feel like there is a case_when() or if_else() problem nested into this?

Solution

First, we remove all duplicates from name_match in the sense that e.g. (Adam, Aadam) = (Aadam, Adam). To the remaining entries we assign an id to each tuple and finally unpivot the data.

library(dplyr)

name_match |>

    # remove all "duplicates"
    rowwise() |>
    mutate(key = paste(sort(c(name, name_match)), collapse = "")) |>
    distinct(key, .keep_all = T) |>
    select(-key) |>
    ungroup() |>

    # assign an id to each tuple of names
    mutate(id = row_number()) |>
    tidyr::pivot_longer(name:name_match) |>
    select(-name) |>
    rename(name = value)

Note that every name can appear multiple times in the result. Here is one example:

# A tibble: 4 × 2
     id name 
  <int> <chr>
1  1958 Adam 
2  1958 Adham
3  1965 Adam 
4  1965 Aadam

Data

library(fuzzyjoin)
library(dplyr)

baby_data <- 
  babynames::babynames |> 
  select(name) |> 
  unique() |> 
  filter(stringr::str_starts(name, 'A'))

name_match <- 
  baby_data |> 
  stringdist_left_join(baby_data |> 
                         rename(name_match = name),
                       max_dist = 1,
                       by = c('name' = 'name_match')) |> 
  filter(!(name == name_match))