I am working in R
My data has 500,000 rows in, but small example is used here.
I have some data for staff working in schools. Some people work in one school, some two, some three etc. Each school they work in is a row of data.
Schools do not always record the first names the same for an individual. e.g. one school records as Will, another William.
I also have this assumption: for an individual working in more than one school, their second name and date of birth is always recorded the same in each school.
Based on the similarity of their first name, I want a way of identifying the people who are likely to the same person, and then allocate them an id.
There will be some sort of cut off e.g. greg and griffin are likely to not be the same person, even though they share the same first 2 letters.
sample data:
data_current <- data.frame(first_name = c("will", "william", "william", "laura", "jessica", "jessicalouise", "james", "greg", "griffin"),
last_name = c("smith", "smith", "smith", "maxwell", "maxwell", "maxwell", "lead", "jones", "jones"),
date_of_birth = c("2000-01-02","2000-01-02", "2000-01-02", "2007-01-02","2007-01-02","2007-01-02","1999-01-02","2004-01-02","2004-01-02"),
school_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
first_name | second_name | date_of_birth | school_id |
---|---|---|---|
will | smith | 2000-01-02 | 1 |
william | smith | 2000-01-02 | 2 |
william | smith | 2000-01-02 | 3 |
laura | maxwell | 2007-01-02 | 4 |
jessica | maxwell | 2007-01-02 | 5 |
jessicalouise | maxwell | 2007-01-02 | 6 |
james | lead | 1999-01-02 | 7 |
greg | jones | 2004-01-02 | 8 |
griffin | jones | 2004-01-02 | 9 |
desired data:
It is likely that the first three people are the same person, so are allocated the same person_id, and so on...
data_desired <- data.frame(first_name = c("will", "william", "william", "laura", "jessica", "jessicalouise", "james", "greg", "griffin"),
last_name = c("smith", "smith", "smith", "maxwell", "maxwell", "maxwell", "lead", "jones", "jones"),
date_of_birth = c("2000-01-02","2000-01-02", "2000-01-02", "2007-01-02","2007-01-02","2007-01-02","1999-01-02","2004-01-02","2004-01-02"),
school_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
person_id = c(1, 1, 1, 2, 3, 3, 4, 5, 6))
first_name | second_name | date_of_birth | school_id | person_id |
---|---|---|---|---|
will | smith | 2000-01-02 | 1 | 1 |
william | smith | 2000-01-02 | 2 | 1 |
william | smith | 2000-01-02 | 3 | 1 |
laura | maxwell | 2007-01-02 | 4 | 2 |
jessica | maxwell | 2007-01-02 | 5 | 3 |
jessicalouise | maxwell | 2007-01-02 | 6 | 3 |
james | lead | 1999-01-02 | 7 | 4 |
greg | jones | 2004-01-02 | 8 | 5 |
griffin | jones | 2004-01-02 | 9 | 6 |
Does anyone have an suggestions of how to solve this?
Here is one possibility using string similarity. It works well in your example because you have complete separation in your data. However, over 500K rows you may run into some issues with this, but it could get you started:
library(stringdist)
library(dplyr)
library(purrr)
group_first <- function(dat, cutoff = 0.5) {
if (n_distinct(dat) == 1) return(1)
M <- stringsimmatrix(dat$first_name, dat$first_name, method = "cosine", q = 2)
l <- unique(lapply(seq_len(nrow(M)), \(x) dat$first_name[M[x, ] > cutoff]))
person_id <- map(dat$first_name, ~ which(grepl(.x, l, fixed = T))) |>
modify_if(~ length(.x) > 1, ~ 0)
return(unlist(person_id))
}
data_current |>
mutate(person_id = group_first(pick(everything())), .by = c(date_of_birth, last_name)) |>
# review output before running next mutate
mutate(person_id = cur_group_id(), .by = c(date_of_birth, last_name, person_id))
Output
first_name last_name date_of_birth school_id person_id
1 will smith 2000-01-02 1 1
2 william smith 2000-01-02 2 1
3 william smith 2000-01-02 3 1
4 laura maxwell 2007-01-02 4 2
5 jessica maxwell 2007-01-02 5 3
6 jessicalouise maxwell 2007-01-02 6 3
7 james lead 1999-01-02 7 4
8 greg jones 2004-01-02 8 5
9 griffin jones 2004-01-02 9 6
How it works
As mentioned in the comments this can be tricky for names that have similar bigrams but are not the same. These cases will be hard to distinguish. Also, since this is just a lexical comparison it likely will not work well for shortened names. For example, "Bob" short for "Robert" or "Dick" short for "Richard".
You may also consider using other string similarity measures. For example, you can specify the Jaro–Winkler similarity (method = 'jw'
) with a weight (p = 0.10
). This will weight first names that start the same ("will" versus "william"), resulting in a higher similarity score and potential separation.