I have a tibble of just 1 column called 'title'.
> dat
# A tibble: 13 x 1
title
<chr>
1 lymphoedema clinic
2 zostavax shingles vaccine
3 xray operator
4 workplace mental health wellbeing workshop
5 zostavax recall toolkit
6 xray meetint
7 workplace mental health and wellbeing
8 lymphoedema early intervenstion
9 lymphoedema expo
10 lymphoedema for breast care nurses
11 xray meeting and case studies
12 xray online examination
13 xray operator in service paediatric extremities
I wish to find similar records and group them together as such (all the while keeping their indices):
> dat
# A tibble: 13 x 1
title
<chr>
1 lymphoedema clinic
8 lymphoedema early intervenstion
9 lymphoedema expo
10 lymphoedema for breast care nurses
2 zostavax shingles vaccine
5 zostavax recall toolkit
3 xray operator
6 xray meetint
11 xray meeting and case studies
12 xray online examination
13 xray operator in service paediatric extremities
4 workplace mental health wellbeing workshop
7 workplace mental health and wellbeing
I'm using the below function to find strings that are close enough to each other (cutoff = 0.75)
compareJW <- function(string1, string2, cutoff)
{
require(RecordLinkage)
jarowinkler(string1, string2) > cutoff
}
I've implemented the loop below to 'send' similar records together in a new dataframe but it's not working properly, I've tried a few variations but nothing is working yet.
# create new database
newDB <- data.frame(matrix(ncol = ncol(dat), nrow = 0))
colnames(newDB) <- names(dat)
newDB <- as_tibble(newDB)
for(i in 1:nrow(dat))
{
# print(dat$title[i])
for(j in 1:nrow(dat))
{
print(dat$title[i])
print(dat$title[j])
# score <- jarowinkler(dat$title[i], dat$title[j])
if(dat$title[i] != dat$title[j]
&&
compareJW(dat$title[i], dat$title[j], 0.75))
{
print("if")
# newDB <- rbind(newDB,
# dat$title[i],
# dat$title[j])
}
else
{
print("else")
# newDB <- rbind(newDB, dat$title[i])
}
}
}
(I've inserted prints in the loop 'to see what's happening')
REPRODUCIBLE DAT:
dat <-
structure(list(title = c("lymphoedema clinic", "zostavax shingles vaccine",
"xray operator", "workplace mental health wellbeing workshop",
"zostavax recall toolkit", "xray meetint", "workplace mental health and wellbeing",
"lymphoedema early intervenstion", "lymphoedema expo", "lymphoedema for breast care nurses",
"xray meeting and case studies", "xray online examination", "xray operator in service paediatric extremities"
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"
))
Any suggestions please? EDIT: I'd also like a new index column called 'group' as below:
> dat
# A tibble: 13 x 1
index group title
<chr>
1 1 lymphoedema clinic
8 1 lymphoedema early intervenstion
9 1 lymphoedema expo
10 1 lymphoedema for breast care nurses
2 2 zostavax shingles vaccine
5 2 zostavax recall toolkit
3 3 xray operator
6 3 xray meetint
11 3 xray meeting and case studies
12 3 xray online examination
13 3 xray operator in service paediatric extremities
4 4 workplace mental health wellbeing workshop
7 4 workplace mental health and wellbeing
I'm afraid I've never tried RecordLinkage
, but if you're just using the Jaro-Winkler distance it should also be fairly easy to cluster similar strings with the stringdist
package. Using your dput
above:
library(tidyverse)
library(stringdist)
map_dfr(dat$title, ~ {
i <- which(stringdist(., dat$title, "jw") < 0.40)
tibble(index = i, title = dat$title[i])
}, .id = "group") %>%
distinct(index, .keep_all = T) %>%
mutate(group = as.integer(group))
Explanation:
map_dfr
iterates over each string in dat$title
, extracts the indices of the closest matches computed by stringdist
(constrained by 0.40, i.e. your "threshold"), creates a tibble with the indices and matches, then stacks these tibbles with a group
variable corresponding to the integer position (and row number) of the original string. distinct
then drops any cluster duplicates based on repeats of index
.
Output:
# A tibble: 13 x 3
group index title
<int> <int> <chr>
1 1 1 lymphoedema clinic
2 1 8 lymphoedema early intervenstion
3 1 9 lymphoedema expo
4 1 10 lymphoedema for breast care nurses
5 2 2 zostavax shingles vaccine
6 2 5 zostavax recall toolkit
7 2 11 xray meeting and case studies
8 3 3 xray operator
9 3 6 xray meetint
10 3 12 xray online examination
11 3 13 xray operator in service paediatric extremities
12 4 4 workplace mental health wellbeing workshop
13 4 7 workplace mental health and wellbeing
An interesting alternative would be to use tidytext
with widyr
to tokenize by word and compute the cosine similarity of the titles based on similar words, rather than characters as above.