I'm attempting to do some distance matching in R and am struggling to achieve a usable output.
I have a dataframe terms that contains 5 strings of text, along with a category for each string. I have a second dataframe notes that contains 10 poorly spelt words, along with a NoteID.
I want to be able to compare each of my 5 terms against each of my 10 notes using a distance algorithm to try to grab simple spelling errors. I have tried:
near_match<- subset(notes, jarowinkler(notes$word, terms$word) >0.9)
NoteID Note
5 e5 thought
10 e5 tough
and
jarowinkler(notes$word, terms$word)
[1] 0.8000000 0.7777778 0.8266667 0.8833333 0.9714286 0.8000000 0.8000000 0.8266667 0.8833333 0.9500000
The first instance is almost what I need; it just lacks the word from terms that has caused the match. The second returns 10 scores, but I'm not sure whether the algorithm checked each of the 5 terms against each of the 10 notes in turn and just returned the closest match (highest score) or not.
How can I alter the above to achieve my desired output, if what I want is achievable using jarowinkler(), or is there a better option?
I'm relatively new to R, so I'd appreciate any help in furthering my understanding of how the algorithm generates the scores and what the approach to achieve my desired output would be.
Example dataframes below.
Thanks
> notes
NoteID word
1 a1 hit
2 b2 hot
3 c3 shirt
4 d4 than
5 e5 thought
6 a1 hat
7 b2 get
8 c3 shirt
9 d4 than
10 e5 tough
> terms
Category word
1 a hot
2 b got
3 a shot
4 d that
5 c though
Your data.frames:
notes<-data.frame(NoteID=c("a1","b2","c3","d4","e5","a1","b2","c3","d4","e5"),
word=c("hit","hot","shirt","than","thought","hat","get","shirt","that","tough"))
terms<-data.frame(Category=c("a","b","c","d","e"),
word=c("hot","got","shot","that","though"))
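Use stringdistmatrix (package stringdist) with method "jw". With the default p = 0 this gives the Jaro distance; pass p = 0.1 to get the Jaro-Winkler variant. Unlike the jarowinkler() scores in the question, which are similarities, the result is a distance, so 0 means identical and larger values mean less similar: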
library(stringdist)
dist<-stringdistmatrix(notes$word,terms$word,method = "jw")
row.names(dist)<-as.character(notes$word)
colnames(dist)<-as.character(terms$word)
Now you have all distances:
dist
              hot       got       shot       that     though
hit     0.2222222 0.4444444 0.27777778 0.27777778 0.50000000
hot     0.0000000 0.2222222 0.08333333 0.27777778 0.33333333
shirt   0.4888889 1.0000000 0.21666667 0.36666667 0.54444444
than    0.4722222 1.0000000 0.50000000 0.16666667 0.38888889
thought 0.3571429 0.5158730 0.40476190 0.40476190 0.04761905
hat     0.2222222 0.4444444 0.27777778 0.08333333 0.50000000
get     0.4444444 0.2222222 0.47222222 0.47222222 0.50000000
shirt   0.4888889 1.0000000 0.21666667 0.36666667 0.54444444
that    0.2777778 0.4722222 0.33333333 0.00000000 0.38888889
tough   0.4888889 0.4888889 0.51666667 0.51666667 0.05555556
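As for why the call in the question returned only 10 scores: an element-wise comparison of a length-10 vector with a length-5 vector typically recycles the shorter one, so each note is compared with just one term rather than with all five. The sketch below illustrates this with stringdist() from the same package, which is documented to recycle the shorter argument. The names notes_word, terms_word, pairwise, full and recycled are introduced here for illustration only, and p = 0.1 is assumed to be the prefix weight behind the similarity scores in the question.

```r
library(stringdist)

# Same words as in the data.frames above (illustrative vector names)
notes_word <- c("hit", "hot", "shirt", "than", "thought",
                "hat", "get", "shirt", "that", "tough")
terms_word <- c("hot", "got", "shot", "that", "though")

# Element-wise distances: terms_word (length 5) is recycled to length 10,
# so note i is compared only with term ((i - 1) %% 5) + 1
pairwise <- stringdist(notes_word, terms_word, method = "jw", p = 0.1)

# The same values can be picked out of the full 10 x 5 matrix
full <- stringdistmatrix(notes_word, terms_word, method = "jw", p = 0.1)
recycled <- full[cbind(1:10, rep(1:5, times = 2))]
all.equal(pairwise, recycled)

# Distance to similarity: 1 - distance
round(1 - pairwise, 7)
```

Under these assumptions, 1 - pairwise should line up with the kind of scores shown in the question (for example, "thought" against "though"), which suggests those 10 values were one comparison per note, not the best of five.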
Find the closest term for each note:
output<-cbind(notes,word_close=terms[as.numeric(apply(dist, 1, which.min)),"word"],dist_min=apply(dist, 1, min))
output
   NoteID    word word_close   dist_min
1      a1     hit        hot 0.22222222
2      b2     hot        hot 0.00000000
3      c3   shirt       shot 0.21666667
4      d4    than       that 0.16666667
5      e5 thought     though 0.04761905
6      a1     hat       that 0.08333333
7      b2     get        got 0.22222222
8      c3   shirt       shot 0.21666667
9      d4    that       that 0.00000000
10     e5   tough     though 0.05555556
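The table above keeps only the single closest term per note. If every term within some cutoff is wanted instead (a note such as "hot" is close to both "hot" and "shot"), one possible approach is to index the distance matrix with which(..., arr.ind = TRUE). This is only a sketch: the cutoff of 0.1 and the names d, hits and all_matches are assumptions made for illustration.

```r
library(stringdist)

notes <- data.frame(
  NoteID = c("a1", "b2", "c3", "d4", "e5", "a1", "b2", "c3", "d4", "e5"),
  word = c("hit", "hot", "shirt", "than", "thought",
           "hat", "get", "shirt", "that", "tough"),
  stringsAsFactors = FALSE
)
terms <- data.frame(
  Category = c("a", "b", "c", "d", "e"),
  word = c("hot", "got", "shot", "that", "though"),
  stringsAsFactors = FALSE
)

# Full 10 x 5 distance matrix (rows = notes, columns = terms)
d <- stringdistmatrix(notes$word, terms$word, method = "jw")

# Row/column positions of every pair under the (assumed) cutoff
hits <- which(d < 0.1, arr.ind = TRUE)

# One row per note/term pair, with the term and its category attached
all_matches <- data.frame(
  NoteID = notes$NoteID[hits[, "row"]],
  word = notes$word[hits[, "row"]],
  term = terms$word[hits[, "col"]],
  Category = terms$Category[hits[, "col"]],
  dist = d[hits]
)

# Sort by note order for readability
all_matches <- all_matches[order(hits[, "row"]), ]
all_matches
```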
If you want word_close to keep only the matches below a certain distance threshold (0.1 in this case), you can do this:
output[output$dist_min>=0.1,c("word_close","dist_min")]<-NA
output
   NoteID    word word_close   dist_min
1      a1     hit       <NA>         NA
2      b2     hot        hot 0.00000000
3      c3   shirt       <NA>         NA
4      d4    than       <NA>         NA
5      e5 thought     though 0.04761905
6      a1     hat       that 0.08333333
7      b2     get       <NA>         NA
8      c3   shirt       <NA>         NA
9      d4    that       that 0.00000000
10     e5   tough     though 0.05555556
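As a possible shortcut, the nearest-match and threshold steps can probably be combined with amatch() from the same package, which returns the position of the closest entry in a lookup table, or NA when nothing lies within maxDist. This is a sketch, not a drop-in replacement: idx and output2 are illustrative names, and the Category column is added here as an assumption about what might be useful downstream.

```r
library(stringdist)

notes <- data.frame(
  NoteID = c("a1", "b2", "c3", "d4", "e5", "a1", "b2", "c3", "d4", "e5"),
  word = c("hit", "hot", "shirt", "than", "thought",
           "hat", "get", "shirt", "that", "tough"),
  stringsAsFactors = FALSE
)
terms <- data.frame(
  Category = c("a", "b", "c", "d", "e"),
  word = c("hot", "got", "shot", "that", "though"),
  stringsAsFactors = FALSE
)

# Index of the closest term for each note; NA if no term is within maxDist
idx <- amatch(notes$word, terms$word, method = "jw", maxDist = 0.1)

# Attach the matched term and its category (NA where there was no match)
output2 <- cbind(notes,
                 word_close = terms$word[idx],
                 Category = terms$Category[idx])
output2
```

With the example data this should leave the same notes unmatched as the thresholded table above.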