If I have a set of sentences and I would like to extract the duplicates, I should work like in the following example:
sentences<-c("So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brought this upon you, my",
"So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brought this upon you, my")
sentences[duplicated(sentences)]
which returns:
[1] "So there I was at the mercy of three monstrous trolls"
[2] "Today is my One Hundred and Eleventh birthday"
[3] "I'm sorry I brought this upon you, my"
But in my case I have sentences that are similar to each other (due to typos, for example) and I would like to select the ones that are more similar to each other. For example:
sentences<-c("So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brrrought this upon, my",
"So there I was at mercy of three monstrous troll",
"Today is One Hundred Eleventh birthday",
"I'm sorry I brought this upon you, my")
According to this example, I would like to select one between each of the following pairs:
I'm sorry I brought this upon you, my
I'm sorry I brrrought this upon, my
Today is One Hundred Eleventh birthday
Today is my One Hundred and Eleventh birthday
So there I was at the mercy of three monstrous trolls
So there I was at mercy of three monstrous troll
The levenshteinSim
function in the RecordLinkage
package could help me:
library(RecordLinkage)
levenshteinSim(sentences[1],sentences[2])
levenshteinSim(sentences[1],sentences[3])
levenshteinSim(sentences[1],sentences[4])
levenshteinSim(sentences[1],sentences[5])
levenshteinSim(sentences[1],sentences[6])
levenshteinSim(sentences[2],sentences[3])
levenshteinSim(sentences[2],sentences[4])
levenshteinSim(sentences[2],sentences[5])
levenshteinSim(sentences[2],sentences[6])
and so on, return values near 1 for the most similar sentences. I could write a double for loop
and select, e.g., those pairs of sentences that have a Levenshtein edit distance greater than 0.7 (e.g.). But, isn't there a more simple way of doing this?
You could calculate an approximate string distance matrix using adist
, which is based on a generalized Levenstein distance, and do hierarchical clustering afterwards using hclust
.
ld <- adist(tolower(sentences))
hc <- hclust(as.dist(ld))
data.frame(x=sentences, cl=cutree(hc, h=10))
# x cl
# 1 So there I was at the mercy of three monstrous trolls 1
# 2 Today is my One Hundred and Eleventh birthday 2
# 3 I'm sorry I brrrought this upon, my 3
# 4 So there I was at mercy of three monstrous troll 1
# 5 Today is One Hundred Eleventh birthday 2
# 6 I'm sorry I brought this upon you, my 3
To find an appropriate value for h=
eight in cutree
we may plot the dendrogram.
plot(hc)
abline(h=10, col=2, lty=2)