I have a dataset of 40,000 rows x 4 columns and I need to compare each column against itself in order to find the closest match, i.e. the minimum Levenshtein distance. The idea is to get an "almost duplicate" for every row. I have calculated it with "adist", but it seems too slow. For example, comparing only one column, 5,000 rows against the whole column of 40,000 rows, takes almost 2 hours. That makes 8 hours for 4 columns, and 32 hours for the entire dataset. Is there any faster way to achieve the same? I need it to run in 1 or 2 hours if possible. This is an example of what I have done so far:
#vector example
a<-as.character(c("hello","allo","hola"))
b<-as.character(c("hello","allo","hola"))
#execution time
start_time <- Sys.time()
#Matrix with distance
dist.name<-adist(a,b, partial = TRUE, ignore.case = TRUE)
#time elapsed
end_time <- Sys.time()
end_time - start_time
Output:
Time difference of 5.873202 secs
#result
dist.name
     [,1] [,2] [,3]
[1,]    0    4    5
[2,]    2    0    2
[3,]    5    4    0
#minimum distance per row, excluding the zero diagonal
[1,]    4
[2,]    2
[3,]    4
You could try the stringdist package. It's written in C, uses parallel processing, and offers various distance metrics, including the Levenshtein distance.
library(stringdist)
a<-as.character(c("hello","allo","hola"))
b<-as.character(c("hello","allo","hola"))
start_time <- Sys.time()
res <- stringdistmatrix(a,b, method = "lv")
end_time <- Sys.time()
> end_time - start_time
Time difference of 0.006981134 secs
> res
     [,1] [,2] [,3]
[1,]    0    2    3
[2,]    2    0    3
[3,]    3    3    0
#set the zero self-distances to NA, then take the row-wise minimum
diag(res) <- NA
apply(res, 1, FUN = min, na.rm = TRUE)
[1] 2 2 3
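For the full dataset you probably don't want to build a 40,000 x 40,000 matrix in one go. Below is a minimal sketch of how this could be scaled up to one whole column by processing it in chunks; the function name closest_match, the vector col and the chunk size are illustrative assumptions, not something taken from your data.

library(stringdist)

#col: a character vector holding one column of ~40,000 values (assumed name)
closest_match <- function(col, chunk_size = 1000, nthread = parallel::detectCores()) {
  n <- length(col)
  nearest_idx  <- integer(n)
  nearest_dist <- numeric(n)
  for (start in seq(1, n, by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1, n)
    #Levenshtein distances of this chunk against the whole column
    d <- stringdistmatrix(col[idx], col, method = "lv", nthread = nthread)
    #each row's distance to itself is 0, so mask it before taking the minimum
    d[cbind(seq_along(idx), idx)] <- NA
    nearest_idx[idx]  <- apply(d, 1, which.min)
    nearest_dist[idx] <- apply(d, 1, min, na.rm = TRUE)
  }
  data.frame(value = col, closest = col[nearest_idx], distance = nearest_dist)
}

#example call on one column of your data frame (df is an assumed name)
#res <- closest_match(as.character(df[[1]]))

Run once per column, this gives every row its nearest "almost duplicate" and the corresponding distance, without ever holding the full 40,000 x 40,000 matrix in memory. If memory is not a concern, stringdistmatrix(col, method = "lv") with a single argument returns a dist object that computes each pair only once, which roughly halves the work for a self-comparison.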