I'm new here in more than one sense. This is my first post, about my first script, in my first attempt at learning any programming language. In light of that you might find this project overly ambitious, but hey, learning by doing has always been the way to go. I'm doing my best to follow Stack Overflow etiquette here, but let me know if I'm in violation of anything.
I wanted to write a piece of code that can apply some kind of fuzzy logic to match a table of unstructured company names (e.g. Google) with a table of structured company names (e.g. Google Inc.) and Danish company identifiers (CVR).
I was able to find some bits of code by googling around, and I managed to adapt them to my project. I found that the Jaro-Winkler algorithm contained within the stringdist package works particularly well with company names. The script works perfectly fine when comparing and matching 40 unstructured company names against a few hundred structured names, but I need to compare and match around 4000 unstructured names against a table containing 700k structured names. As you might have guessed, this takes forever. To give you an idea, I tried matching 6 unstructured names against the 700k, which took three hours. At that pace, half an hour per name, a quick calculation tells me that processing all 4000 companies would take about 2000 hours, almost 3 months, which is a bit overwhelming. I understand that it has to do several billion string comparisons and that this cannot be done in a few minutes. If I could, however, cut it down to maybe just a few days, I would be more than happy, and I feel like this must be possible.
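For reference, here is a minimal illustration of the metric itself (toy names, not my actual data); Jaro-Winkler distances run from 0 (identical) to 1 (completely different):

library(stringdist)

stringdist("google", "google inc.", method = "jw")  # small distance: likely the same company
stringdist("google", "maersk a/s", method = "jw")   # large distance: different companies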
So, I'm looking for methods to speed up this piece of code. I've already managed to improve it somewhat by initially pairing up exact matches with the match() function, which leaves around 500 companies for further processing with the fuzzy matching algorithm. Still, that takes a long time, to say the least.
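As a toy illustration of that exact-match step (made-up names): match() returns the position of the first exact hit and NA otherwise, so the NA rows are the ones left over for fuzzy matching.

match(c("Google Inc.", "Some Unknown ApS"), c("Maersk A/S", "Google Inc."))
# [1]  2 NA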
I hope I managed to explain myself clearly! Any suggestions will be highly appreciated.
library(stringdist)
#Reading the two files to be compared and making sure that I'm dealing with characters
companies.unstructured <- read.csv(
  "https://www.dropbox.com/s/opbk0s2q14l5c71/unstructured_companies.csv?dl=0",
  sep = ";",
  stringsAsFactors = FALSE
)
companies.structured <- read.csv(
  "https://www.dropbox.com/s/kyi0rvz77frr7sd/structured_companies_w_CVR.csv?dl=0",
  sep = ";",
  stringsAsFactors = FALSE
)
#Using the match function to match up all 100% identical companies to avoid unnecessary workload for the Jaro-Winkler loop
companies.unstructured$CVR <- companies.structured$CVR[match(companies.unstructured$Company,
                                                             companies.structured$Company)]
companies.exact.match <- companies.unstructured[!is.na(companies.unstructured$CVR), ]
#Creating a subset to work on with the Jaro-Winkler loop.
companies.unstructured.na <- subset(companies.unstructured, is.na(CVR))
#And here's the loop measuring the distance between the company names using the Jaro-Winkler algorithm.
distance.methods <- c('jw')
dist.methods <- list()
for (m in 1:length(distance.methods)) {
  dist.name.enh <- matrix(NA, ncol = length(companies.structured$Company),
                          nrow = length(companies.unstructured.na$Company))
  for (i in 1:length(companies.structured$Company)) {
    for (j in 1:length(companies.unstructured.na$Company)) {
      dist.name.enh[j, i] <- stringdist(tolower(companies.structured[i, ]$Company),
                                        tolower(companies.unstructured.na[j, ]$Company),
                                        method = distance.methods[m])
    }
  }
  dist.methods[[distance.methods[m]]] <- dist.name.enh
}
#matching up pairs of minimum distance
match.s1.s2.enh <- NULL
for (m in 1:length(dist.methods)) {
  dist.matrix <- as.matrix(dist.methods[[distance.methods[m]]])
  min.name.enh <- apply(dist.matrix, 1, base::min)
  for (i in 1:nrow(dist.matrix)) {
    s2.i <- match(min.name.enh[i], dist.matrix[i, ])
    s1.i <- i
    match.s1.s2.enh <- rbind(data.frame(s2.i = s2.i,
                                        s1.i = s1.i,
                                        s1Company = companies.unstructured.na[s1.i, ]$Company,
                                        s2Company = companies.structured[s2.i, ]$Company,
                                        CVR = companies.structured[s2.i, ]$CVR,
                                        adist = min.name.enh[i],
                                        method = distance.methods[m]),
                             match.s1.s2.enh)
  }
}
EDIT: Here are some data examples to work with: structured_companies_w_CVR and unstructured_companies.
I profiled your code and found some speedups. I kept as close as possible to your naming conventions so you can compare the differences. I saved the files in my working directory for testing purposes.
I created an empty data frame up front with the columns and the number of records you need, and in the loop you update a record instead of using rbind. This speeds up the code quite a bit; I kept getting a system.time of 0. Because R doesn't know the final size of the data frame, it makes constant copies with rbind, which tends to slow down the process badly once you have a lot of rows. See also this post. Updating records in place is a lot faster, even if the preallocated data frame is bigger than you need.
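A minimal sketch of the difference, using a toy three-row loop rather than the real matching code:

# slow: rbind reallocates and copies the whole data frame on every iteration
df <- NULL
for (i in 1:3) df <- rbind(df, data.frame(x = i))

# fast: preallocate once, then assign into the existing rows
df2 <- data.frame(x = rep(NA_integer_, 3))
for (i in 1:3) df2$x[i] <- i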
Edit: I managed to remove everything except the match() function from the loop; the rest of the data frame can be filled from vectors / input already available in other parts of the data.
I added a parallel option to the code and used stringdistmatrix. This function runs in parallel if available, and it also removes the need for any loop in the distance calculation.
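To make the matrix orientation concrete (toy vectors, not your real tables): stringdistmatrix(a, b) returns a length(a) by length(b) matrix, so with the structured names as a, each column holds one unstructured name's distances to every structured name, and the column minima pick out the best candidates:

library(stringdist)
m <- stringdistmatrix(c("google inc.", "maersk a/s"),
                      c("google", "maersk"),
                      method = "jw")
dim(m)            # 2 2: rows = structured names, columns = unstructured names
apply(m, 2, min)  # smallest distance per unstructured name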
Code section:
library(stringdist)
library(parallel)
#Reading the two files to be compared and making sure that I'm dealing with characters
companies.unstructured <- read.csv("unstructured_companies.csv",
sep = ";",
stringsAsFactors = FALSE)
companies.structured <- read.csv("structured_companies_w_CVR.csv",
sep=";",
stringsAsFactors = FALSE)
#Using the match function to match up all 100% identical companies to avoid unnecessary workload for the Jaro-Winkler loop
companies.unstructured$CVR <- companies.structured$CVR[match(companies.unstructured$Company,
                                                             companies.structured$Company)]
companies.exact.match <- companies.unstructured[!is.na(companies.unstructured$CVR), ]
#Creating a subset to work on with the Jaro-Winkler loop.
companies.unstructured.na <- subset(companies.unstructured, is.na(CVR))
distance.method <- "jw"
# Parallel section starts here
# set the number of cores to use
cores <- 3
# initialize a PSOCK cluster from the parallel package
cl <- makeCluster(cores, type = "PSOCK")
# create the full distance matrix (rows = structured names,
# columns = unstructured names); see the stringdistmatrix documentation
dist.name.enh <- stringdistmatrix(tolower(companies.structured$Company),
                                  tolower(companies.unstructured.na$Company),
                                  method = distance.method,
                                  nthread = getOption("sd_num_thread"))
# get the minimum Jaro-Winkler distance from each column of the matrix
min.name.enh <- parApply(cl, dist.name.enh, 2, base::min)
# stop the cluster
stopCluster(cl)
# Parallel section ends here
# create dataframe prefilled with empty values.
match.s1.s2.enh2 <- data.frame(s2.i = rep(NA, nrow(companies.unstructured.na)),
                               s1.i = rep(NA, nrow(companies.unstructured.na)),
                               s1Company = rep(NA, nrow(companies.unstructured.na)),
                               s2Company = rep(NA, nrow(companies.unstructured.na)),
                               CVR = rep(NA, nrow(companies.unstructured.na)),
                               adist = rep(NA, nrow(companies.unstructured.na)),
                               method = rep(NA, nrow(companies.unstructured.na)))
# fill s2.i with NA values for the length needed in the for loop
s2.i <- rep(NA, ncol(dist.name.enh))
# matching up pairs of minimum distance.
for (i in 1:ncol(dist.name.enh)) {
  s2.i[i] <- match(min.name.enh[i], dist.name.enh[, i])
}
match.s1.s2.enh2$s2.i <- s2.i
match.s1.s2.enh2$s1.i <- 1:ncol(dist.name.enh)
match.s1.s2.enh2$s1Company <- companies.unstructured.na$Company
match.s1.s2.enh2$adist <- min.name.enh
match.s1.s2.enh2$method <- distance.method
match.s1.s2.enh2$s2Company <- companies.structured$Company[s2.i]
match.s1.s2.enh2$CVR <- companies.structured$CVR[s2.i]
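If you then want a single table of all matched names, one possible way to combine the exact matches with the fuzzy results (a sketch; it assumes you only need the name and the CVR downstream, so adjust the columns to your actual files):

# unstructured name paired with the CVR of its closest structured match
companies.matched <- rbind(
  companies.exact.match[, c("Company", "CVR")],
  data.frame(Company = match.s1.s2.enh2$s1Company,
             CVR = match.s1.s2.enh2$CVR)
)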