I would like to compare two large tables of objects from two different databases.
Each object is a scientific publication with 30 properties. The aim is to identify duplicate titles. But as the titles have not been entered in exactly the same way in the two databases, I want to make my comparison on the publication title using the Levenshtein distance implemented in this answer: Compare Strings Javascript Return %of Likely
I've tested two ways of doing this, but both processes are extremely time-consuming. Do you think there's a more efficient way of doing it?
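To give an idea of the shape of the problem, both attempts boil down to a pairwise loop of roughly this form (a simplified sketch, not my actual code; `similarity` stands in for the %-of-likeness function from the linked answer, and the data is placeholder):

```javascript
// Simplified shape of the comparison: every title of table A against every title of table B.
// Trivial placeholder for the %-of-likeness function; the real one uses Levenshtein.
const similarity = (a, b) => (a === b ? 1 : 0);

const pubsA = [{ title: 'Graph theory basics' }, { title: 'Deep learning for cats' }];
const pubsB = [{ title: 'Graph theory basics' }, { title: 'Quantum things' }];

const likelyDuplicates = [];
for (const a of pubsA) {
  for (const b of pubsB) { // O(n * m) similarity calls: this is what gets slow
    if (similarity(a.title, b.title) >= 0.8) likelyDuplicates.push([a, b]);
  }
}
console.log(likelyDuplicates);
```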
Thank you very much for your answers.
This question ties back to a basic performance bottleneck inherent to all programming languages. I will assume your objective is to "simply" lower the processing time by any means necessary (so not just optimise the code itself but also change the way the calculations are done), because there isn't much else to go on.
In that case, per my experience, these are the ways most would solve it. In this answer, I'll give you solutions, not implementations; you may subsequently research each of them further, and maybe implement a mix of them.

Throw threads at it

Basically, use parallelism to speed up the process. Your situation can be relatively easily threaded.
This solution is often the go-to one for any slow-processing problem, but it isn't always the best one and often adds substantial overhead. There are numerous posts and articles on the topic, so I won't dive into the details here, but in your case worker threads would probably help speed up the processing (I don't think you need child processes, but I'll name-drop them here just in case). The only hurdle will be figuring out how to split the array, process it, and merge the results back.
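To make that concrete, here's a minimal single-file sketch using Node's worker_threads module. The Levenshtein implementation, the sample data, and the 0.8 threshold are all placeholders for illustration, not your actual code:

```javascript
// Minimal single-file worker_threads sketch: split table A into chunks,
// let each worker compare its chunk against all of table B, merge results.
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
const os = require('os');

// Plain dynamic-programming Levenshtein distance (placeholder; swap in your own).
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
      );
    }
  }
  return dp[a.length][b.length];
}

// %-of-likeness, same spirit as the linked answer.
const similarity = (a, b) =>
  1 - levenshtein(a, b) / (Math.max(a.length, b.length) || 1);

if (isMainThread) {
  // Placeholder data standing in for the two database exports.
  const titlesA = ['Deep learning for cats', 'Graph theory basics'];
  const titlesB = ['Deep learning for cats.', 'Quantum things'];

  const chunkSize = Math.ceil(titlesA.length / os.cpus().length);
  const jobs = [];
  for (let i = 0; i < titlesA.length; i += chunkSize) {
    const chunk = titlesA.slice(i, i + chunkSize);
    jobs.push(new Promise((resolve, reject) => {
      const worker = new Worker(__filename, { workerData: { chunk, titlesB } });
      worker.on('message', resolve);
      worker.on('error', reject);
    }));
  }

  Promise.all(jobs).then((perWorker) =>
    console.log('Likely duplicates:', perWorker.flat())
  );
} else {
  // Worker side: brute-force its chunk against table B.
  const { chunk, titlesB } = workerData;
  const matches = [];
  for (const a of chunk) {
    for (const b of titlesB) {
      if (similarity(a, b) >= 0.8) matches.push([a, b]); // arbitrary threshold
    }
  }
  parentPort.postMessage(matches); // the worker exits once its event loop is empty
}
```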
Use a faster library

A bit like scientific publications (ironically), building on top of the state of the art is usually a good idea. Although I have nothing to say against the implementation you linked, I am unsure of its performance. There is a package called fastest-levenshtein which publishes performance benchmarks; depending on how fast (or slow) it is compared to the function you already have, you might be better off using the package's implementation. Beyond swapping the function, though, there isn't much to optimise.
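As a quick illustration (assuming npm install fastest-levenshtein; the similarity wrapper below mirrors the %-of-likeness idea from the linked answer and is not part of the package's API):

```javascript
const { distance } = require('fastest-levenshtein');

// Same %-of-likeness idea as the linked answer, backed by the faster distance.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length) || 1;
  return 1 - distance(a, b) / longest;
}

console.log(similarity('Graph theory basics', 'Graph theory basics!')); // 0.95
```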
Use a stepped comparison

This is a bit more theoretical, but you could consider a stepped process for your comparison to minimise the number of expensive operations. This way of doing things might not be the fastest, especially for fewer items, but done right it can lead to faster times. As a quick thought, you could organise the following processing pipeline (a runnable sketch follows the closing remark below):

1. Normalise the titles and remove exact duplicates first (a cheap strict-equality check).
2. Continue stripping the array with other cheap filters, e.g. discard pairs whose lengths differ too much for the similarity to ever reach your threshold.
3. Run the Levenshtein comparison only on the pairs that remain.
This method, though definitely neither the best nor the fastest, can still help strip the number of expensive operations down to a minimum by removing obvious cases first. It wouldn't be my first thought, but it's a thought; I'm adding this option mostly for completeness' sake.
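Here's a hedged sketch of such a pipeline, reusing the fastest-levenshtein package from above; the normalisation rules and the 0.8 threshold are illustrative choices, not requirements:

```javascript
const { distance } = require('fastest-levenshtein');

// Cheap canonical form: lower-case, strip punctuation, collapse whitespace.
function normalise(title) {
  return title
    .toLowerCase()
    .replace(/[^\p{L}\p{N} ]/gu, '')
    .replace(/\s+/g, ' ')
    .trim();
}

function findDuplicates(titlesA, titlesB, threshold = 0.8) {
  const pairs = [];
  // Map of normalised title -> original title for table B.
  const normB = new Map(titlesB.map((b) => [normalise(b), b]));
  const leftovers = [];

  // Step 1: exact match after normalisation, no Levenshtein needed.
  for (const a of titlesA) {
    const na = normalise(a);
    if (normB.has(na)) pairs.push([a, normB.get(na)]);
    else leftovers.push([a, na]);
  }

  // Steps 2 and 3: cheap length filter, then Levenshtein only on survivors.
  for (const [a, na] of leftovers) {
    for (const [nb, b] of normB) {
      const longest = Math.max(na.length, nb.length) || 1;
      // The distance is at least the length difference, so the similarity
      // can never reach the threshold when lengths differ too much.
      if (Math.abs(na.length - nb.length) > (1 - threshold) * longest) continue;
      if (1 - distance(na, nb) / longest >= threshold) pairs.push([a, b]);
    }
  }
  return pairs;
}

console.log(findDuplicates(
  ['Graph Theory: Basics', 'Deep learning for cats'],
  ['graph theory basics', 'Quantum things']
));
// -> [ [ 'Graph Theory: Basics', 'graph theory basics' ] ]
```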