[SOLVED] How do I group companies having different names but are essentially the same semantically?

I am doing competitor analysis using Open Government Data from the UK public sector, but there are some anomalies in my results. When I group the contracts by company name, many companies are misspelt or appear under varying names, e.g. HP, Hewlett-Packard, Hewlett-Packard Limited, ibm, ibm UK, ibm UK limited, etc. I already ran my code once and fixed the results manually. Now I have changed some parts of the code and need to run it again, but I can't repeat the manual clean-up because it is too costly. At the moment I am thinking about writing a general rule that sorts these companies alphabetically and merges them when they match on the first few words. But that is not a foolproof approach, as HP and Hewlett-Packard would still be treated as different. Has anyone done similar work before, or can anyone point me to relevant work? I would be grateful. Thanks.

Solution

This is a problem I have worked on in the past, though in a different domain. You could start with an online source that lists companies and their abbreviations, scrape it, and store the pairs in a lookup structure (such as a hashmap). You can then compare each company name in your data against both the full name and the abbreviation, looking for a substring match and accepting it above some threshold (let's say 90%).
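As a rough illustration of that lookup-plus-threshold idea, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for the example: the `ABBREVIATIONS` table is hand-written rather than scraped, the `NOISE_WORDS` list is a guess at which suffixes to ignore, the function names are hypothetical, and it uses the similarity ratio from `difflib.SequenceMatcher` in place of a literal substring match.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table (abbreviation -> full name).
# In practice this would be filled from a scraped source.
ABBREVIATIONS = {
    "hp": "hewlett packard",
    "ibm": "international business machines",
}

# Assumed list of legal suffixes and region words to ignore when comparing.
NOISE_WORDS = {"limited", "ltd", "plc", "uk", "inc", "corp", "corporation"}


def normalize(name):
    """Lower-case, drop punctuation and noise words, collapse whitespace."""
    cleaned = name.lower().replace("-", " ")
    cleaned = "".join(ch for ch in cleaned if ch.isalnum() or ch.isspace())
    words = [w for w in cleaned.split() if w not in NOISE_WORDS]
    return " ".join(words)


def canonical_name(name, threshold=0.9):
    """Map a raw company name to a canonical full name when one matches."""
    key = normalize(name)
    # An exact abbreviation hit needs no fuzzy comparison.
    if key in ABBREVIATIONS:
        return ABBREVIATIONS[key]
    # Otherwise score the name against every abbreviation and full name.
    best_name, best_score = key, 0.0
    for abbr, full in ABBREVIATIONS.items():
        for candidate in (abbr, full):
            score = SequenceMatcher(None, key, candidate).ratio()
            if score > best_score:
                best_name, best_score = full, score
    # Below the threshold, keep the normalized name as its own group.
    return best_name if best_score >= threshold else key


def group_companies(names, threshold=0.9):
    """Group raw names under their canonical name."""
    groups = {}
    for name in names:
        groups.setdefault(canonical_name(name, threshold), []).append(name)
    return groups


if __name__ == "__main__":
    raw = ["HP", "Hewlett-Packard", "Hewlett-Packard Limited",
           "ibm", "ibm UK", "ibm UK limited"]
    for canonical, members in group_companies(raw).items():
        print(canonical, "<-", members)
```

Names that match nothing in the table stay in their own group under their normalized form, so the threshold only controls how aggressively near-misses are merged. It is worth spot-checking the merged groups, since a low threshold can join unrelated companies.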

Specific to your case, you can start by scraping http://www.abbreviations.com/acronyms/FIRMS using jsoup. It is a very rich source of company abbreviations. If this list doesn't suffice, you will have to look for other sources. Hope this helps.