google-refine

Google Refine: use facet tools to infer a mapping between two columns


I've been searching but haven't found how to do this in Refine.

I've got two columns of unique IDs, A and B. For each a in A, I want to find the top 10 closest matches in B.

My backup plan is to just use Levenshtein to iterate ... but Refine has such a nice interface, and so many more algorithms implemented, that I was hoping to do some of the work with it.

Or is there another tool for doing this?


Solution

  • Did you know you can use a clustering algorithm such as fingerprint or ngramFingerprint (source) outside of the clustering interface in Refine?

    Using your IDs field, create a new column based on that column with the following expression: ngramFingerprint(value)

    You can now cross with your other data set on this new column. This might help to get more matches.
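
If you end up working outside Refine, the same idea can be sketched in Python. This is only a rough approximation under stated assumptions, not Refine's actual code: `ngram_fingerprint`, `match_by_key` and `top_matches` are hypothetical helper names, the normalization is a simplified guess at what a fingerprint keyer does, and the ranking uses the standard library's `difflib` similarity ratio, which is not Levenshtein distance.

```python
import difflib
import re
import unicodedata
from collections import defaultdict


def ngram_fingerprint(s, n=2):
    """Approximate n-gram fingerprint (a sketch, not Refine's exact algorithm)."""
    # Fold accents to ASCII, lowercase, and keep only letters and digits.
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    s = re.sub(r"[^a-z0-9]", "", s.lower())
    # Collect the unique n-grams, sort them, and join them into one key.
    grams = {s[i:i + n] for i in range(len(s) - n + 1)}
    return "".join(sorted(grams))


def match_by_key(col_a, col_b, n=2):
    """For each value in col_a, list the col_b values sharing its fingerprint."""
    index = defaultdict(list)
    for b in col_b:
        index[ngram_fingerprint(b, n)].append(b)
    return {a: index.get(ngram_fingerprint(a, n), []) for a in col_a}


def top_matches(a, col_b, k=10):
    """Return up to k values from col_b ranked by difflib similarity to a."""
    # cutoff=0.0 keeps every candidate so that exactly the k best are returned.
    return difflib.get_close_matches(a, col_b, n=k, cutoff=0.0)
```

`match_by_key` mirrors the "new column plus cross" step: values only match when their keys are identical. `top_matches` covers the original "top 10 closest" requirement, at the cost of comparing each a against every b.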