python pandas cluster-analysis record-linkage

Efficient Method to Group by Columns


I've been trying to deduplicate strings of names over a very, very large dataset. I've been using the recordlinkage library. However, while it does generate a nice list of paired indices, it does not provide any way to re-group them. I've run several similarity measures on the strings, and then filtered down to the likely matches using

vectors.loc[(vectors['phonetic_similarity'] > 0.980) & (vectors['unsorted_similarity'] > 0.95)]

This generates a dataframe which contains all of the names which are most likely to be true matches with each other. I then decided, in order to standardize the names, that using the most common name for a certain person would be most appropriate, as the most common representation would likely not be a typo.

However, attempting to group together the names in order to actually determine what the mode is has proven an incredibly challenging task.

level_0   level_1   name_a          name_b
533010    821030    John Smith      John Smit h
821030    346721    John Smit h     John Smith
411234    441422    Jack Anderson   Jack Anderson
912034    123468    Jack Anderson   Jack Anderson
162162    974930    Annie Lawson    Anie Lawson
921234    974930    Annie Lawson    Anie Lawson
133435    123468    Jack Andersan   Jack Anderson
441422    123468    Jack Anderson   Jack Anderson
234561    162162    Annie Lawson    Annie Lawson

Of course, the names have been replaced, and this sample is nowhere near the one-million-plus rows of the real dataset, but it does illustrate the issue I'm facing. I tried using pandas' groupby, but that returns a GroupBy object rather than a new dataframe or series, which is what I need in order to obtain the mode (unless there is some way to use groupby that I am unaware of). I also tried using .loc in this fashion:

# runner starts as a copy of the likely-match pairs dataframe; groups starts as an empty list
try:
    while not runner.empty:
        # take the first remaining pair and pull out its two index values
        pair = runner.iloc[0][['level_0', 'level_1']]
        # grab every row that shares an index value with that pair
        matches = runner.loc[(pair['level_0'] == runner['level_0']) |
                             (pair['level_1'] == runner['level_0']) |
                             (pair['level_0'] == runner['level_1']) |
                             (pair['level_1'] == runner['level_1'])]
        groups.append(matches)
        # remove the matched rows so they aren't processed again
        runner = runner.drop(matches.index.to_numpy())
except IndexError:
    print("a miracle!")
finally:
    display(groups)

Though it obviously hangs after a certain point. This method is also flawed for another reason: it doesn't repeat the .loc within the group it extracts, so it fails to find all matches for a given "chain" of index numbers. However, I couldn't think of an efficient way to do that without a lot of mostly fruitless looping, which would most likely mean execution never finishing over the full 1 million+ pairs.

I also thought of trying some kind of clustering, but after a few hours of fiddling with it I decided it was most likely overkill, even though it seems like almost the perfect application for it. My worry was that the algorithms would just cluster the index numbers based on the distances between their values, rather than on which rows those numbers appear in together, if that makes sense, and I couldn't think of any sensible way to remedy that.

As you can likely tell, I'm fairly new to data science, and could really use any help at all with this issue. I have been working at it for several days now and the lack of meaningful progress I've made is quite frustrating.

EDIT: I'm still working on this issue. I've moved on to an 'isin' strategy: first, I .loc all rows containing the level_0 and level_1 values from the first row; then I pass that collection of 'first level' matches to an .isin call, which returns every row containing any of those 'first level' values. I keep doing this until the .isin call stops returning anything larger than the last round. This method is... even slower. Is it even possible to do this sort of operation quickly on a large dataframe?
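
For reference, the expansion I'm describing looks roughly like this (just a sketch; `likely` here stands for the dataframe of likely-match pairs produced by the filter above, and the whole thing has to be repeated for every group, which is presumably where all the time goes):

import pandas as pd

# start from the two index values in the first remaining row
seed = likely.iloc[0][['level_0', 'level_1']].tolist()
group = likely[likely['level_0'].isin(seed) | likely['level_1'].isin(seed)]

while True:
    # collect every index value seen in the group so far...
    members = pd.unique(group[['level_0', 'level_1']].to_numpy().ravel())
    # ...and pull in any row touching one of those values
    expanded = likely[likely['level_0'].isin(members) | likely['level_1'].isin(members)]
    if len(expanded) == len(group):   # nothing new found: the 'chain' is complete
        break
    group = expanded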


Solution

  • Right, so this actually turned out to be a graph problem. Classic. After 3 days of bashing my head against the wall, it finally dawned on me that the reason the problem was so difficult was that I was looking at it the wrong way. Think of each (level_0, level_1) pair as an edge in a graph whose nodes are the unique values across both columns. The graph is undirected because of the nature of the problem: name a being similar to name b necessitates that name b also be similar to name a. Drawing out all the nodes and edges, we arrive here: undirected graph of my example

    The graph of my example table reveals a simple truth: all the 'chain linked' records about a specific individual form a connected component of the graph. The problem can therefore be solved most efficiently by finding the connected components, which is exactly what graph libraries are built for. Initially I was going to use NetworkX, but then I discovered how horrendously inefficient it is, so I ended up going with igraph, though I'm aware NetworKit is a bit more efficient still. The following is the code I used to solve my issue. If anyone else runs into a similar issue, I hope they somehow find this 0 score post.

    import pandas as pd
    import igraph as ig

    display('preparing nodes...')
    indices = likely[['level_0', 'level_1']]
    # the 'nodes' in our graph are all the unique elements across both level_0 and level_1.
    nodes = pd.concat([pd.Series(indices['level_0'].unique()), pd.Series(indices['level_1'].unique())]).drop_duplicates().tolist()
    # our 'edges' are (level_0, level_1), and are obviously undirected. However, since igraph and even NetworKit need node ids to be consecutive integers starting at 0,
    # we need to do a little funny business so that we know that e.g. node '0' actually refers to index 1043307 in the original dataframe.
    # and by 'funny business', I mean we have to make a gigantic lookup table. note: if we were adding/removing nodes this would be unfeasible
    node_lookup = pd.Series(nodes)
    rev_node_lookup = pd.Series(node_lookup.index.values, index=node_lookup)
    # for efficiency of lookup, the node at index i is represented by i on the graph & vice versa. this is because .at/.iat are absurdly fast
    display('preparing edges...')
    edge_tuples = list(indices.itertuples(index=False, name=None))
    edges = [(rev_node_lookup.at[a], rev_node_lookup.at[b]) for a, b in edge_tuples]
    display('building graph...')
    G = ig.Graph(edges)  # unweighted, undirected by default
    display('graph built! determining connected components...')
    clusters = G.clusters(mode='weak')  # 'clusters' finds connected components. since the graph is undirected, weak and strong components coincide
    display('complete! converting all numbers back...')
    groups = list(clusters)
    # all we need to do now is replace the values in our clusters with the real indexes, and then we can find the modes of each!
    groups = [[node_lookup.iat[i] for i in group] for group in groups]
    display('done! all duplicates are now arranged in groups.')
    
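    With the groups in hand, the last step from the question, picking the most common spelling for each person, is just a mode over each group's names. A rough sketch of that step (assuming the level_0/level_1 values are row labels of the original names dataframe, called df here, and that the names live in a column called name; neither of those is shown above):

    canonical = {}
    for group in groups:
        names = df.loc[group, 'name']      # every recorded spelling for this person
        best = names.mode().iat[0]         # most common spelling; ties broken arbitrarily
        for idx in group:
            canonical[idx] = best
    # map every original row back to its standardized name
    df['name_standardized'] = df.index.to_series().map(canonical).fillna(df['name'])
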

    The creation of the edges could most likely be done more efficiently, but like I said, this runs in what I might describe as 'reasonable time' (about 3 minutes), and I'm really quite ready to move on to actually working on my project, so for now it will do just fine.
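
    For what it's worth, the per-row .at lookups could probably be replaced with a single vectorized map along these lines (a sketch I haven't benchmarked against the version above):

    import numpy as np

    # map both columns through the reverse lookup in one go instead of row by row
    src = indices['level_0'].map(rev_node_lookup).to_numpy()
    dst = indices['level_1'].map(rev_node_lookup).to_numpy()
    edges = np.column_stack([src, dst]).tolist()   # plain-int [src, dst] pairs for igraph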