google-refine

Merge ALMOST identical data rows


I have a large amount of data (UK & US Postal addresses) 100,000+, that contains duplicate or ALMOST identical data rows (with 5 columns) in the near identical rows four out of the five columns have exact matches of data for example:- AAAA BBBB CCCCCC CCCCCCCC CCCCCCCC 11.111 22.222 AAAA BBBB CCCCCC CCCCCCCC 11.111 22.222 DDDD EEEE FF FFFFF FFFFF FFFFFFFFF 33.33 44.444 DDDD EEEE FF FFFFF FFFFF 33.33 44.444 GGGG HHHH IIII IIIII IIIIIIII 55.555 66.666 GGGG HHHH IIII IIIII 55.555 66.666 i am trying to use Google Refine to remove these duplicate (or near duplicate rows) i just cant managed it what i want to end up with is:- AAAA BBBB CCCCCC CCCCCCCC CCCCCCCC 11.111 22.222 DDDD EEEE FF FFFFF FFFFF FFFFFFFFF 33.33 44.444 GGGG HHHH IIII IIIII IIIIIIII 55.555 66.666 e.g. Discard the column with "Shorter" data length


Solution

  • You can achieve this by doing following steps - 1. sort on column 1 2. sort on column 2 3. sort on column 4 4. sort on column 5 5. Reorder rows permanently (open on top) now you would see that - all the rows sorted permanently. do blank down on column 1.
    Result would be -
    ===============================================================

    AAAA BBBB CCCCCC CCCCCCCC CCCCCCCC 11.111 22.222
    BBBB CCCCCC CCCCCCCC 11.111 22.222
    DDDD EEEE FF FFFFF FFFFF FFFFFFFFF 33.33 44.444
    EEEE FF FFFFF FFFFF 33.33 44.444
    GGGG HHHH IIII IIIII IIIIIIII 55.555 66.666
    HHHH IIII IIIII 55.555 66.666

    ===================================================================  
    

    now select all the rows with blank on first column and delete all the rows.