I have a large amount of data (UK & US Postal addresses) 100,000+, that contains duplicate or ALMOST identical data rows (with 5 columns)
in the near identical rows four out of the five columns have exact matches of data
for example:-
AAAA BBBB CCCCCC CCCCCCCC CCCCCCCC 11.111 22.222
AAAA BBBB CCCCCC CCCCCCCC 11.111 22.222
DDDD EEEE FF FFFFF FFFFF FFFFFFFFF 33.33 44.444
DDDD EEEE FF FFFFF FFFFF 33.33 44.444
GGGG HHHH IIII IIIII IIIIIIII 55.555 66.666
GGGG HHHH IIII IIIII 55.555 66.666
i am trying to use Google Refine to remove these duplicate (or near duplicate rows)
i just cant managed it
what i want to end up with is:-
AAAA BBBB CCCCCC CCCCCCCC CCCCCCCC 11.111 22.222
DDDD EEEE FF FFFFF FFFFF FFFFFFFFF 33.33 44.444
GGGG HHHH IIII IIIII IIIIIIII 55.555 66.666
e.g. Discard the column with "Shorter" data length
You can achieve this by doing following steps -
1. sort on column 1
2. sort on column 2
3. sort on column 4
4. sort on column 5
5. Reorder rows permanently (open on top)
now you would see that - all the rows sorted permanently.
do blank down on column 1.
Result would be -
===============================================================
AAAA BBBB CCCCCC CCCCCCCC CCCCCCCC 11.111 22.222
BBBB CCCCCC CCCCCCCC 11.111 22.222
DDDD EEEE FF FFFFF FFFFF FFFFFFFFF 33.33 44.444
EEEE FF FFFFF FFFFF 33.33 44.444
GGGG HHHH IIII IIIII IIIIIIII 55.555 66.666
HHHH IIII IIIII 55.555 66.666
===================================================================
now select all the rows with blank on first column and delete all the rows.