Tags: linux, shell, unix, duplicate-data, uniq

How to find rows in a file with the same values for specific columns using Unix commands?


I have a file with a large number of rows. Each row contains 5 columns delimited by tabs. I want to find all rows that have the same value for the first 4 columns but have different values for the 5th column.

name     age    address    phone    city
eric      5      add1      1234     City1
jerry     5      add1      1234     City2
eric      5      add1      1234     City3
eric      5      add1      1234     City4
jax       5      add1      1234     City5
jax       5      add1      1234     City6
niko      5      add1      1234     City7

The result for this table should be

 eric      5      add1      1234     City1
 eric      5      add1      1234     City3
 eric      5      add1      1234     City4
 jax       5      add1      1234     City5
 jax       5      add1      1234     City6

I tried using uniq -u -f4 after sort, but that skips the first 4 fields when comparing, so it compares only the 5th column and in this case returns all the rows.
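
Presumably the attempt was something like the following (shown only to illustrate the behaviour: -f4 makes uniq skip the first 4 fields, so the comparison uses only the 5th column, every line looks unique, and -u prints them all):

sort data | uniq -u -f4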


Solution

  • I would be inclined to use awk for this.

    script.awk

    { x = count[$1,$2,$3,$4]++; line[$1,$2,$3,$4,x] = $0 }
    END {   for (key in count)
            {
                kc = count[key]
                if (kc > 1)
                {
                    for (i = 0; i < kc; i++)
                    {
                        print line[key,i]
                    }
                }
            }
        }
    

    For each line, increment the count of the number of rows with the first four field values as the key. Save the current line in the correct sequence. At the end, for each key where the count is more than one, print each of the saved lines for that key.

    Sample run

    $ awk -f script.awk data
    jax       5      add1      1234     City5
    jax       5      add1      1234     City6
    eric      5      add1      1234     City1
    eric      5      add1      1234     City3
    eric      5      add1      1234     City4
    $
    

    Note that this generates the keys in a different order from the one in which they appear in the file (in the file, the first eric, 5, add1, 1234 entry occurs before the first jax, 5, add1, 1234 entry, but jax is printed first).

    It would be possible to resolve that if it is necessary to do so.

    script2.awk

    {   x = count[$1,$2,$3,$4]++;
        line[$1,$2,$3,$4,x] = $0
        if (x == 0)
            seq[n++] = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
    }
    END {   for (s = 0; s < n; s++)
            {
                key = seq[s]
                kc = count[key]
                if (kc > 1)
                {
                    for (i = 0; i < kc; i++)
                    {
                        print line[key,i]
                    }
                }
            }
        }
    

    SUBSEP is the character used to separate components of a multi-item key, so the assignment to seq[n++] records the value used as the index in count[$1,$2,$3,$4]. The seq array records each key (the first four columns) in the order in which they appear. Stepping through that array in sequence gives the keys in the order in which the first entry appears.
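
    A minimal, self-contained check of that equivalence (hypothetical, not part of the scripts above):

    $ awk 'BEGIN { a["x", "y"] = 1; k = "x" SUBSEP "y"; if (k in a) print "same subscript" }'
    same subscript
    $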

    Sample run

    $ awk -f script2.awk data
    eric      5      add1      1234     City1
    eric      5      add1      1234     City3
    eric      5      add1      1234     City4
    jax       5      add1      1234     City5
    jax       5      add1      1234     City6
    $
    

    Preprocessing the data to save memory and speed up processing

    The code above keeps a lot of data in memory. It has a complete copy of each line in the data files; it has a key with the first four fields; it has another key with the four fields plus an integer. For most practical purposes, that's 3 copies of the data. If the data files are large, that could be a problem. However, given that the sample data has jerry's row appearing in the middle of eric's rows, it is not possible to do much better unless the data is sorted first. Then you know that all the related rows are together in the file, and you can process it much more simply.

    script3.awk

    {
        new_key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
        if (new_key == old_key)
        {
            # Second or later row for this key: print the buffered first row
            # (once only), then the current row.
            if (old_line != "") { print old_line; old_line = "" }
            print $0
        }
        else
        {
            # First row for a new key: remember it but do not print it yet.
            old_line = $0
            old_key = new_key
        }
    }
    

    Sample run

    $ sort data | awk -f script3.awk
    eric      5      add1      1234     City1
    eric      5      add1      1234     City3
    eric      5      add1      1234     City4
    jax       5      add1      1234     City5
    jax       5      add1      1234     City6
    $
    

    Of course, it is a coincidence that eric precedes jax in alphabetic sequence; by sorting, you lose the original data order. But the script3.awk script keeps at most two keys and one line in memory, which is not going to stress anything. Even after adding the sort time, this probably still gives measurable savings over the original processing mechanism.
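
    One way to check is to time both approaches on a realistically sized file (big-data is a hypothetical name; the numbers will depend on the data and the machine):

    $ time awk -f script.awk big-data > /dev/null
    $ time sh -c 'sort big-data | awk -f script3.awk > /dev/null'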

    If the original order is critical, you have to do much more work. I think it involves numbering each line in the original file; sorting with the four key columns as the primary sort keys and the line number as a fifth key, so that rows with identical keys are grouped together in their original order; tagging each row with the line number of the first row in its group; sorting again on that group number and the line number within the group; and then feeding the result to a modified version of the processing in the script3.awk script. This still might be better than the original if the files are in the gigabyte range. However, the only way to be sure is to do measurements on realistically sized examples.

    For example:

    nl data |
    sort -k2,2 -k3,3 -k4,4 -k5,5 -k1,1n |
    awk '{ new_key = $2 SUBSEP $3 SUBSEP $4 SUBSEP $5
           # Tag each row with the line number of the first row in its group
           if (old_key != new_key) { grp_seq = $1 }
           print grp_seq, $0
           old_key = new_key
         }' |
    sort -k1,1n -k2,2n
    

    This much generates:

    1      1    name     age    address    phone    city
    2      2    eric      5      add1      1234     City1
    2      4    eric      5      add1      1234     City3
    2      5    eric      5      add1      1234     City4
    3      3    jerry     5      add1      1234     City2
    6      6    jax       5      add1      1234     City5
    6      7    jax       5      add1      1234     City6
    8      8    niko      5      add1      1234     City7
    

    You can then apply a modified version of script3.awk that ignores $1 and $2 to generate the desired output. Or you could run the output shown through a program that stripped the two leading columns off.
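
    A sketch of that modified script, here called script4.awk (a hypothetical name; it is keyed on fields 3-6 because the group number and line number now occupy the first two fields, and it strips those two leading fields before printing):

    script4.awk

    {
        new_key = $3 SUBSEP $4 SUBSEP $5 SUBSEP $6
        line = $0
        # Drop the group number and line number added by the pipeline
        sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+/, "", line)
        if (new_key == old_key)
        {
            if (old_line != "") { print old_line; old_line = "" }
            print line
        }
        else
        {
            old_line = line
            old_key = new_key
        }
    }

    Appending | awk -f script4.awk to the pipeline above should then reproduce the desired rows, with the groups in order of first appearance.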