I have a file with a large number of rows. Each row contains 5 columns delimited by tabs. I want to find all rows that have the same value for the first 4 columns but have different values for the 5th column.
name age address phone city
eric 5 add1 1234 City1
jerry 5 add1 1234 City2
eric 5 add1 1234 City3
eric 5 add1 1234 City4
jax 5 add1 1234 City5
jax 5 add1 1234 City6
niko 5 add1 1234 City7
The result for this table should be
eric 5 add1 1234 City1
eric 5 add1 1234 City3
eric 5 add1 1234 City4
jax 5 add1 1234 City5
jax 5 add1 1234 City6
I tried using uniq -u -f4 after sort, but that ignores the first 4 fields when comparing, which in this case would return all the rows.
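That is, roughly this (the file name data is assumed):
$ sort data | uniq -u -f 4
Since -f 4 makes uniq skip the first four fields, it compares only the city column; every line then looks unique, so -u prints them all.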
I would be inclined to use awk for this.
script.awk
{ x = count[$1,$2,$3,$4]++; line[$1,$2,$3,$4,x] = $0 }
END {   for (key in count)
        {
            kc = count[key]
            if (kc > 1)
            {
                for (i = 0; i < kc; i++)
                {
                    print line[key,i]
                }
            }
        }
    }
For each line, increment the count of the number of rows with the first four field values as the key. Save the current line in the correct sequence. At the end, for each key where the count is more than one, print each of the saved lines for that key.
$ awk -f script.awk data
jax 5 add1 1234 City5
jax 5 add1 1234 City6
eric 5 add1 1234 City1
eric 5 add1 1234 City3
eric 5 add1 1234 City4
$
Note that this generates the keys in a different order from the one in which they appear in the file (the first eric, 5, add1, 1234 entry occurs before the first jax, 5, add1, 1234 entry).
It would be possible to resolve that if it is necessary to do so.
script2.awk
{   x = count[$1,$2,$3,$4]++
    line[$1,$2,$3,$4,x] = $0
    if (x == 0)
        seq[n++] = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
}
END {   for (s = 0; s < n; s++)
        {
            key = seq[s]
            kc = count[key]
            if (kc > 1)
            {
                for (i = 0; i < kc; i++)
                {
                    print line[key,i]
                }
            }
        }
    }
SUBSEP is the character used to separate the components of a multi-item key, so the assignment to seq[n++] records the value used as the index in count[$1,$2,$3,$4]. The seq array records each key (the first four columns) in the order in which it first appears. Stepping through that array in sequence gives the keys in the order in which the first entry appears.
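For example, a quick check at the command line shows that the comma form and the explicit SUBSEP form name the same array element:
$ awk 'BEGIN { a["x", "y"] = 1; if (("x" SUBSEP "y") in a) print "same element" }'
same element
$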
$ awk -f script2.awk data
eric 5 add1 1234 City1
eric 5 add1 1234 City3
eric 5 add1 1234 City4
jax 5 add1 1234 City5
jax 5 add1 1234 City6
$
The code above keeps a lot of data in memory. It has a complete copy of each line in the data file; it has a key with the first four fields; it has another key with the four fields plus an integer. For most practical purposes, that's 3 copies of the data. If the data file is large, that could be a problem. However, given that the sample data has jerry's row appearing in the middle of eric's rows, it is not possible to do much better unless the data is sorted first. Then you know that all the related rows are together in the file, and you can process it much more simply.
script3.awk
{
    new_key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
    if (new_key == old_key)
    {
        # Same key as the previous line: print the buffered first line of
        # the group (once), then the current line.
        if (old_line != "") { print old_line; old_line = "" }
        print $0
    }
    else
    {
        # New key: buffer this line in case duplicates follow.
        old_line = $0
        old_key = new_key
    }
}
$ sort data | awk -f script3.awk
eric 5 add1 1234 City1
eric 5 add1 1234 City3
eric 5 add1 1234 City4
jax 5 add1 1234 City5
jax 5 add1 1234 City6
$
Of course, it is coincidence that eric precedes jax in alphabetic sequence; by sorting, you lose the original data sequence. But the script3.awk script keeps at most two keys and one line in memory, which isn't going to stress anything in terms of memory. Even adding the sort time, it still probably gives you measurable savings over the original processing mechanism.
If the original order is critical, you have to do much more work. I think it involves numbering each line in the original file, sorting on the four key fields (so that rows with the same keys are grouped together) with the line number as a fifth key, tagging each group of rows with the same four key values with a group number (the line number of the group's first row), sorting again on the group number and the line number within the group, and feeding that to a modified version of the processing in the script3.awk script. This still might be better than the original if the files are in the gigabyte range, but the only way to be sure is to do measurements on realistically sized examples.
For example:
nl data |
sort -k2,2 -k3,3 -k4,4 -k5,5 -k1,1n |
awk '{  new_key = $2 SUBSEP $3 SUBSEP $4 SUBSEP $5
        if (old_key != new_key) { grp_seq = $1 }   # new group: remember its first line number
        print grp_seq, $0
        old_key = new_key
     }' |
sort -k1,1n -k2,2n
This much generates:
1 1 name age address phone city
2 2 eric 5 add1 1234 City1
2 4 eric 5 add1 1234 City3
2 5 eric 5 add1 1234 City4
3 3 jerry 5 add1 1234 City2
6 6 jax 5 add1 1234 City5
6 7 jax 5 add1 1234 City6
8 8 niko 5 add1 1234 City7
You can then apply a modified version of script3.awk that ignores $1 and $2 to generate the desired output. Or you could run the output shown through a program that strips the two leading columns off.
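For instance, one possible sketch of that modified script (the name script4.awk, the field positions and the whitespace-stripping regex are assumptions based on the pipeline output shown above, not tested code) builds the key from columns 3 to 6 and removes the group and line numbers before printing:
script4.awk
{
    new_key = $3 SUBSEP $4 SUBSEP $5 SUBSEP $6      # the four original key columns
    line = $0
    sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+/, "", line)   # drop the group number and line number
    if (new_key == old_key)
    {
        if (old_line != "") { print old_line; old_line = "" }
        print line
    }
    else
    {
        old_line = line
        old_key = new_key
    }
}
Appending | awk -f script4.awk to the pipeline above should then reproduce the rows that script2.awk produced, in the original order.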