I have a file with an identifier and a value:
ABC123 111111
ABC123 111111
ABCDEF 333333
ABCDEF 111111
CCCCCC 333333
ABC123 222222
DEF123 444444
DEF123 444444
Both columns contain duplicates, but I need to count the unique values (second column) that occur for each ID (first column). For the above input, the output would be:
ABCDEF 2
ABC123 2
DEF123 1
CCCCCC 1
...where the first column is the ID and the second column is the count of unique values for that ID. In other words, I need to find out how many unique values exist for a given ID.
The closest I've come is this, but all it does is count how many times each duplicated ID occurs in the first column:
cut -d " " -f1 "file.txt" | uniq -cd | sort -nr | head
How would I do something like this in Bash?
Is this close enough?
$ sort -u file.txt | cut -d' ' -f1 | uniq -c
2 ABC123
2 ABCDEF
1 CCCCCC
1 DEF123
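For comparison, the same counts can be computed in a single awk pass; this is just a sketch, and since awk's array iteration order is unspecified, you'd pipe the result through sort if you need the rows ordered:
$ awk '!seen[$0]++ { count[$1]++ } END { for (id in count) print count[id], id }' file.txt
Here seen[$0] makes each distinct '<identifier> <value>' pair count only once, and count[$1] accumulates the per-ID total.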
You can filter the uniq -c output further with | grep -vw '1' to mimic the HAVING COUNT(DISTINCT value) > 1 semantics and eliminate the last two rows of this example's output (assuming that 1 is not a legal value for an identifier!).
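If that assumption worries you, a numeric comparison in awk sidesteps it entirely; this sketch keeps only rows whose count (the first field produced by uniq -c) is greater than 1, so an identifier that happens to be 1 is unaffected:
$ sort -u file.txt | cut -d' ' -f1 | uniq -c | awk '$1 > 1' # numeric test on the count field only
2 ABC123
2 ABCDEF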
And you can of course reverse the column order in several ways. E.g.
$ sort -u file.txt | # sort and eliminate multiple occurrences of the same '<identifier> <value>' pair
cut -d' ' -f1 | # keep only the identifier
uniq -c | # collapse and count occurrences of the same identifier
grep -vw '1' | # eliminate rows containing the word '1', assuming this can only be a count value, never an identifier!
awk '{print $2 " " $1}' # reverse column order to show '<identifier> <count>'
ABC123 2
ABCDEF 2
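And if you want the IDs ordered by descending count, as with the sort -nr | head in your attempt, you can sort numerically on the count before reversing the columns (again only a sketch; how ties are ordered depends on your sort implementation):
$ sort -u file.txt | cut -d' ' -f1 | uniq -c | sort -nr | awk '{print $2 " " $1}'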