bash, awk, sed

Count unique values in one column based on a duplicate in another column


I have a file with an identifier and a value:

ABC123 111111
ABC123 111111
ABCDEF 333333
ABCDEF 111111
CCCCCC 333333
ABC123 222222
DEF123 444444
DEF123 444444

Both columns contain duplicate values, but I need to count, for each ID (first column), how many unique values (second column) it has. For the input above, the output would be:

ABCDEF 2
ABC123 2
DEF123 1
CCCCCC 1

...where the first column is the ID and the second column is the count of unique values associated with that ID. In other words, I need to find out how many unique values exist for a given ID.

The closest I've come is this, but all it does is count the first column's unique values:

cut -d " " -f1 "file.txt" | uniq -cd | sort -nr | head

How would I do something like this in Bash?


Solution

  • Is this close enough?

    $ sort -u file.txt | cut -d' ' -f1 | uniq -c
       2 ABC123
       2 ABCDEF
       1 CCCCCC
       1 DEF123
    

    You can further filter it with a | grep -vw '1' to mimic the HAVING COUNT(DISTINCT value) > 1 semantics and eliminate the last two rows from the output in this example (assuming that 1 is not a legal value for an identifier!).
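    If the caveat about 1 possibly being an identifier bothers you, a slightly safer variation (a sketch, not part of the original answer) is to test the count field explicitly with awk instead of grepping the whole line; on the sample file above that would give:

    $ sort -u file.txt | cut -d' ' -f1 | uniq -c | awk '$1 > 1'   # keep only rows whose count (field 1) is greater than 1
       2 ABC123
       2 ABCDEF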

    And you can of course reverse the column order in several ways. E.g.

    $ sort -u file.txt  |        # sort and eliminate multiple occurrences of the same '<identifier> <value>' pair
        cut -d' ' -f1   |        # keep only the identifier
        uniq -c         |        # collapse and count occurrences of the same identifier
        grep -vw '1'    |        # eliminate rows containing the word '1', assuming this can only be a count value, never an identifier!
        awk '{print $2 " " $1}'  # reverse column order to show '<identifier> <count>'
    ABC123 2
    ABCDEF 2
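
    And if you would rather stay in a single awk process, something along these lines should produce the same counts (a sketch, not from the answer above; !seen[$1 FS $2]++ is true only the first time a given '<identifier> <value>' pair appears, and the trailing sort is only there because awk emits the array keys in arbitrary order):

    $ awk '!seen[$1 FS $2]++ { cnt[$1]++ }              # count each distinct "<identifier> <value>" pair once per identifier
           END { for (id in cnt) print id, cnt[id] }' file.txt | sort   # print "<identifier> <count>", sorted for stable output
    ABC123 2
    ABCDEF 2
    CCCCCC 1
    DEF123 1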