rdplyranti-join

Identify missing unique values across multiple columns in R


I have a dataset with three columns that should theoretically have the same number of unique observations.

Here is a sample of the data:

speciesID                          common_name                       species
        s001                        common lizard              Zootoca vivipara
        s002                     social tuco-tuco           Ctenomys sociabilis
        s002                     social tuco-tuco           Ctenomys sociabilis
        s002                     social tuco-tuco           Ctenomys sociabilis
        s002                     social tuco-tuco           Ctenomys sociabilis
        s002                     social tuco-tuco           Ctenomys sociabilis
        s003                           red grouse      Lagopus lagopus scoticus
        s003                           red grouse      Lagopus lagopus scoticus
        s004                                  elk                Cervus elaphus

The full dataset can be found here.

But, when I check the number of unique observations, they don't match up.

df %>% as_tibble() %>% count(speciesID) %>% nrow() #148 unique values       
df %>% as_tibble() %>% count(common_name) %>% nrow() #150 unique values     
df %>% as_tibble() %>% count(species) %>% nrow() #147 unique values

Is there a way to figure out which where the 2 missing unique values are from the speciesID column and the 3 missing unique values are from the species column?

Ideally, I would like to be able to identify the problem rows so that I can go back into the raw data and fix the errors (i.e., there should be 150 unique records).

My hope is that there is a way to do this in R instead of manually checking ~700 rows of data.

I did explore using anti_join, but this hasn't been very successful.

I work in R and am most comfortable with dplyr.


Solution

  • One way you can find what is duplicative:

    aa <- readr::read_csv("clean_data_species.csv")[,-1]
    distinct(aa) |>
      filter(.by = speciesID, n() > 1)
    # # A tibble: 6 × 3
    #   speciesID common_name                          species                   
    #   <chr>     <chr>                                <chr>                     
    # 1 s011      banner-tailed kangaroo rat           Dipodomys spectabilis     
    # 2 s011      dwarf mongoose                       Helogale parvula          
    # 3 s030      north american red squirrel          Tamiasciurus hudsonicus   
    # 4 s030      eurasian red squirrel                Sciurus vulgaris          
    # 5 s045      northern spotted owl                 Strix occidentalis caurina
    # 6 s045      grey red-backed vole/grey-sided vole Clethrionomys rufocanus   
    distinct(aa) |>
      filter(.by = common_name, n() > 1)
    # # A tibble: 2 × 3
    #   speciesID common_name species    
    #   <chr>     <chr>       <chr>      
    # 1 s015      great tit   Parus major
    # 2 s073      great tit   Parus major
    distinct(aa) |>
      filter(.by = species, n() > 1)
    # # A tibble: 8 × 3
    #   speciesID common_name            species              
    #   <chr>     <chr>                  <chr>                
    # 1 s015      great tit              Parus major          
    # 2 s020      gray jay               Perisoreus canadensis
    # 3 s073      great tit              Parus major          
    # 4 s074      pied babbler           Turdoides bicolor    
    # 5 s109      eurasian kestrel       Falco tinnunculus    
    # 6 s110      southern pied babblers Turdoides bicolor    
    # 7 s129      canada jay             Perisoreus canadensis
    # 8 s106      common kestrel         Falco tinnunculus    
    

    Each of the rows indicates duplicated values in one of the columns. For instance, in speciesID, we have three values that have 2 rows in them: s011, s030, and s045, and we can see here the specific values for common_name and species for those pairs.

    Similarly, for common_name (only one level is duplicated, nearly perfectly in fact) and species (four pairs, some with commonalities).