I have a dataset with three columns that should theoretically have the same number of unique observations.
Here is a sample of the data:
speciesID common_name species
s001 common lizard Zootoca vivipara
s002 social tuco-tuco Ctenomys sociabilis
s002 social tuco-tuco Ctenomys sociabilis
s002 social tuco-tuco Ctenomys sociabilis
s002 social tuco-tuco Ctenomys sociabilis
s002 social tuco-tuco Ctenomys sociabilis
s003 red grouse Lagopus lagopus scoticus
s003 red grouse Lagopus lagopus scoticus
s004 elk Cervus elaphus
The full dataset can be found here.
But, when I check the number of unique observations, they don't match up.
df %>% as_tibble() %>% count(speciesID) %>% nrow() #148 unique values
df %>% as_tibble() %>% count(common_name) %>% nrow() #150 unique values
df %>% as_tibble() %>% count(species) %>% nrow() #147 unique values
Is there a way to figure out which where the 2 missing unique values are from the speciesID
column and the 3 missing unique values are from the species
column?
Ideally, I would like to be able to identify the problem rows so that I can go back into the raw data and fix the errors (i.e., there should be 150 unique records).
My hope is that there is a way to do this in R instead of manually checking ~700 rows of data.
I did explore using anti_join
, but this hasn't been very successful.
I work in R and am most comfortable with dplyr
.
One way you can find what is duplicative:
aa <- readr::read_csv("clean_data_species.csv")[,-1]
distinct(aa) |>
filter(.by = speciesID, n() > 1)
# # A tibble: 6 × 3
# speciesID common_name species
# <chr> <chr> <chr>
# 1 s011 banner-tailed kangaroo rat Dipodomys spectabilis
# 2 s011 dwarf mongoose Helogale parvula
# 3 s030 north american red squirrel Tamiasciurus hudsonicus
# 4 s030 eurasian red squirrel Sciurus vulgaris
# 5 s045 northern spotted owl Strix occidentalis caurina
# 6 s045 grey red-backed vole/grey-sided vole Clethrionomys rufocanus
distinct(aa) |>
filter(.by = common_name, n() > 1)
# # A tibble: 2 × 3
# speciesID common_name species
# <chr> <chr> <chr>
# 1 s015 great tit Parus major
# 2 s073 great tit Parus major
distinct(aa) |>
filter(.by = species, n() > 1)
# # A tibble: 8 × 3
# speciesID common_name species
# <chr> <chr> <chr>
# 1 s015 great tit Parus major
# 2 s020 gray jay Perisoreus canadensis
# 3 s073 great tit Parus major
# 4 s074 pied babbler Turdoides bicolor
# 5 s109 eurasian kestrel Falco tinnunculus
# 6 s110 southern pied babblers Turdoides bicolor
# 7 s129 canada jay Perisoreus canadensis
# 8 s106 common kestrel Falco tinnunculus
Each of the rows indicates duplicated values in one of the columns. For instance, in speciesID
, we have three values that have 2 rows in them: s011
, s030
, and s045
, and we can see here the specific values for common_name
and species
for those pairs.
Similarly, for common_name
(only one level is duplicated, nearly perfectly in fact) and species
(four pairs, some with commonalities).