I have a taxonomic dataset that looks like the following:
library(data.table)

sample_table <- data.table(Phylum = c("Arthropoda", "Arthropoda", "Arthropoda", "Arthropoda"),
                           Class = c("Arachnida", "Insecta", "Insecta", "Insecta"),
                           Order = c("Acariformes", "Coleoptera", "Coleoptera", "Coleoptera"),
                           Family = c(NA, "Staphylinidae", "Staphylinidae", "Staphylinidae"),
                           Genus = c(NA, "Staphylininae", "Staphylininae", "Staphylininae"),
                           Species = c(NA, NA, "Philonthus", "Philonthus"),
                           Site = c(5, 6, 6, 6),
                           Distance = c(0, 0, 0, 5),
                           N = c(58, 1, 5, 3))
# Phylum Class Order Family Genus Species Site Distance N
# Arthropoda Arachnida Acariformes <NA> <NA> <NA> 5 0 58
# Arthropoda Insecta Coleoptera Staphylinidae Staphylininae <NA> 6 0 1
# Arthropoda Insecta Coleoptera Staphylinidae Staphylininae Philonthus 6 0 5
# Arthropoda Insecta Coleoptera Staphylinidae Staphylininae Philonthus 6 5 3
Phylum, Class, Order, Family, Genus, and Species are all hierarchical taxonomic groups. Site and Distance refer to where the samples were caught. N is the number of samples caught at a given site and distance in a given taxonomic group.
Row 2 (R2)'s N value should include R3's N value, since R2 is a superset that includes R3. However, it currently only includes specimens that couldn't be identified past the Genus level (where Species is NA).
I would like my dataset to resemble the following:
# Phylum Class Order Family Genus Species Site Distance N
# Arthropoda <NA> <NA> <NA> <NA> <NA> 5 0 58
# Arthropoda Arachnida <NA> <NA> <NA> <NA> 5 0 58
# Arthropoda Arachnida Acariformes <NA> <NA> <NA> 5 0 58
# Arthropoda <NA> <NA> <NA> <NA> <NA> 6 0 6
# Arthropoda Insecta <NA> <NA> <NA> <NA> 6 0 6
# Arthropoda Insecta Coleoptera <NA> <NA> <NA> 6 0 6
# Arthropoda Insecta Coleoptera Staphylinidae <NA> <NA> 6 0 6
# Arthropoda Insecta Coleoptera Staphylinidae Staphylininae <NA> 6 0 6
# Arthropoda Insecta Coleoptera Staphylinidae Staphylininae Philonthus 6 0 5
# Arthropoda <NA> <NA> <NA> <NA> <NA> 6 5 3
# Arthropoda Insecta <NA> <NA> <NA> <NA> 6 5 3
# Arthropoda Insecta Coleoptera <NA> <NA> <NA> 6 5 3
# Arthropoda Insecta Coleoptera Staphylinidae <NA> <NA> 6 5 3
# Arthropoda Insecta Coleoptera Staphylinidae Staphylininae <NA> 6 5 3
# Arthropoda Insecta Coleoptera Staphylinidae Staphylininae Philonthus 6 5 3
As you can see, there is now a row for every taxonomic superset, for every site-distance pair. Each row's N-value includes specimens that were identified to that row's maximum taxonomic specificity and those identified beyond that row's maximum taxonomic specificity.
I have looked into using tidyr::complete() and a cross-join, but I'm not sure they're what I'm looking for. Unlike most examples I've been able to find, I want to complete my dataset without crossing taxonomic groups that are on separate branches, essentially by adding rows that right-fill taxonomic columns with NA (although there must be a better way than doing this manually). I'm also not sure how to use these strategies to fix my grouping/summing issues.
EDIT: I added a line to the sample dataset to reflect how Distance should be handled (i.e., the same as Site, where specimens sampled at different distances are not aggregated).
A data.table solution. Updated to reflect the question update.
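# The first six columns hold the taxonomic ranks, ordered from Phylum to Species
cols <- colnames(sample_table)[1:6]

# Expand one input row into one row per identified rank, blanking the deeper ranks
f <- function(dt) {
  i <- sum(!is.na(dt[, ..cols]))  # number of ranks identified in this row
  if (i > 1) {
    dt <- dt[rep(1L, i)]  # repeat the row i times
    # in column j, set rows 1..(j - 1) to NA, so that row k keeps only ranks 1..k
    for (j in 2:i) set(dt, i = 1:(j - 1), j = j, value = NA)
  }
  dt
}

# Expand row by row, drop the helper grouping column, then sum N per taxon, site and distance
sample_table[, f(.SD), 1:nrow(sample_table)][, nrow := NULL][
  , .(N = sum(N)), Phylum:Distance
]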
#> Phylum Class Order Family Genus Species Site Distance N
#> <char> <char> <char> <char> <char> <char> <num> <num> <num>
#> 1: Arthropoda <NA> <NA> <NA> <NA> <NA> 5 0 58
#> 2: Arthropoda Arachnida <NA> <NA> <NA> <NA> 5 0 58
#> 3: Arthropoda Arachnida Acariformes <NA> <NA> <NA> 5 0 58
#> 4: Arthropoda <NA> <NA> <NA> <NA> <NA> 6 0 6
#> 5: Arthropoda Insecta <NA> <NA> <NA> <NA> 6 0 6
#> 6: Arthropoda Insecta Coleoptera <NA> <NA> <NA> 6 0 6
#> 7: Arthropoda Insecta Coleoptera Staphylinidae <NA> <NA> 6 0 6
#> 8: Arthropoda Insecta Coleoptera Staphylinidae Staphylininae <NA> 6 0 6
#> 9: Arthropoda Insecta Coleoptera Staphylinidae Staphylininae Philonthus 6 0 5
#> 10: Arthropoda <NA> <NA> <NA> <NA> <NA> 6 5 3
#> 11: Arthropoda Insecta <NA> <NA> <NA> <NA> 6 5 3
#> 12: Arthropoda Insecta Coleoptera <NA> <NA> <NA> 6 5 3
#> 13: Arthropoda Insecta Coleoptera Staphylinidae <NA> <NA> 6 5 3
#> 14: Arthropoda Insecta Coleoptera Staphylinidae Staphylininae <NA> 6 5 3
#> 15: Arthropoda Insecta Coleoptera Staphylinidae Staphylininae Philonthus 6 5 3
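An alternative sketch, not part of the solution above: instead of expanding row by row, loop over the six ranks, keep the rows identified at least to that rank, blank the deeper ranks, then stack and sum. The names tax_cols, rollup and result are made up for this illustration, and it assumes, as the sample data does, that NAs only ever appear to the right of the last identified rank.

```r
library(data.table)

sample_table <- data.table(
  Phylum   = c("Arthropoda", "Arthropoda", "Arthropoda", "Arthropoda"),
  Class    = c("Arachnida", "Insecta", "Insecta", "Insecta"),
  Order    = c("Acariformes", "Coleoptera", "Coleoptera", "Coleoptera"),
  Family   = c(NA, "Staphylinidae", "Staphylinidae", "Staphylinidae"),
  Genus    = c(NA, "Staphylininae", "Staphylininae", "Staphylininae"),
  Species  = c(NA, NA, "Philonthus", "Philonthus"),
  Site     = c(5, 6, 6, 6),
  Distance = c(0, 0, 0, 5),
  N        = c(58, 1, 5, 3)
)

# Taxonomic ranks, ordered from least to most specific (hypothetical helper name)
tax_cols <- c("Phylum", "Class", "Order", "Family", "Genus", "Species")

rollup <- rbindlist(lapply(seq_along(tax_cols), function(k) {
  # keep the rows identified at least down to rank k (subsetting returns a new table)
  keep <- !is.na(sample_table[[tax_cols[k]]])
  lvl <- sample_table[keep]
  # blank out every rank deeper than k, so these rows describe the rank-k taxon
  deeper <- tax_cols[seq_along(tax_cols) > k]
  if (length(deeper) > 0) lvl[, (deeper) := NA_character_]
  lvl
}))

# Sum N within each taxon / site / distance combination
result <- rollup[, .(N = sum(N)), by = c(tax_cols, "Site", "Distance")]

# Sort by Site, then Distance
setorder(result, Site, Distance)
result
```

On the sample data this should give the same 15 taxon/site/distance combinations and N values as the output above. It does one subset per rank rather than one function call per input row, which may matter on larger tables.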