
Creating data.table rows for all levels of groups/supersets in a hierarchy


I have a taxonomic dataset that looks like the following:

library(data.table)

sample_table <- data.table(Phylum = c("Arthropoda", "Arthropoda", "Arthropoda", "Arthropoda"),
                              Class = c("Arachnida",   "Insecta",    "Insecta", "Insecta"),
                              Order = c("Acariformes",   "Coleoptera", "Coleoptera", "Coleoptera"),
                              Family = c(NA, "Staphylinidae", "Staphylinidae", "Staphylinidae"),
                              Genus = c(NA, "Staphylininae", "Staphylininae", "Staphylininae"),
                              Species = c(NA, NA, "Philonthus", "Philonthus"),
                              Site = c(5, 6, 6, 6),
                              Distance = c(0, 0, 0, 5),
                              N = c(58, 1, 5, 3))

# Phylum       Class      Order        Family           Genus           Species     Site     Distance     N
# Arthropoda   Arachnida  Acariformes  <NA>             <NA>            <NA>        5        0            58
# Arthropoda   Insecta    Coleoptera   Staphylinidae    Staphylininae   <NA>        6        0            1
# Arthropoda   Insecta    Coleoptera   Staphylinidae    Staphylininae   Philonthus  6        0            5
# Arthropoda   Insecta    Coleoptera   Staphylinidae    Staphylininae   Philonthus  6        5            3

Phylum, Class, Order, Family, Genus, and Species are hierarchical taxonomic ranks, ordered from broadest to narrowest. Site and Distance record where the specimens were caught. N is the number of specimens caught at a given site and distance in a given taxonomic group.

Row 2 (R2)'s N value should include R3's N value, since R2 is a superset that includes R3. However, it currently only includes specimens that couldn't be identified past the Genus level (where Species is NA).

I would like my dataset to resemble the following:

# Phylum       Class      Order        Family          Genus           Species     Site    Distance  N
# Arthropoda   <NA>       <NA>         <NA>            <NA>            <NA>        5       0         58
# Arthropoda   Arachnida  <NA>         <NA>            <NA>            <NA>        5       0         58
# Arthropoda   Arachnida  Acariformes  <NA>            <NA>            <NA>        5       0         58
# Arthropoda   <NA>       <NA>         <NA>            <NA>            <NA>        6       0         6
# Arthropoda   Insecta    <NA>         <NA>            <NA>            <NA>        6       0         6
# Arthropoda   Insecta    Coleoptera   <NA>            <NA>            <NA>        6       0         6
# Arthropoda   Insecta    Coleoptera   Staphylinidae   <NA>            <NA>        6       0         6
# Arthropoda   Insecta    Coleoptera   Staphylinidae   Staphylininae   <NA>        6       0         6
# Arthropoda   Insecta    Coleoptera   Staphylinidae   Staphylininae   Philonthus  6       0         5
# Arthropoda   <NA>       <NA>         <NA>            <NA>            <NA>        6       5         3
# Arthropoda   Insecta    <NA>         <NA>            <NA>            <NA>        6       5         3
# Arthropoda   Insecta    Coleoptera   <NA>            <NA>            <NA>        6       5         3
# Arthropoda   Insecta    Coleoptera   Staphylinidae   <NA>            <NA>        6       5         3
# Arthropoda   Insecta    Coleoptera   Staphylinidae   Staphylininae   <NA>        6       5         3
# Arthropoda   Insecta    Coleoptera   Staphylinidae   Staphylininae   Philonthus  6       5         3

As you can see, there is now a row for every taxonomic superset, for every site-distance pair. Each row's N counts the specimens identified exactly to that row's taxonomic level plus those identified to any finer level within it.

I have looked into using tidyr::complete() and a cross-join, but I'm not sure they're what I'm looking for. Unlike most examples I've been able to find, I want to complete my dataset without crossing taxonomic groups that are on separate branches, essentially by adding rows that right-fill taxonomic columns with NA (although there must be a better way than doing this manually). I'm also not sure how to use these strategies to fix my grouping/summing issues.

EDIT: I added a line to the sample dataset to reflect how Distance should be handled (i.e., the same as Site, where specimens sampled at different distances are not aggregated).


Solution

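  • A data.table solution, updated to reflect the edit to the question. f() repeats each row once per identified (non-NA) taxonomic level and blanks out the deeper levels in each copy, so every copy stands for one superset of the original row. Summing N by taxon, Site and Distance then gives the totals. It relies on the six taxonomic columns coming first and on NAs appearing only to the right of the last identified level.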

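    # The six taxonomic columns, ordered from broadest (Phylum) to narrowest (Species).
    cols <- colnames(sample_table)[1:6]
    
    # Expand a one-row data.table into one row per identified taxonomic level.
    f <- function(dt) {
      # Number of levels this row was identified to (non-NA taxonomic columns).
      i <- sum(!is.na(dt[,..cols]))
      if (i > 1) {
        # Repeat the row i times, then blank out the deeper levels so that
        # copy k keeps only its first k taxonomic columns.
        dt <- dt[rep(1L, i)]
        for (j in 2:i) set(dt, i = 1:(j - 1), j = j, NA)
      }
      dt
    }
    # Expand every row, drop the auto-named grouping column `nrow`,
    # then sum N per taxon, Site and Distance.
    sample_table[,f(.SD), 1:nrow(sample_table)][,nrow := NULL][
      ,.(N = sum(N)), Phylum:Distance
    ]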
    #>         Phylum     Class       Order        Family         Genus    Species  Site Distance     N
    #>         <char>    <char>      <char>        <char>        <char>     <char> <num>    <num> <num>
    #>  1: Arthropoda      <NA>        <NA>          <NA>          <NA>       <NA>     5        0    58
    #>  2: Arthropoda Arachnida        <NA>          <NA>          <NA>       <NA>     5        0    58
    #>  3: Arthropoda Arachnida Acariformes          <NA>          <NA>       <NA>     5        0    58
    #>  4: Arthropoda      <NA>        <NA>          <NA>          <NA>       <NA>     6        0     6
    #>  5: Arthropoda   Insecta        <NA>          <NA>          <NA>       <NA>     6        0     6
    #>  6: Arthropoda   Insecta  Coleoptera          <NA>          <NA>       <NA>     6        0     6
    #>  7: Arthropoda   Insecta  Coleoptera Staphylinidae          <NA>       <NA>     6        0     6
    #>  8: Arthropoda   Insecta  Coleoptera Staphylinidae Staphylininae       <NA>     6        0     6
    #>  9: Arthropoda   Insecta  Coleoptera Staphylinidae Staphylininae Philonthus     6        0     5
    #> 10: Arthropoda      <NA>        <NA>          <NA>          <NA>       <NA>     6        5     3
    #> 11: Arthropoda   Insecta        <NA>          <NA>          <NA>       <NA>     6        5     3
    #> 12: Arthropoda   Insecta  Coleoptera          <NA>          <NA>       <NA>     6        5     3
    #> 13: Arthropoda   Insecta  Coleoptera Staphylinidae          <NA>       <NA>     6        5     3
    #> 14: Arthropoda   Insecta  Coleoptera Staphylinidae Staphylininae       <NA>     6        5     3
    #> 15: Arthropoda   Insecta  Coleoptera Staphylinidae Staphylininae Philonthus     6        5     3
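
As an alternative to expanding the rows one at a time, the same totals can be built level by level. This is only a minimal sketch and not part of the answer above: rollup_taxa() and its taxa/keys arguments are hypothetical names introduced here. It assumes the taxonomic columns are character, are listed from broadest to narrowest, and contain NAs only to the right of the last identified level.

```r
library(data.table)

# Sample data from the question.
sample_table <- data.table(
  Phylum   = c("Arthropoda", "Arthropoda", "Arthropoda", "Arthropoda"),
  Class    = c("Arachnida", "Insecta", "Insecta", "Insecta"),
  Order    = c("Acariformes", "Coleoptera", "Coleoptera", "Coleoptera"),
  Family   = c(NA, "Staphylinidae", "Staphylinidae", "Staphylinidae"),
  Genus    = c(NA, "Staphylininae", "Staphylininae", "Staphylininae"),
  Species  = c(NA, NA, "Philonthus", "Philonthus"),
  Site     = c(5, 6, 6, 6),
  Distance = c(0, 0, 0, 5),
  N        = c(58, 1, 5, 3)
)

# Hypothetical helper (an assumption, not from the original answer):
# aggregate N separately at each taxonomic level and stack the results.
rollup_taxa <- function(dt, taxa, keys) {
  per_level <- lapply(seq_along(taxa), function(k) {
    # Rows identified at least as far as level k (row subsetting returns a copy,
    # so the input table is not modified below).
    lvl <- dt[!is.na(dt[[taxa[k]]])]
    # Blank out every taxonomic column deeper than level k.
    deeper <- taxa[-seq_len(k)]
    if (length(deeper) > 0L) lvl[, (deeper) := NA_character_]
    # Total N per taxon at this level, within each site/distance pair.
    lvl[, .(N = sum(N)), by = c(taxa, keys)]
  })
  rbindlist(per_level)
}

taxa <- c("Phylum", "Class", "Order", "Family", "Genus", "Species")
result <- rollup_taxa(sample_table, taxa, keys = c("Site", "Distance"))
```

The rows come out grouped by taxonomic level rather than by site, so sort the result afterwards if the order matters. The design trade-off is one grouped aggregation per level instead of one function call per input row, which may scale better on large tables, though that has not been measured here.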