r, dplyr, tidyr, conditional-operator

R - Sum rows from different column based on condition in grouped values


I have a large dataset of different sites inside and outside the Natura 2000 network.

An example:

df

Name  Total_Surface  N2000  SurfaceN2000
A     1              Yes    0.5
B     5              No     NA
C     11             No     NA
D     10             Yes    5

I want to have the sum of surfaces grouped by the N2000 factor (one total surface for "Yes", another for "No"). However, if the site is inside the N2000 network (N2000=Yes), I want to take the data from the column "SurfaceN2000".

How can I compute a sum that takes its values from Total_Surface or SurfaceN2000 depending on the N2000 factor?

I know I could create a column with the sum of Total_Surface and another one with SurfaceN2000 but I was wondering if there was a more automatic solution.

I tried a number of things but they didn't work. The last thing I tried was

df %>% 
  group_by(N2000) %>%
  summarise(surface=
             case_when(N2000== "No" ~ sum(Total_Surface,na.rm = TRUE), 
                       N2000== "Yes" ~ sum(SurfaceN2000,na.rm = TRUE)))

I also tried filter() and rbind(), but neither worked.

I was expecting an output table like:

N2000  Surface
Yes    5.5
No     16

Thank you very much in advance! :)


Solution

  • Consider pivoting your data (according to a more sophisticated rule than demonstrated here).

    tidyr::pivot_longer(d0, cols = c(Total_Surface, SurfaceN2000)) |>
      dplyr::summarise(Surface = sum(value, na.rm = TRUE), .by = c(N2000, name))
    
    # A tibble: 4 × 3
      N2000 name          Surface
      <chr> <chr>           <dbl>
    1 Yes   Total_Surface    11  
    2 Yes   SurfaceN2000      5.5
    3 No    Total_Surface    16  
    4 No    SurfaceN2000      0  
    

    You could add a filter to return the desired subset, that is, the SurfaceN2000 row for "Yes" and the Total_Surface row for "No".


    Data.

    > dput(d0)
    structure(list(Name = c("A", "B", "C", "D"), Total_Surface = c(1L, 
    5L, 11L, 10L), N2000 = c("Yes", "No", "No", "Yes"), SurfaceN2000 = c(0.5, 
    NA, NA, 5)), class = "data.frame", row.names = c(NA, -4L))
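
The filter step mentioned above is not spelled out in the answer, so here is a minimal sketch of what it might look like, using the same d0 data. It assumes dplyr >= 1.1.0 (needed for the `.by` argument); the object names `by_pivot` and `direct` are only illustrative. The second half shows an alternative that is not part of the original answer: picking the column row by row with `if_else()` before summing, which avoids the pivot altogether.

```r
library(dplyr)
library(tidyr)

# Same data as the dput() output above
d0 <- data.frame(
  Name = c("A", "B", "C", "D"),
  Total_Surface = c(1L, 5L, 11L, 10L),
  N2000 = c("Yes", "No", "No", "Yes"),
  SurfaceN2000 = c(0.5, NA, NA, 5)
)

# Approach 1: pivot, sum per (N2000, name), then keep only the
# combination that matters for each N2000 level
by_pivot <- d0 |>
  pivot_longer(cols = c(Total_Surface, SurfaceN2000)) |>
  summarise(Surface = sum(value, na.rm = TRUE), .by = c(N2000, name)) |>
  filter(
    (N2000 == "Yes" & name == "SurfaceN2000") |
      (N2000 == "No" & name == "Total_Surface")
  ) |>
  select(N2000, Surface)

# Approach 2 (assumed alternative, not from the answer): choose the
# source column per row first, then sum by group
direct <- d0 |>
  mutate(
    value = if_else(N2000 == "Yes", SurfaceN2000, as.numeric(Total_Surface))
  ) |>
  summarise(Surface = sum(value, na.rm = TRUE), .by = N2000)

by_pivot
direct
```

Both objects should hold one row per N2000 level, matching the table the question asks for (5.5 for "Yes", 16 for "No"). The attempt in the question returns more than one row per group because `case_when()` is evaluated on the whole N2000 vector of each group, so it yields one value per row rather than a single summary value.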