
Difference between if_else (returning too long a vector) and case_when when using dplyr


This is a totally constructed example. It is only meant to help me understand the conceptual differences.

I am running this code

library(dplyr)
library(palmerpenguins)
penguins %>% 
  group_by(species) %>% 
  filter(
    if_else(
      species == "Adelie", 
      if_else(
        n_distinct(island) > 1,
        row_number() == 1,
        row_number() == 2
      ),
      row_number() %in% 1:2
    ))

My thinking was that when the species is Adelie and there is more than one island within this species, it would return the first row, otherwise the second row. If the species is not Adelie, it would return the first two rows. However, I get this error:

Error in `filter()`:
ℹ In argument: `if_else(...)`.
ℹ In group 1: `species = Adelie`.
Caused by error in `if_else()`:
! `true` must have size 1, not size 152.

I do not completely understand this, because row_number() == 1 returns either FALSE or TRUE on a per-row basis, doesn't it?

I know I can run it with case_when like this:

penguins %>%
  group_by(species) %>%
  filter(case_when(
    species == "Adelie" & n_distinct(island) > 1 ~ row_number() == 1,
    species == "Adelie" ~ row_number() == 2, 
    .default = row_number() %in% 1:2 
  ))

And it works. I thought if_else and case_when were both vectorized, but I guess I'm missing something basic here. I'd be very grateful for any hint.


Solution

  • The true= and false= legs of if_else will be recycled to the length of the condition argument, but condition will not be recycled to the length of true and false. Recycling only ever makes something longer, never shorter.

    After recycling, the condition and the true and false legs must all have the same length. Here the condition n_distinct(island) > 1 has length 1 and is not recycled, whereas the true and false legs have one element per row of the group (152 for Adelie), so we get the error.

    We could replace the inner if_else with the following, where rep recycles the condition ourselves to the length of the legs:

    if_else(rep(n_distinct(island) > 1, n()), row_number() == 1, row_number() == 2)
    

    but what we really want here is if rather than if_else, since if is normally used when the condition is a scalar:

    if (n_distinct(island) > 1) row_number() == 1 else row_number() == 2
    

    or possibly

    row_number() == (if (n_distinct(island) > 1) 1 else 2)
    

    so we have

    penguins %>% 
      group_by(species) %>% 
      filter(
        if_else(
          species == "Adelie", 
          if (n_distinct(island) > 1) row_number() == 1 else row_number() == 2,
          row_number() %in% 1:2
        ))
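
The recycling behaviour described above can be checked without the penguins data. The following is only a minimal sketch: the names rows, scalar_cond and toy are made up for illustration, with rows standing in for row_number() inside one group and scalar_cond standing in for n_distinct(island) > 1.

```r
library(dplyr)

rows <- 1:4          # hypothetical stand-in for row_number() within a group
scalar_cond <- TRUE  # hypothetical stand-in for n_distinct(island) > 1

# Length-1 condition with length-4 legs: if_else() does not recycle the
# condition, so this call errors. tryCatch() records that it failed.
failed <- tryCatch(
  {
    if_else(scalar_cond, rows == 1, rows == 2)
    FALSE
  },
  error = function(e) TRUE
)

# Recycling the condition by hand makes all three sizes agree.
via_rep <- if_else(rep(scalar_cond, length(rows)), rows == 1, rows == 2)

# Base if/else takes a scalar condition and returns the chosen leg whole.
via_if <- if (scalar_cond) rows == 1 else rows == 2

# The same pattern inside a grouped filter(), on a made-up tibble.
toy <- tibble(
  species = c("A", "A", "A", "B", "B", "B"),
  island  = c("x", "y", "x", "z", "z", "z")
)

result <- toy %>%
  group_by(species) %>%
  filter(
    if_else(
      species == "A",
      if (n_distinct(island) > 1) row_number() == 1 else row_number() == 2,
      row_number() %in% 1:2
    )
  )
```

Here the outer if_else works because species == "A" already has one element per row of the group, so only the scalar inner condition needs the plain if.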