r, dplyr, tapply

What is the Base R equivalent of this dplyr group_by code?


The R4DS book has the following code block:

library(tidyverse)
by_age2 <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

Is there a simple equivalent to this code in base R? The filter can be replaced with gss_cat[!is.na(gss_cat$age), ], but after that I run into trouble. It's clearly a job for by, tapply, or aggregate, but I haven't found the right way. by(gss_2, with(gss_2, list(age, marital)), length), with gss_2 being the filtered data, is a step in the right direction, but the output is awful.


Solution

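  • One option is to apply proportions to the output of table, after subsetting to drop the rows with a missing age (complete.cases) and selecting the two columns of interest. Note that proportions was added in R 4.0.1; prop.table is the equivalent in older versions.

    The data comes from the forcats package, so load the package and get the data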

    library(forcats)
    data(gss_cat)
    

    Use table/proportions as described above. The margin of 1 makes each row (each age) sum to 1

    by_age2_base <- proportions(table(subset(gss_cat, complete.cases(age), 
           select = c(age, marital))), 1)
    

    -output

    head(by_age2_base, 3)
        marital
    age    No answer Never married   Separated    Divorced     Widowed     Married
      18 0.000000000   0.978021978 0.000000000 0.000000000 0.000000000 0.021978022
      19 0.000000000   0.939759036 0.000000000 0.012048193 0.004016064 0.044176707
      20 0.000000000   0.904382470 0.003984064 0.007968127 0.000000000 0.083665339
    

    -compare with the OP's output

    head(by_age2, 3)
    # A tibble: 3 x 4
    # Groups:   age [2]
        age marital           n   prop
      <int> <fct>         <int>  <dbl>
    1    18 Never married    89 0.978 
    2    18 Married           2 0.0220
    3    19 Never married   234 0.940 
    

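    If we need the output in 'long' format, convert the table to a data.frame with as.data.frame. The proportions end up in the Freq column, and the zero entries are dropped because count does not return empty combinations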

    by_age2_base_long <- subset(as.data.frame(by_age2_base), Freq > 0)
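
    The long table above carries only the proportion, while the dplyr result also keeps the count n. A minimal sketch of one way to get both, assuming a small made-up data frame (toy, hypothetical values) in place of gss_cat so that it runs without forcats:

    ```r
    # Toy stand-in for gss_cat (hypothetical values, for illustration only)
    toy <- data.frame(
      age     = c(18, 18, 18, 19, 19, NA),
      marital = factor(c("Never married", "Never married", "Married",
                         "Never married", "Divorced", "Married"))
    )

    # Count each age/marital pair; table() drops NA ages by default
    counts <- as.data.frame(table(toy[c("age", "marital")]),
                            responseName = "n")

    # count() omits empty combinations, so drop the zero rows
    counts <- counts[counts$n > 0, ]

    # Within-age proportion, the analogue of group_by(age) + mutate()
    counts$prop <- ave(counts$n, counts$age, FUN = function(x) x / sum(x))

    # Sort by age to mirror the ordering of the dplyr output
    counts <- counts[order(counts$age), ]
    counts
    ```

    Note that as.data.frame turns age into a factor here; as.integer(as.character(counts$age)) would convert it back if a numeric column is needed.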
    

    Another option is aggregate/ave (requires R >= 4.1.0 for the native pipe |> and the \(x) lambda shorthand)

    subset(gss_cat, complete.cases(age), select = c(age, marital)) |>
      {\(dat) aggregate(cbind(n = age) ~ age + marital,
                        data = dat, FUN = length)}() |>
      transform(prop = ave(n, age, FUN = \(x) x / sum(x)))
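
    For R versions before 4.1.0, the same aggregate/ave idea can be written without the native pipe or the lambda shorthand. This is only a sketch, and it assumes a small made-up data frame (toy, hypothetical values) in place of gss_cat so that it is self-contained:

    ```r
    # Toy stand-in for gss_cat (hypothetical values, for illustration only)
    toy <- data.frame(
      age     = c(18, 18, 18, 19, 19, NA),
      marital = factor(c("Never married", "Never married", "Married",
                         "Never married", "Divorced", "Married"))
    )

    # Step 1: drop rows with a missing age and keep the two columns
    dat <- toy[!is.na(toy$age), c("age", "marital")]

    # Step 2: count rows per age/marital pair; cbind(n = age) names the count column
    res <- aggregate(cbind(n = age) ~ age + marital, data = dat, FUN = length)

    # Step 3: divide each count by the total for its age
    res$prop <- ave(res$n, res$age, FUN = function(x) x / sum(x))

    # Sort by age to mirror the ordering of the dplyr output
    res <- res[order(res$age), ]
    res
    ```

    With the formula interface, aggregate only returns the combinations that actually occur, so no filtering of zero counts is needed here, and age keeps its numeric type.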