The R4DS book has the following code block:
library(tidyverse)
by_age2 <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))
Is there a simple equivalent to this code in base R? The filter can be replaced with gss_cat[!is.na(gss_cat$age),], but after that I run into trouble. It's clearly a job for by, tapply, or aggregate, but I've not been able to find the right way. by(gss_2, with(gss_2, list(age, marital)), length) is a step in the right direction, but the output is awful.
We could use proportions on the table output after subsetting to remove the NA rows (complete.cases) and selecting the columns.
The data is from the forcats package, so load the package and get the data:
library(forcats)
data(gss_cat)
Use table/proportions as mentioned above:
by_age2_base <- proportions(table(subset(gss_cat, complete.cases(age),
select = c(age, marital))), 1)
-output
head(by_age2_base, 3)
    marital
age    No answer Never married   Separated    Divorced     Widowed     Married
  18 0.000000000   0.978021978 0.000000000 0.000000000 0.000000000 0.021978022
  19 0.000000000   0.939759036 0.000000000 0.012048193 0.004016064 0.044176707
  20 0.000000000   0.904382470 0.003984064 0.007968127 0.000000000 0.083665339
-compare with the OP's output
head(by_age2, 3)
# A tibble: 3 x 4
# Groups: age [2]
    age marital           n   prop
  <int> <fct>         <int>  <dbl>
1    18 Never married    89 0.978
2    18 Married           2 0.0220
3    19 Never married   234 0.940
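To see what the table/proportions step does without loading forcats, here is a minimal sketch on a made-up data frame. The name toy and its values are hypothetical and only stand in for gss_cat. The point is that margin = 1 divides each cell by its row total, which plays the role of group_by(age) followed by n / sum(n).

```r
# Toy data standing in for gss_cat (hypothetical values, not the real data)
toy <- data.frame(
  age     = c(18, 18, 18, 19, 19, NA),
  marital = c("Never married", "Never married", "Married",
              "Never married", "Married", "Married")
)

# Drop rows with a missing age, keep the two columns, and cross-tabulate
tab <- table(subset(toy, !is.na(age), select = c(age, marital)))

# margin = 1 divides every cell by its row total (the count for that age)
prop_tab <- proportions(tab, margin = 1)

# Each age group should now sum to 1
rowSums(prop_tab)
```

proportions() was added in R 4.0.1; on older versions prop.table() takes the same arguments.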
If we need the output in 'long' format, convert the table to a data.frame with as.data.frame:
by_age2_base_long <- subset(as.data.frame(by_age2_base), Freq > 0)
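Note that the long result above holds the proportions in the Freq column and no longer carries the counts. If both n and prop are wanted, as in the dplyr output, one possible approach is to convert the count table and the proportion table separately and merge them. This is a sketch on hypothetical toy data (toy, counts, props and long are illustrative names), not the only way to do it.

```r
# Toy data standing in for gss_cat (hypothetical values, not the real data)
toy <- data.frame(
  age     = c(18, 18, 18, 19, 19, NA),
  marital = c("Never married", "Never married", "Married",
              "Never married", "Never married", "Married")
)
tab <- table(subset(toy, !is.na(age), select = c(age, marital)))

# as.data.frame() on a table gives one row per cell;
# responseName sets the name of the value column (default "Freq")
counts <- as.data.frame(tab, responseName = "n")
props  <- as.data.frame(proportions(tab, 1), responseName = "prop")

# Join on the grouping columns and drop combinations that never occur
long <- subset(merge(counts, props, by = c("age", "marital")), n > 0)
long <- long[order(long$age), ]
long
```

as.data.frame() turns age into a factor here; as.integer(as.character(long$age)) would convert it back if a numeric column is needed.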
Another option is aggregate/ave (requires R 4.1.0 or later for the native pipe |> and the \(x) lambda shorthand):
subset(gss_cat, complete.cases(age), select = c(age, marital)) |>
{\(dat) aggregate(cbind(n = age) ~ age + marital,
data = dat, FUN = length)}() |>
transform(prop = ave(n, age, FUN = \(x) x/sum(x)))
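For R versions before 4.1.0, the same aggregate/ave idea can be written without the native pipe or the lambda shorthand. The sketch below uses hypothetical toy data and a helper column of ones (n) that is summed per group. That is a slightly different way of counting than the length call above, chosen here only to keep the formula simple.

```r
# Toy data standing in for gss_cat (hypothetical values, not the real data)
toy <- data.frame(
  age     = c(18, 18, 18, 19, 19, NA),
  marital = c("Never married", "Never married", "Married",
              "Never married", "Never married", "Married")
)
dat <- subset(toy, !is.na(age))

# Helper column of ones; summing it per (age, marital) pair gives the count
dat$n <- 1
counts <- aggregate(n ~ age + marital, data = dat, FUN = sum)

# ave() applies FUN within each age group and returns a vector
# aligned with the rows of counts
counts$prop <- ave(counts$n, counts$age, FUN = function(x) x / sum(x))

counts <- counts[order(counts$age, counts$marital), ]
counts
```

aggregate() only returns combinations that occur in the data, so no Freq > 0 style filter is needed afterwards.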