I'm working on psychological assessment data that was collected in a non-random sample design. My dataframe is formed of "sex" (male and female) and "level of education"("elementary", "high school", "college"). However, my empirical distribution is different from the true distribution.
I know that true parameter for sex is 0.7 female and 0.3 male. I know as well that the true parameter for schooling is elementary equals 0.5, high school = 0.3, and college equal 0.2
I would like to have a code that could "cut" (adjust?) my dataframe to match these characteristics. I know my final dataframe will have fewer participants than my current one. I'm wondering if a for / loop solution is duable in this case.
Dat:
df2 = data.frame(
sex = rep(c("m","f"),135),
schooling = c("elementary","highschool","college")
)
prop.table(table(df2$sex))
prop.table(table(df2$schooling))
You could weight your observations by your desired proportions, then use dplyr::slice_sample()
:
set.seed(13)
library(dplyr)
prop_sex <- c(f = 0.7, m = 0.3)
prop_school <- c(elementary = 0.5, highschool = 0.3, college = 0.2)
df3 <- df2 %>%
mutate(wt = prop_sex[sex] * prop_school[schooling]) %>%
slice_sample(prop = 0.6, weight_by = wt)
prop.table(table(df3$sex))
# f m
# 0.617284 0.382716
prop.table(table(df3$schooling))
# college elementary highschool
# 0.2716049 0.4259259 0.3024691
Depending on the level of precision you want, you could iterate until the proportions fall within a certain tolerance of your targets.
df2 <- mutate(df2, wt = prop_sex[sex] * prop_school[schooling])
max_iter <- 1000
tol <- 0.05
for (i in seq(max_iter)) {
df3 <- slice_sample(df2, prop = 0.6, weight_by = wt)
obs_sex <- prop.table(table(df3$sex))
obs_school <- prop.table(table(df3$schooling))
if (
max(abs(obs_sex - prop_sex[names(obs_sex)])) < tol &&
max(abs(obs_school - prop_school[names(obs_school)])) < tol
) break
if (i == max_iter) warning("max iterations reached")
}
prop.table(table(df3$sex))
# f m
# 0.6728395 0.3271605
prop.table(table(df3$schooling))
# college elementary highschool
# 0.2469136 0.4567901 0.2962963
You may also want to look into the survey or srvyr packages for working with weighted data.