
How to easily generate/simulate example data with different groups for modelling

How can I easily generate (simulate) meaningful example data for modelling? For example, I would like to say: give me n rows of data for 2 groups whose sex distributions and mean ages differ by X and Y units, respectively. Is there a simple way to do this automatically? Are there any packages for it?

For example, what would be the simplest way for generating such data?

PS! Tidyverse solutions are especially welcome.

My best try so far is still quite a lot of code:

library(tidyverse)

n = 100 # total number of rows; each rnorm() call below draws a fixed share of it

d = bind_rows(
  #group A females
  tibble(group = "A",
         sex = "Female",
         age = rnorm(n*0.4, 50, 4)),
  #group B females
  tibble(group = "B",
         sex = "Female",
         age = rnorm(n*0.3, 45, 4)),
  #group A males
  tibble(group = "A",
         sex = "Male",
         age = rnorm(n*0.20, 60, 6)),
  #group B males
  tibble(group = "B",
         sex = "Male",
         age = rnorm(n*0.10, 55, 4)))

d %>% group_by(group, sex) %>% 
  summarise(n = n(),
            mean_age = mean(age))

  • There are lots of ways to sample from vectors and to draw from random distributions in R. For example, the data set you requested could be created like this:

    set.seed(69) # Makes samples reproducible
    df <- data.frame(groups = rep(c("A", "B"), each = 100),
                     sex = c(sample(c("M", "F"), 100, TRUE, prob = c(0.3, 0.7)),
                             sample(c("M", "F"), 100, TRUE, prob = c(0.5, 0.5))),
                     age = c(runif(100, 25, 75), runif(100, 50, 90)))

    And we can use the tidyverse to show it does what was expected:

    library(dplyr)

    df %>% 
      group_by(groups) %>% 
      summarise(age = mean(age),
                # count of males; equals the percentage here only because each group has 100 rows
                percent_male = length(which(sex == "M")))
    #> # A tibble: 2 x 3
    #>   groups   age percent_male
    #>   <chr>  <dbl>        <int>
    #> 1 A       49.4           29
    #> 2 B       71.0           50
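
If the goal is to shorten the question's own `bind_rows()` attempt, one option is to keep the design in a small parameter table and expand it into rows. The following is only a sketch, not a dedicated simulation package: the names `params`, `prop`, `mean_age`, `sd_age`, `size` and `sim` are made up for illustration, the shares, means and standard deviations are copied from the question, and `n <- 1000` and the seed are arbitrary assumptions.

```r
library(dplyr)
library(tidyr)
library(tibble)

set.seed(1)  # arbitrary seed, only to make the draw reproducible
n <- 1000    # total number of rows (assumed value)

# One row per group/sex cell: share of the total sample,
# mean age and SD of age (values taken from the question)
params <- tribble(
  ~group, ~sex,     ~prop, ~mean_age, ~sd_age,
  "A",    "Female", 0.4,   50,        4,
  "B",    "Female", 0.3,   45,        4,
  "A",    "Male",   0.2,   60,        6,
  "B",    "Male",   0.1,   55,        4
)

sim <- params %>%
  mutate(size = round(n * prop)) %>%                    # rows per cell
  rowwise() %>%
  mutate(age = list(rnorm(size, mean_age, sd_age))) %>% # one vector of ages per cell
  ungroup() %>%
  select(group, sex, age) %>%
  unnest(age)                                           # one row per simulated person

# Check the simulated cells against the parameter table
sim %>%
  group_by(group, sex) %>%
  summarise(n = n(), mean_age = mean(age), .groups = "drop")
```

Changing the design then means editing one row of `params` rather than adding another `tibble()` call. Note that the cell sizes here are fixed by `round(n * prop)`, as in the question, rather than sampled as in the `sample()` approach above.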