r dataframe sample

Sample rows from a dataframe by id when some ids have more rows than others


This is very basic, but I couldn't find an answer online. I use R and have a dataset like this (but much larger):

set.seed(123)
id <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 6, 6, 6)
week <- c(1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 1, 1, 2, 1, 2, 3)
value <- rnorm(16, mean = 5, sd = 1)
mydf <- data.frame(id, week, value)

id refers to a particular person, and some individuals have more observations than others. I'd like to take a sample of individuals from the dataframe such that, for each sampled individual, all of that individual's rows are included in the sample. If I do

mydf[sample(nrow(mydf),3),]

I obviously just get three random rows, whereas I'd like to get, for instance:

 id  week  value
  1    1  4.439524
  1    2  4.769823
  1    3  6.558708
  4    1  6.224082
  6    1  5.110683
  6    2  4.444159
  6    3  6.786913

How can I sample rows with this constraint? Thank you in advance!


Solution

  • One option is to sample from the unique ids and then keep every row belonging to a sampled id:

    # set seed for reproducibility
    set.seed(958) 
    
    # Sample size
    n <- 3
    # Take simple random sample from the ids present
    sampled_ids <- sample(unique(mydf$id), n)
    
    # Keep only rows of the sampled IDs
    mydf[mydf$id %in% sampled_ids, ]
    
    #    id week    value
    # 4   2    1 5.070508
    # 5   2    2 5.129288
    # 12  5    1 5.359814
    # 13  5    2 5.400771
    # 14  6    1 5.110683
    # 15  6    2 4.444159
    # 16  6    3 6.786913
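
If this is needed more than once, the same two steps (sample the distinct ids, then keep every row of each sampled id) could be wrapped in a small function. This is only a sketch: the name sample_by_id and its arguments are made up for illustration and are not part of base R. It indexes into the ids with sample.int() instead of calling sample() on them directly, because sample(x, ...) treats a single number x >= 1 as 1:x, which would give wrong ids if only one distinct id were present.

```r
# Rebuild the example data from the question
set.seed(123)
id <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 6, 6, 6)
week <- c(1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 1, 1, 2, 1, 2, 3)
value <- rnorm(16, mean = 5, sd = 1)
mydf <- data.frame(id, week, value)

# Hypothetical helper (name and arguments are assumptions, not an existing API):
# draw n distinct ids at random and return all rows belonging to them.
sample_by_id <- function(df, id_col, n) {
  ids <- unique(df[[id_col]])
  if (n > length(ids)) {
    stop("n exceeds the number of distinct ids")
  }
  # Sample positions, then index, so a single numeric id is not expanded to 1:id
  sampled_ids <- ids[sample.int(length(ids), n)]
  # Keep every row whose id was sampled, preserving the original row order
  df[df[[id_col]] %in% sampled_ids, , drop = FALSE]
}

set.seed(958)
res <- sample_by_id(mydf, "id", 3)

# Sanity checks: exactly 3 ids, and each sampled id keeps all of its rows
stopifnot(length(unique(res$id)) == 3)
stopifnot(nrow(res) == sum(mydf$id %in% res$id))
```

The number of rows returned varies from draw to draw, since it depends on how many observations the sampled ids happen to have.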