rfunctionbirthday-paradox

birthday paradox function in R


I'm a beginner in R and am trying to create a birthday paradox function and managed to reach this point, and the result is approximately 0.5, as expected.

k <- 23 
sims <- 1000 
event <- 0 
for (i in 1:sims) {
  days <- sample(1:365, k, replace = TRUE)
  days.unique <- unique(days) 
  if (length(days.unique) < k) {
    event <- event + 1 } 
  answer <- event/sims}
  answer

However, when I tried to put that into a function, the result was always 0.001. Here is the code:

bdayfunction<- function(k){
 sims <- 1000 
 event <- 0 
 for (i in 1:sims) {
  days <- sample(1:365, k, replace = TRUE)
  days.unique <- unique(days) 
  if (length(days.unique) < k) {
    event <- event + 1 } 
 answer <- event/sims
 return (answer)
 }
}

What have I done wrong?


Solution

  • Your return is not in the right place: it is in the loop (the same holds for your answer calculation by the way).

    This works:

    bdayfunction<- function(k){
      sims <- 1000 
      event <- 0 
      for (i in 1:sims) {
        days <- sample(1:365, k, replace = TRUE)
        days.unique <- unique(days) 
        if (length(days.unique) < k) {
          event <- event + 1 }   
      }
      answer <- event/sims
      return (answer)
    }
    

    In R, you can make use of libraries that allows you to do grouping operation. The two main ones are data.table and dplyr. Here, instead of doing a loop, you could try to create a long data.frame with all your simulations, to then calculate the unique number of days per simulation and then count the number of occurrence below k. With dplyr:

    library(dplyr)
    
    bdayfunction_dplyr <- function(k){  
      df <- data.frame(sim = rep(1:sims,each = k),
                       days = sample(1:365, k*sims, replace = TRUE))
      return(
        df %>%
        group_by(sim) %>%
        summarise(plouf = length(unique(days))< k) %>%
        summarise(out = sum(plouf)/1000) %>%
        pull(out)
        )  
    }
    

    In data.table:

    library(data.table)
    
    bdayfunction_data.table <- function(k){
      dt <- data.table(sim = rep(1:sims,each = k),
                       days = sample(1:365, k*sims, replace = TRUE))
    
      return(dt[,length(unique(days)),sim][V1<k,.N/1000])
    }
    

    You can test that they provide the same result:

    set.seed(123)
    bdayfunction(23)
    [1] 0.515
    
    set.seed(123)
    bdayfunction_dplyr(23)
    [1] 0.515
    
    set.seed(123)
    bdayfunction_data.table(23)
    [1] 0.515
    

    Now lets compare the speed:

    library(microbenchmark)
    
    microbenchmark(initial = bdayfunction(23),
                   dplyr = bdayfunction_dplyr(23),
                   data.table = bdayfunction_data.table(23))
    
    Unit: milliseconds
           expr     min       lq      mean  median       uq      max neval cld
        initial  7.3252  7.56900  8.435564  7.7441  8.15995  24.7681   100  a 
          dplyr 12.3488 12.96285 16.846118 13.3777 14.71370 295.6716   100   b
     data.table  5.9186  6.24115  6.540183  6.4494  6.75640   8.1466   100  a 
    

    You see that data.table is slightly faster than your initial loop, and shorter to write.