I am trying to implement an algorithm for sampling in several stages where only the final size of the sample is known.
Here is an example of the structure of my sampling frame. Where:
Then, the algorithm has the following steps: Given a sample size $n$
Because
  cluster total_households group  Probability
1  173494               13     2 4.055410e-01
2  173495               19     5 4.176953e-02
3  173496               22     5 4.176953e-02
4  173497               21     5 4.176953e-02
5  173498               18     5 4.176953e-02
6  173499               27     7 6.775638e-05
7  173500               15     4 5.020529e-01
8  173501               19     5 4.176953e-02
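The loop further down refers to two objects, `data` and `groups`, that are never defined in the post. As a minimal sketch, they could be built from the rows printed above as follows. This assumes that `Probability` is a group-level weight repeated on every cluster row, and that `groups` holds one row per group with that weight in a column named `P` (the name the loop uses). Only the eight rows shown are included, so the weights here do not sum to 1.

```r
# Rows copied from the printed frame above.
data <- data.frame(
  cluster          = 173494:173501,
  total_households = c(13, 19, 22, 21, 18, 27, 15, 19),
  group            = c(2, 5, 5, 5, 5, 7, 4, 5),
  Probability      = c(4.055410e-01, 4.176953e-02, 4.176953e-02, 4.176953e-02,
                       4.176953e-02, 6.775638e-05, 5.020529e-01, 4.176953e-02)
)

# One row per group. ASSUMPTION: the group weight P is the Probability value
# that is repeated on each cluster row of that group.
groups <- unique(data[, c("group", "Probability")])
names(groups)[2] <- "P"
groups <- groups[order(groups$group), ]
rownames(groups) <- NULL
```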
I want to implement this algorithm in R. I know the sampling package offers a multistage function, but it does not work for my case because I must specify the number of clusters and groups before running it. My programming skills are limited. I've been trying to do something with a while loop, but I think I'm far from the correct result.
require(dplyr) # to use pipes in the code
n_sample = 844
group = NULL
total = NULL
cluster = NULL
total_households = NULL
total = 0
i = 1
while(total < n_sample){
  group[i] = groups[sample(nrow(groups),size = 1,prob = groups$P),c("group")]
  total_households = data[data$group==group[i],] %>%
    sample_n(size=1) %>%
    select(total_households)
  cluster[i] = data[data$group==group[i],] %>%
    sample_n(size=1) %>%
    select(cluster) %>% as.numeric()
  data = data[data$cluster!=cluster[i],]
  total = total+total_households
  i = i+1
}
You are pretty close to what you want to achieve (leaving aside the tidiness of code and focusing on numbers):
Firstly, let's correct the while loop (2 modifications):
while(total < n_sample){
  group[i] = groups[sample(nrow(groups),size = 1,prob = groups$P),c("group")]
  total_households = data[data$group==group[i],] %>%
    sample_n(size=1) %>%
    select(total_households) %>% as.numeric() # Mod_1
  cluster[i] = data[data$group==group[i],] %>%
    sample_n(size=1) %>%
    select(cluster) %>% as.numeric()
  data = data[data$cluster!=cluster[i],]
  total = total+ (total_households*0.25) # Mod_2
  i = i+1
}
Note that you will end up with total > n, but you can always make it equal to n by reducing the number of households taken from the last cluster in the list.
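The loop and the final adjustment can be put together roughly as in the sketch below. It is not a definitive implementation: `draw_clusters` and its `frac` argument are hypothetical names, the 0.25 fraction is simply carried over from Mod_2, and the takes are left unrounded because the post does not say how to round. One deliberate difference from the loop above: the loop calls `sample_n` twice, so the household count and the cluster id can come from two different clusters, whereas this sketch draws a single row per iteration and reads both values from it. It also assumes every group present in `data` has a row in `groups`.

```r
# Hypothetical helper, base R only.
draw_clusters <- function(data, groups, n_sample, frac = 0.25) {
  picked <- data[0, ]    # selected cluster rows, in draw order
  take   <- numeric(0)   # households taken from each selected cluster
  total  <- 0
  while (total < n_sample && nrow(data) > 0) {
    # Only groups that still have clusters left can be drawn.
    avail <- groups[groups$group %in% data$group, , drop = FALSE]
    g <- avail$group[sample.int(nrow(avail), size = 1, prob = avail$P)]
    # Draw ONE row, so the cluster id and household count belong together.
    cand <- which(data$group == g)
    idx  <- cand[sample.int(length(cand), size = 1)]
    this_take <- data$total_households[idx] * frac   # ASSUMPTION: 25% per cluster
    picked <- rbind(picked, data[idx, ])
    take   <- c(take, this_take)
    total  <- total + this_take
    data   <- data[-idx, ]                           # sample without replacement
  }
  # Reduce the last cluster's take so the takes sum to n_sample exactly.
  if (total > n_sample && length(take) > 0) {
    take[length(take)] <- take[length(take)] - (total - n_sample)
  }
  picked$take <- take
  picked
}
```

A call such as `draw_clusters(data, groups, n_sample = 844)` would return the selected clusters with a `take` column. If the frame runs out of clusters before reaching `n_sample`, the function returns what it has, so the sum of `take` is worth checking.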
Secondly, an important thing to take into consideration is that the group probabilities should sum to 1 throughout the algorithm.
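One way to keep that property, sketched under the assumption that `groups` has columns `group` and `P` and `data` has a `group` column, is to drop any group whose clusters have all been removed and rescale the remaining weights before each draw. `renormalise_groups` is a hypothetical helper name. R's `sample()` accepts weights that do not sum to one, so the rescaling mainly makes the effective probabilities explicit. Dropping empty groups is the part that prevents drawing a group with no clusters left.

```r
# Hypothetical helper: keep only groups that still have clusters in `data`
# and rescale their weights so that they sum to 1.
renormalise_groups <- function(groups, data) {
  live <- groups[groups$group %in% data$group, , drop = FALSE]
  live$P <- live$P / sum(live$P)
  live
}
```

Inside the while loop this could be called as `groups_now = renormalise_groups(groups, data)` before sampling a group from `groups_now`.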