I have a data set with two factors, Environments (4 levels) and Individuals (500 in each environment), and a response variable YD. As part of my analysis I have to randomly sample 100 individuals from each Environment in the following way:
I already solved this problem using several lines of code; however, I hope someone can help me write an R function for it, which would be very useful in other situations.
Here is an example data set with a similar structure:
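library(BLR)
library(reshape2)  # provides melt()
data(wheat)        # loads the matrix Y used below
Data <- melt(Y)    # long format: one row per individual and environment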
colnames(Data) <- c('Individuals','Environments','YD')
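If the BLR package is not at hand, a stand-in with the same three columns can be built in base R. This is only a sketch: the values are random noise, and the name ToyData, the seed and the dimensions (4 environments, 500 individuals each, as described in the question) are assumptions for illustration, not the wheat data.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n_ind <- 500  # individuals per environment, as described in the question
n_env <- 4    # number of environments

# every individual appears once in every environment
ToyData <- expand.grid(Individuals  = seq_len(n_ind),
                       Environments = seq_len(n_env))
ToyData$YD <- rnorm(nrow(ToyData))  # made-up response values

str(ToyData)
```

Any function that takes Data as its first argument should accept ToyData in the same way, since the column names match.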
Updated answer:
I just wrapped the answer in a function. Note that it is valid only for exactly 4 levels.
colnames(Data)<-c("Individuals","Environments","YD") #removed spaces from names
myfun <- function(DF, samplefrom, samplelevels, sampletype, samplesize)
{
  ids <- unique(DF[[samplefrom]])  # pool of individuals to draw from
  if(sampletype == "per1")  # one sample shared by all 4 levels
  {
    Env1 <- sample(ids, samplesize)
    Env2 <- Env3 <- Env4 <- Env1
  }
  if(sampletype == "per4")  # a different, non-overlapping sample for each level
  {
    Env1 <- sample(ids, samplesize)
    Env2 <- sample(ids[!ids %in% Env1], samplesize)
    Env3 <- sample(ids[!ids %in% c(Env1, Env2)], samplesize)
    Env4 <- sample(ids[!ids %in% c(Env1, Env2, Env3)], samplesize)
  }
  if(sampletype == "per2")  # two samples, each shared by 2 levels
  {
    Env1 <- sample(ids, samplesize)
    Env2 <- Env1
    Env3 <- sample(ids[!ids %in% Env1], samplesize)
    Env4 <- Env3
  }
  # sample() shuffles the environments, so the samples of individuals
  # are assigned to them in random order
  ret <- do.call(rbind, mapply(function(ind, env) {
      df <- DF[DF[[samplelevels]] == env, ]  # rows of DF for this environment
      df[df[[samplefrom]] %in% ind, ]        # keep only the sampled individuals
    },
    env = as.list(sample(unique(DF[[samplelevels]]))),
    ind = list(Env1, Env2, Env3, Env4),
    SIMPLIFY = FALSE))
  return(ret)
}
myfun(Data, "Individuals", "Environments", "per1", 2)
# Individuals Environments YD
#21 13954 1 0.6658681
#345 457982 1 -1.1022770
#620 13954 2 -0.4888968
#944 457982 2 0.6026167
#1219 13954 4 -0.7183965
#1543 457982 4 0.4881141
#1818 13954 5 0.2660623
#2142 457982 5 -2.0626073
myfun(Data, "Individuals", "Environments", "per2", 2)
# Individuals Environments YD
#25 15292 1 -1.1272386
#248 373045 1 -0.6659416
#624 15292 2 -0.2362053
#847 373045 2 0.5778210
#1260 62150 4 1.2077921
#1654 1541043 4 1.1406084
#1859 62150 5 -0.3358584
#2253 1541043 5 0.3897426
myfun(Data, "Individuals", "Environments", "per4", 2)
# Individuals Environments YD
#106 85786 1 1.4480500
#567 3830162 1 -1.8052577
#1029 1301802 2 0.2737786
#1043 1410845 2 1.0617118
#1630 1302304 4 0.6673241
#1678 1766332 4 -0.0451913
#1871 65315 5 -0.0597450
#2336 2621166 5 2.5590801
Update 2: some comments
mapply applies a function to several arguments in parallel, element by element. Here the function takes two arguments, ind and env. It 1) subsets the data frame by env and 2) subsets the result by ind, where env is an environment and ind is the sample of individuals (Env1, ...) previously drawn in myfun. The arguments handed to mapply are env: [1, 2, 3, 4] and ind: [Env1, Env2, Env3, Env4]. mapply takes env = 1 with ind = Env1, then env = 2 with ind = Env2, and so on, and returns the results (the necessary subsets) in a list. do.call(rbind, ...) joins that list into a single data frame.
P.S. Note that because sample is used, env can be [1, 2, 3, 4] or [2, 4, 3, 1] or any other order, so Env1 is not necessarily paired with env = 1 but with whichever of the environments comes first in the shuffled order, and so on.
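The pairing described above can be seen on a very small case. The following is a minimal sketch with a made-up data frame (toy, envs and samps are hypothetical names, and the environments are deliberately not shuffled so the pairing is easy to follow):

```r
# toy data: 3 individuals measured in 2 environments (made-up values)
toy <- data.frame(Individuals  = rep(c("a", "b", "c"), times = 2),
                  Environments = rep(c(1, 2), each = 3),
                  YD           = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

envs  <- list(1, 2)                 # one environment per call
samps <- list(c("a", "b"), c("c"))  # individuals paired with each environment

# mapply calls the function with (samps[[1]], envs[[1]]),
# then with (samps[[2]], envs[[2]])
pieces <- mapply(function(ind, env) {
    df <- toy[toy$Environments == env, ]  # 1) subset by environment
    df[df$Individuals %in% ind, ]         # 2) subset by individuals
  },
  ind = samps, env = envs, SIMPLIFY = FALSE)

result <- do.call(rbind, pieces)  # stack the list of subsets into one data frame
print(result)
```

Here pieces is a list of two data frames (individuals a and b in environment 1, individual c in environment 2), and result is their row-wise combination.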
Update 3 and 4: a function for a different number of levels
No_different_samples is the number of different samples you wish to take; it defaults to the number of levels in samplelevels (i.e. a different sample for every level). grouping says how many levels share each of those samples. The function throws an error if No_different_samples does not divide the number of levels evenly: if you want 3 different samples from a population with 4 levels (as in your Data), it throws an error; you have to select 1, 2 or 4.
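The idea behind taking several different samples and sharing them between levels can be sketched on its own. This is an illustration under assumed values (the IDs 1:20, the seed, and the names shuffled, samples and per_level are all hypothetical): the pool is shuffled once, consecutive slices of it become the non-overlapping samples, and rep() repeats each sample for as many levels as grouping says.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

ids        <- 1:20     # hypothetical individual IDs
samplesize <- 3
n_samples  <- 2        # plays the role of No_different_samples
grouping   <- c(3, 1)  # first sample for 3 levels, second sample for 1 level

shuffled <- sample(ids)  # one shuffle of the whole pool

# consecutive slices of the shuffled pool cannot overlap
samples <- lapply(seq_len(n_samples), function(i) {
  shuffled[((i - 1) * samplesize + 1):(i * samplesize)]
})

# repeat each sample once per level that should share it
per_level <- rep(samples, times = grouping)

length(per_level)                              # one entry per level
length(intersect(samples[[1]], samples[[2]]))  # number of shared individuals
```

Because the slices come from a single shuffle, no individual can end up in two different samples, which is the same guarantee the %in% filtering gives in the 4-level version.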
myfun2 <- function(DF, samplefrom, samplelevels,
                   No_different_samples = NULL, grouping = NULL, samplesize)
{
  samp <- sample(unique(DF[[samplefrom]]))  # shuffled pool of individuals
  levs <- unique(DF[[samplelevels]])
  if(is.null(No_different_samples)) No_different_samples <- length(levs)
  if(length(levs) %% No_different_samples)
    stop("No_different_samples must divide the number of levels evenly")
  # by default, every sample is shared by the same number of levels
  if(is.null(grouping))
    grouping <- rep(length(levs) / No_different_samples, No_different_samples)
  if(length(samp) < No_different_samples * samplesize)
    stop("can't take a sample this large from the population")
  # consecutive slices of the shuffled pool give non-overlapping samples
  ls_diffr_samps <- vector("list", length = No_different_samples)
  for(i in seq_len(No_different_samples))
  {
    ls_diffr_samps[[i]] <- samp[(i * samplesize - (samplesize - 1)) : (i * samplesize)]
  }
  # repeat each sample for as many levels as grouping says
  list_samples <- rep(ls_diffr_samps, times = grouping)
  ret <- do.call(rbind, mapply(function(ind, env) {
      df <- DF[DF[[samplelevels]] == env, ]  # rows of DF for this environment
      df[df[[samplefrom]] %in% ind, ]        # keep only the sampled individuals
    },
    env = as.list(sample(levs)), ind = list_samples,
    SIMPLIFY = FALSE))
  return(ret)
}
myfun2(Data, "Individuals", "Environments", 1, 4, 2) #same sample for all
myfun2(Data, "Individuals", "Environments", 2, c(2, 2), 2) #same sample per 2
myfun2(Data, "Individuals", "Environments", 2, c(3, 1), 2) #same sample for 3
myfun2(Data, "Individuals", "Environments", 4, c(1, 1, 1, 1), 2) #different sample for all