I have the following code that selects (4 rows of iris x 1000) *100 and calculates the bias of each column.
library(SimDesign)
library(data.table)
do.call(rbind,lapply(1:100, function(x) {
bias(
setDT(copy(iris))[as.vector(sapply(1:1000, function(X) sample(1:nrow(iris),4)))][
, lapply(.SD, mean), by=rep(c(1:1000),4), .SDcols=c(1:4)][,c(2:5)],
parameter=c(5,3,2,1), #parameter is the true population value used to calculate bias
type='relative' #denotes the type of bias being calculated
)
}))
This takes 1000 samples of 4 rows, calculates the mean by sample #, giving me 1000 means. The bias for the 1000 means is found for each column, and then is done 99 more times giving me a distribution of bias estimates for each column. This is mimicking a random sampling design. However, I also want to do this for a stratified design. So I use splitstackshape
's stratified
function.
do.call(rbind,lapply(1:100, function(x) {
bias(
setDT(copy(iris))[as.vector(sapply(1:1000, function(X) stratified(iris,group="Species", size=1)))][
, lapply(.SD, mean), by=rep(c(1:1000),4), .SDcols=c(1:4)][,c(2:5)],
parameter=c(5,3,2,1),
type='relative'
)
}))
I would've thought that it is just a matter of swapping out the functions, but I keep on getting errors (i is invalid type (matrix)). Perhaps in future a 2 column matrix could return a list of elements of DT . I think it might be something related to setDT, but I'm not really sure how to fix it. Anybody know where I'm going wrong?
I've split into a couple of functions for you
library(SimDesign)
library(data.table)
library(splitstackshape)
n
stratified samples of size sampsize
and return column means of those samplesget_samples <- function(n, sampsize=4) {
rbindlist(lapply(1:n, function(x) {
splitstackshape::stratified(iris, group="Species",sampsize)[, id:=x]
}))[, lapply(.SD, mean), by=.(Species, id)]
}
y
such iterations of these samplesget_bias_distribution <- function(y=100, samples_per_iter=50, size_per_iter=4) {
rbindlist(lapply(1:y, function(y) {
samples = get_samples(samples_per_iter, sampsize=size_per_iter)[, id:=NULL]
samples[, as.list(bias(
estimate=.SD,parameter=c(5,3,2,1),type="relative")*100),
by=.(Species)][, iter:=y]
}))
}
get_bias_distribution()
Output:
Species Sepal.Length Sepal.Width Petal.Length Petal.Width iter
1: setosa -1.236667 22.61833 -26.70000 -39.69667 1
2: versicolor 46.476667 -11.99500 115.12833 16.82167 1
3: virginica 80.596667 -0.20000 180.21833 53.89000 1
4: setosa -1.513333 20.87000 -27.46167 -38.83667 2
5: versicolor 45.333333 -11.34833 112.84833 17.84500 2
---
296: versicolor 48.250000 -12.26833 113.37000 17.71167 99
297: virginica 77.366667 -2.87000 175.60000 53.07167 99
298: setosa -1.005000 22.67500 -27.02833 -39.69500 100
299: versicolor 47.921667 -10.28333 110.97833 16.86833 100
300: virginica 76.153333 -2.44000 174.46167 52.62167 100
stratified(iris,group="Species", size=1)
, you will get a 3 row data.table, because you are effectively selecting one row at random from each of the three Species Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 4.9 3.6 1.4 0.1 setosa
2: 6.3 2.5 4.9 1.5 versicolor
3: 7.7 2.8 6.7 2.0 virginica
sapply(1:1000, function(x)...)
, you get 5 x 1000 column matrix, where each column is contains 5 lists of length 3 .. Below, I'm showing you what this looks like if you did sapply(1:6, function(x)...)
[,1] [,2] [,3] [,4] [,5] [,6]
Sepal.Length numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
Sepal.Width numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
Petal.Length numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
Petal.Width numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
Species factor,3 factor,3 factor,3 factor,3 factor,3 factor,3
This is not really what you want, because you cannot then lapply
over these the way you then intended. What you want to do instead is use lapply(1:1000, function(x) ...)
to create a list of such 3-row datatables, and then bind them together (after adding an id
column to each one).