This question is a more specific and simplified version of this one.
The dataset I'm using is too large for a single lm or speedlm calculation.
I want to split my dataset into smaller pieces, but in doing so one (or more) of the factor columns ends up containing only a single level.
The code below is the minimum needed to reproduce my problem. At the bottom of the question I have put my testing script for those interested.
library(speedglm)
iris$Species <- factor(iris$Species)
i <- iris[1:20,]
summary(i)
speedlm(Sepal.Length ~ Sepal.Width + Species , i)
This gets me the following error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
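A quick way to see what the error is complaining about is to compare the levels the factor declares with the levels that actually occur in the subset. This is only a sketch using base R. It assumes (as the error message suggests, but not confirmed here) that the fitting code works with the observed levels only.

```r
# Same 20-row subset as above (base R only, speedglm is not needed here)
i <- iris[1:20, ]

# The factor still declares all three levels...
declared <- nlevels(i$Species)
print(levels(i$Species))

# ...but only one of them is actually observed in these rows
observed <- nlevels(droplevels(i$Species))
print(table(i$Species))
print(observed)
```

Here declared is 3 while observed is 1, so re-running factor() on the full iris$Species column does not change what the 20-row subset contains.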
I have tried to factorize iris$Species, but without success. I really don't have a clue how to fix this.
How can I include Species in the model (without increasing the sample size)?
Edit:
I know I only have one level ("setosa"), but I still need it included in the linear model because I will eventually update the model with more levels, as seen in the example script below.
For those interested, here is an example script of what I will use for my actual dataset:
library(speedglm)
testfunction <- function(start.i, end.i) {
return(iris[start.i:end.i,])
}
lengthdata <- nrow(iris)
stepsize <- 20
## attempt to factor
iris$Species <- factor(iris$Species)
## Creates the iris dataset in split parts
start.i <- seq(0, lengthdata, stepsize)
end.i <- pmin(start.i + stepsize, lengthdata)
dat <- Map(testfunction, start.i + 1, end.i)
## Loops through the split iris data
for (i in dat) {
if (!exists("lmfit")) {
lmfit <- speedlm(Sepal.Length ~ Sepal.Width + Species , i)
} else if (!exists("lmfit2")) {
lmfit2 <- updateWithMoreData(lmfit, i)
} else {
lmfit2 <- updateWithMoreData(lmfit2, i)
}
}
print(summary(lmfit2))
There might be a better way, but if you reorder your rows, each split will contain more levels and will therefore not cause the error. I used a random order, but you might prefer a more systematic approach.
library(speedglm)
testfunction <- function(start.i, end.i) {
return(iris.r[start.i:end.i,])
}
lengthdata <- nrow(iris)
stepsize <- 20
## attempt to factor
iris$Species <- factor(iris$Species)
## Random order
set.seed(1)
iris.r <- iris[sample(nrow(iris)),]
## Creates the iris dataset in split parts
start.i <- seq(0, lengthdata, stepsize)
end.i <- pmin(start.i + stepsize, lengthdata)
dat <- Map(testfunction, start.i + 1, end.i)
## Loops through the split iris data
for (i in dat) {
if (!exists("lmfit")) {
lmfit <- speedlm(Sepal.Length ~ Sepal.Width + Species , i)
} else if (!exists("lmfit2")) {
lmfit2 <- updateWithMoreData(lmfit, i)
} else {
lmfit2 <- updateWithMoreData(lmfit2, i)
}
}
print(summary(lmfit2))
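Before fitting, it may be worth checking how many species each split actually contains, since a random order makes a single-level split unlikely but does not strictly rule it out. The helper below is a hypothetical sketch (the name levels_per_chunk is not part of speedglm) and uses only base R.

```r
# Hypothetical helper: number of observed Species levels in each
# consecutive block of `stepsize` rows of `df`
levels_per_chunk <- function(df, stepsize) {
  chunk.id <- ceiling(seq_len(nrow(df)) / stepsize)
  vapply(split(df$Species, chunk.id),
         function(s) nlevels(droplevels(s)),
         integer(1))
}

# Original row order: most splits contain a single species
print(levels_per_chunk(iris, 20))

# Shuffled row order, as in the script above
set.seed(1)
iris.r <- iris[sample(nrow(iris)), ]
print(levels_per_chunk(iris.r, 20))
```

If any split reports fewer levels than expected, a different seed or a systematic ordering is one way to address it.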
Edit: Instead of the random order, you can use modulo division to generate a spread-out index vector in a systematic way:
spred.i <- seq(1, by = 7, length.out = 150) %% 150 + 1
iris.r <- iris[spred.i,]
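As a sanity check, the sketch below (base R only) confirms that this index visits every row exactly once, which holds because 7 and 150 share no common factor, and counts the species present in each 20-row split. The names is.permutation and species.per.chunk are introduced here for illustration only.

```r
n <- nrow(iris)   # 150
stepsize <- 20

# Same spread-out index as above, written in terms of n
spred.i <- seq(1, by = 7, length.out = n) %% n + 1

# Every row index from 1 to n should appear exactly once
is.permutation <- all(sort(spred.i) == seq_len(n))
print(is.permutation)

# Number of observed species in each consecutive block of 20 rows
iris.r <- iris[spred.i, ]
chunk.id <- ceiling(seq_len(n) / stepsize)
species.per.chunk <- vapply(split(iris.r$Species, chunk.id),
                            function(s) nlevels(droplevels(s)),
                            integer(1))
print(species.per.chunk)
```

With a step of 7, the seven full splits each contain all three species, while the final 10-row split contains only versicolor and virginica. Whether that matters for updateWithMoreData is worth testing. Choosing a stepsize that divides 150 evenly would avoid the short final split.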