I used the MASS::qda()
to find the classfier for my data and it always reported
`some group is too small for 'qda'
Is it due to the size of test data I used for model ? I increased the test sample size from 30 to 100, it reported the same error. Helpppppppp.....
set.seed(1345)
AllMono <- AllData[AllData$type == "monocot",]
MonoSample <- sample (1:nrow(AllMono), size = 100, replace = F)
set.seed(1355)
AllEudi <- AllData[AllData$type == "eudicot",]
EudiSample <- sample (1:nrow(AllEudi), size = 100, replace = F)
testData <- rbind (AllMono[MonoSample,],AllEudi[EudiSample,])
plot (testData$mono_score, testData$eudi_score, col = as.numeric(testData$type), xlab = "mono_score", ylab = "eudi_score", pch = 19)
qda (type~mono_score+eudi_score, data = testData)
Here is my data example
>head (testData)
sequence mono_score eudi_score type
PhHe_4822_404_76 DTRPTAPGHSPGAGH 51.4930 39.55000 monocot
SoBi_10_265860_58 QTESTTPGHSPSIGH 33.1408 2.23333 monocot
EuGr_5_187924_158 AFRPTSPGHSPGAGH 27.0000 54.55000 eudicot
LuAn_AOCW01152859.1_2_79 NFRPTEPGHSPGVGH 20.6901 50.21670 eudicot
PoTr_Chr07_112594_90 DFRPTAPGHSPGVGH 43.8732 56.66670 eudicot
OrSa.JA_3_261556_75 GVRPTNPGHSPGIGH 55.0986 45.08330 monocot
PaVi_contig16368_21_57 QTDSTTPGHSPSIGH 25.8169 2.50000 monocot
>testData$type <- as.factor (testData$type)
> dim (testData)
[1] 200 4
> levels (testData$type)
[1] "eudicot" "monocot" "other"
> table (testData$type)
eudicot monocot other
100 100 0
> packageDescription("MASS")
Package: MASS
Priority: recommended
Version: 7.3-29
Date: 2013-08-17
Revision: $Rev: 3344 $
Depends: R (>= 3.0.0), grDevices, graphics, stats, utils
My R version is R 3.0.2.
tl;dr my guess is that your predictor variables got made into factors or character vectors by accident. This can easily happen if you have some minor glitch in your data set, such as a spurious character in one row.
Here's a way to make up a data set that looks like yours:
set.seed(101)
mytest <- data.frame(type=rep(c("monocot","dicot"),each=100),
mono_score=runif(100,0,100),
dicot_score=runif(100,0,100))
Some useful diagnostics:
str(mytest)
## 'data.frame': 200 obs. of 3 variables:
## $ type : Factor w/ 2 levels "dicot","monocot": 2 2 22 2 2 2 ...
## $ mono_score : num 37.22 4.38 70.97 65.77 24.99 ...
## $ dicot_score: num 12.5 2.33 39.19 85.96 71.83 ...
summary(mytest)
## type mono_score dicot_score
## dicot :100 Min. : 1.019 Min. : 0.8594
## monocot:100 1st Qu.:24.741 1st Qu.:26.7358
## Median :57.578 Median :50.6275
## Mean :52.502 Mean :52.2376
## 3rd Qu.:77.783 3rd Qu.:78.2199
## Max. :99.341 Max. :99.9288
##
with(mytest,table(type))
## type
## dicot monocot
## 100 100
Importantly, the first two (str()
and summary()
) show us what type each variable is. Update: it turns out the third test is actually the important one in this case, since the problem was a spurious extra level: the droplevel()
function should take care of this problem ...
This made-up example seems to work fine, so there must be something you're not showing us about your data set ...
library(MASS)
qda(type~mono_score+dicot_score,data=mytest)
Here's a guess. If your score
variables were actually factors rather than numeric, then qda
would automatically attempt to create dummy variables from them which would then make the model matrix much wider (101 columns in this example) and provoke the error you're seeing ...
bad <- transform(mytest,mono_score=factor(mono_score))
qda(type~mono_score+dicot_score,data=bad)
## Error in qda.default(x, grouping, ...) :
## some group is too small for 'qda'