
R Error : some group is too small for 'qda'


I used MASS::qda() to fit a classifier for my data, and it always reports

`some group is too small for 'qda'`

Is this due to the size of the test data I used for the model? I increased the test sample size from 30 to 100, but it reports the same error. Any help would be appreciated.

set.seed(1345)
AllMono <- AllData[AllData$type == "monocot",]
MonoSample <- sample (1:nrow(AllMono), size = 100, replace = F)
set.seed(1355)
AllEudi <- AllData[AllData$type == "eudicot",]
EudiSample <- sample (1:nrow(AllEudi), size = 100, replace = F)
testData <- rbind (AllMono[MonoSample,],AllEudi[EudiSample,])
plot (testData$mono_score, testData$eudi_score, col = as.numeric(testData$type), xlab = "mono_score", ylab = "eudi_score", pch = 19)
qda (type~mono_score+eudi_score, data = testData)

Here is my data example

> head (testData)
                                sequence mono_score eudi_score    type
PhHe_4822_404_76         DTRPTAPGHSPGAGH    51.4930   39.55000 monocot
SoBi_10_265860_58        QTESTTPGHSPSIGH    33.1408    2.23333 monocot
EuGr_5_187924_158        AFRPTSPGHSPGAGH    27.0000   54.55000 eudicot
LuAn_AOCW01152859.1_2_79 NFRPTEPGHSPGVGH    20.6901   50.21670 eudicot
PoTr_Chr07_112594_90     DFRPTAPGHSPGVGH    43.8732   56.66670 eudicot
OrSa.JA_3_261556_75      GVRPTNPGHSPGIGH    55.0986   45.08330 monocot
PaVi_contig16368_21_57   QTDSTTPGHSPSIGH    25.8169    2.50000 monocot

>testData$type <- as.factor (testData$type)

> dim (testData)
[1] 200   4

> levels (testData$type)
[1] "eudicot" "monocot" "other" 

> table (testData$type)
eudicot monocot   other 
    100     100       0

> packageDescription("MASS")
Package: MASS
Priority: recommended
Version: 7.3-29
Date: 2013-08-17
Revision: $Rev: 3344 $
Depends: R (>= 3.0.0), grDevices, graphics, stats, utils

My R version is R 3.0.2.


Solution

  • tl;dr my guess is that your predictor variables got made into factors or character vectors by accident. This can easily happen if you have some minor glitch in your data set, such as a spurious character in one row.

    Here's a way to make up a data set that looks like yours:

    set.seed(101)
    mytest <- data.frame(type=rep(c("monocot","dicot"),each=100),
                     mono_score=runif(100,0,100),
                     dicot_score=runif(100,0,100))
    

    Some useful diagnostics:

    str(mytest)
    ## 'data.frame':    200 obs. of  3 variables:
    ##  $ type       : Factor w/ 2 levels "dicot","monocot": 2 2 2 2 2 2 2 2 2 2 ...
    ##  $ mono_score : num  37.22 4.38 70.97 65.77 24.99 ...
    ##  $ dicot_score: num  12.5 2.33 39.19 85.96 71.83 ...
    summary(mytest)
    ##       type       mono_score      dicot_score     
    ##  dicot  :100   Min.   : 1.019   Min.   : 0.8594  
    ##  monocot:100   1st Qu.:24.741   1st Qu.:26.7358  
    ##                Median :57.578   Median :50.6275  
    ##                Mean   :52.502   Mean   :52.2376  
    ##                3rd Qu.:77.783   3rd Qu.:78.2199  
    ##                Max.   :99.341   Max.   :99.9288  
    ## 
    with(mytest,table(type))
    ## type
    ##   dicot monocot 
    ##     100     100 
    

    Importantly, the first two (str() and summary()) show us what type each variable is. Update: it turns out the third check (table()) is actually the important one in this case, since the problem was a spurious extra factor level with no observations in it: the droplevels() function should take care of this problem ...

    This made-up example seems to work fine, so there must be something you're not showing us about your data set ...

    library(MASS)
    qda(type~mono_score+dicot_score,data=mytest)
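
    The spurious-level problem mentioned in the update above can be reproduced with made-up data as well. This is only a sketch: the data frame `d` and its empty "other" level are assumptions chosen to mimic the three-level factor shown in the question, not the asker's real data.

    ```r
    library(MASS)

    # Made-up data; "other" is declared as a level but never occurs
    set.seed(101)
    d <- data.frame(
      type = factor(rep(c("monocot", "dicot"), each = 100),
                    levels = c("dicot", "monocot", "other")),
      mono_score  = runif(200, 0, 100),
      dicot_score = runif(200, 0, 100)
    )

    table(d$type)   # "other" is listed with a count of 0

    # The empty level is treated as a group with zero rows, so qda() stops;
    # tryCatch() captures the error message instead of halting the script
    res <- tryCatch(qda(type ~ mono_score + dicot_score, data = d),
                    error = function(e) conditionMessage(e))
    print(res)

    # Drop the unused level, then refit
    d$type <- droplevels(d$type)
    fit <- qda(type ~ mono_score + dicot_score, data = d)
    levels(d$type)
    ```

    Calling droplevels() on the whole data frame (d <- droplevels(d)) would do the same for every factor column at once.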
    

    Here's a guess. If your score variables were actually factors rather than numeric, then qda would automatically attempt to create dummy variables from them, which would make the model matrix much wider (101 columns in this example). Each group of 100 rows would then be too small relative to the number of predictors, which provokes the error you're seeing ...

    bad <- transform(mytest,mono_score=factor(mono_score))
    qda(type~mono_score+dicot_score,data=bad)
    ## Error in qda.default(x, grouping, ...) : 
    ##    some group is too small for 'qda'
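
    If this turns out to be the cause, one way to check and repair the predictors might look like the sketch below. The names `predictors` and `fixed` are made up for illustration, and `bad` is rebuilt here so the snippet runs on its own.

    ```r
    # Made-up data in which mono_score was accidentally stored as a factor
    set.seed(101)
    bad <- data.frame(
      type = factor(rep(c("monocot", "dicot"), each = 100)),
      mono_score  = factor(runif(200, 0, 100)),
      dicot_score = runif(200, 0, 100)
    )

    predictors <- c("mono_score", "dicot_score")

    # Which predictors are actually numeric?
    is_num <- sapply(bad[predictors], is.numeric)
    print(is_num)

    # Convert via character: as.numeric() applied directly to a factor
    # returns the integer level codes, not the original values
    fixed <- bad
    for (v in predictors[!is_num]) {
      fixed[[v]] <- as.numeric(as.character(fixed[[v]]))
    }

    # Entries that cannot be parsed as numbers become NA (with a warning);
    # a non-zero count here points at the glitchy rows in the raw data
    sapply(fixed[predictors], function(x) sum(is.na(x)))
    ```

    With real data, any NA values that appear after the conversion mark the rows worth inspecting for stray characters.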