I am comparing two linear models in R with anova, and I would like to do the same thing in Java. To simplify, I took the example code from https://stats.stackexchange.com/questions/48854/why-am-i-getting-different-intercept-values-in-r-and-java-for-simple-linear-regr and modified it a bit below. The models are test_trait ~ geno_A + geno_B and test_trait ~ geno_A + geno_B + geno_A:geno_B. The coefficients of the models implemented in R and Java are the same. In R I use anova(fit, fit2), where the fits are the results of lm, and in Java I use TestUtils.oneWayAnovaPValue from org.apache.commons.math3. With R I get a p-value of 0.797, while with Java I get a p-value of 0.817, so this is not the right method, but I cannot find how to do it correctly. Is there an equivalent of R's anova.lm in Java?
The full code is below.
R
test_trait <- c(-0.48812477, 0.33458213, -0.52754476, -0.79863471, -0.68544309, -0.12970239, 0.02355622, -0.31890850, 0.34725819, 0.08108851)
geno_A <- c(1, 0, 1, 2, 0, 0, 1, 0, 1, 0)
geno_B <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0)
fit <- lm(test_trait ~ geno_A+geno_B)
fit2 <- lm(test_trait ~ geno_A + geno_B + geno_A:geno_B)
which gives the coefficients
> fit
Call:
lm(formula = test_trait ~ geno_A + geno_B)
Coefficients:
(Intercept)       geno_A       geno_B
   -0.03233     -0.10479     -0.60492
> fit2
Call:
lm(formula = test_trait ~ geno_A + geno_B + geno_A:geno_B)
Coefficients:
  (Intercept)         geno_A         geno_B  geno_A:geno_B
    -0.008235      -0.152979      -0.677208       0.096383
And the ANOVA:
> anova(fit, fit2) # 0.797
Analysis of Variance Table
Model 1: test_trait ~ geno_A + geno_B
Model 2: test_trait ~ geno_A + geno_B + geno_A:geno_B
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1      7 0.77982
2      6 0.77053  1 0.0092897 0.0723  0.797
Java
double [] y = {-0.48812477, 0.33458213,
-0.52754476, -0.79863471,
-0.68544309, -0.12970239,
0.02355622, -0.31890850,
0.34725819, 0.08108851};
double [][] x = {{1,0}, {0,0},
{1,0}, {2,1},
{0,1}, {0,0},
{1,0}, {0,0},
{1,0}, {0,0}};
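// Note: the third column of xb below is the elementwise product geno_A*geno_B,
// which encodes the interaction term geno_A:geno_B.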
double [][] xb = {{1,0,0}, {0,0,0},
{1,0,0}, {2,1,2},
{0,1,0}, {0,0,0},
{1,0,0}, {0,0,0},
{1,0,0}, {0,0,0}};
OLSMultipleLinearRegression regr = new OLSMultipleLinearRegression();
regr.newSampleData(y, x);
double[] beta = regr.estimateRegressionParameters();
System.out.printf("First model: y = int + genoA + genoB\n");
System.out.printf("Intercept: %.3f\t", beta[0]);
System.out.printf("beta1: %.3f\t", beta[1]);
System.out.printf("beta2: %.3f\n\n", beta[2]);
regr.newSampleData(y, xb);
double[] betab = regr.estimateRegressionParameters();
System.out.printf("Second model: y = int + genoA + genoB + genoA:genoB\n");
System.out.printf("Intercept: %.3f\t", betab[0]);
System.out.printf("beta1: %.3f\t", betab[1]);
System.out.printf("beta2: %.3f\t", betab[2]);
System.out.printf("beta2: %.3f\n", betab[3]);
Which gives the same coefficients as in R
First model: y = int + genoA + genoB
Intercept: -0.032 beta1: -0.105 beta2: -0.605
Second model: y = int + genoA + genoB + genoA:genoB
Intercept: -0.008 beta1: -0.153 beta2: -0.677 beta3: 0.096
But the ANOVA gives a different result:
List<double[]> classes = new ArrayList<>();
classes.add(beta);
classes.add(betab);
double pvalue = TestUtils.oneWayAnovaPValue(classes);
double fvalue = TestUtils.oneWayAnovaFValue(classes);
System.out.println(pvalue);
System.out.println(fvalue);
0.8165390406874127
0.05979444576790511
You're very much misunderstanding ANOVA in the case where you compare two regressions. That's not an ANOVA in the sense of oneWayAnova. The equivalent of oneWayAnova in R is the function aov. The function anova, on the other hand, implements a multitude of tests for the comparison of models, and the name anova is confusing to say the least...
If you compare two nested regression models, you want an F test on the sums of squares. What your code does is a one-way ANOVA to see whether the two sets of regression parameters differ significantly. That's not what you want to do, but it's exactly what your Java code is doing.
In order to calculate the correct F test, you need to do the following (a worked example follows this list):
1. Calculate the MSE for the largest model by dividing its residual sum of squares (RSS) by its degrees of freedom (df) (in the R table: 0.77053 / 6).
2. Calculate the MSE of the difference: subtract the RSS of both models (the result is "Sum of Sq" in the R table), subtract the df of both models (the result is "Df" in the R table), and divide the first number by the second.
3. Divide the result of step 2 by the result of step 1; this is the F value.
4. Calculate the p-value from the F value in step 3, using the df difference as the numerator df and the df of the largest model as the denominator df.
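Plugging in the numbers from the R table above as a check:
MSE     = 0.77053 / 6 ≈ 0.12842
MSEdiff = 0.0092897 / 1 = 0.0092897
F       = 0.0092897 / 0.12842 ≈ 0.0723
which matches the F column of the anova(fit, fit2) output, and Pr(>F) = 0.797 then follows from the F(1, 6) distribution.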
As far as I know, the class OLSMultipleLinearRegression doesn't have any convenient methods to extract the number of degrees of freedom, so this is not straightforward to do in Java. You'll have to calculate the df manually and then use the class FDistribution to calculate the p-value. For example:
import org.apache.commons.math3.distribution.FDistribution;
import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;

OLSMultipleLinearRegression regr = new OLSMultipleLinearRegression();

// Fit the smaller model and get its RSS and residual df.
regr.newSampleData(y, x);
double RSS1 = regr.calculateResidualSumOfSquares();
// df = n - number of coefficients, including the intercept
double df1 = y.length - (x[0].length + 1);

// Fit the larger model (with the interaction term) and do the same.
regr.newSampleData(y, xb);
double RSS2 = regr.calculateResidualSumOfSquares();
double df2 = y.length - (xb[0].length + 1);

double MSE = RSS2 / df2; // you need the biggest model here!
double MSEdiff = Math.abs((RSS2 - RSS1) / (df2 - df1));
double dfdiff = Math.abs(df2 - df1);
double Fval = MSEdiff / MSE;

FDistribution Fdist = new FDistribution(dfdiff, df2);
double pval = 1 - Fdist.cumulativeProbability(Fval);
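To see the result, print both values; the expected numbers below are taken from the R anova() output above:
System.out.println(Fval); // expected: ~0.0723
System.out.println(pval); // expected: ~0.797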
Now both the F value and the p value should be exactly what you see in the anova() table of R: df1 and df2 are the Res.Df column in the R table, their difference is the Df column, and MSEdiff should be the same as "Sum of Sq" divided by "Df" from the R table.
Disclaimer: I'm a bad Java programmer, so the code above is more conceptual than actual code. Please look for typos or stupid mistakes, and check the documentation of the FDistribution class I used here.
And now you know why statisticians use R instead of Java ;-)
EDIT: The FDistribution used in the code above is the class org.apache.commons.math3.distribution.FDistribution. There's also an FDistribution in JSci: JSci.maths.statistics.FDistribution.
If you use that one, the last part of the code becomes:
FDistribution Fdist = new FDistribution(dfdiff, df2);
double pval = 1 - Fdist.cumulative(Fval);
Depending on the exact implementation, the cumulative probabilities might differ slightly. Alas, I have no idea how big the difference is or which one can be trusted more.
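If you want to gauge that difference yourself, a quick sanity check is to evaluate both CDFs at the observed F value (a sketch, using dfdiff = 1 and df2 = 6 from the example above; both libraries are assumed to be on the classpath):
// Compare the two F-distribution CDF implementations at the observed F value.
double cdfCommons = new org.apache.commons.math3.distribution.FDistribution(1, 6)
        .cumulativeProbability(0.0723);
double cdfJSci = new JSci.maths.statistics.FDistribution(1, 6)
        .cumulative(0.0723);
System.out.println(cdfCommons + " vs " + cdfJSci); // the two values should be very close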