I have a likelihood function that contains a bivariate normal CDF. I keep getting values close to one for the correlation, even when the true value is zero.
The R package sampleSelection maximizes a likelihood function that contains a bivaraite normal CDF (as in Van de Ven and Van Praag (1981)). I tried looking at the source code for the package, but couldn't find how they write the likelihood. For reference, Van de Ven and Van Praag's paper:
The Demand for Deductibles in Private Health Insurance: A Probit Model with Sample Selection.
The likelihood function is Equation (19), where H denotes the standard normal CDF and H_2 denotes the bivariate normal CDF.
My question:
Can someone tell me how the likelihood function is written in the sampleSelection package? or
Can someone tell me why I'm getting values of close to one for the correlation in the code below?
Here's the code that's keeping me up at night:
########################################################
#
# Trying to code Van de Ven and Van Praag (1981)
#
#
########################################################
library(MASS)
library(pbivnorm)
library(mnormt)
library(maxLik)
library(sampleSelection)
set.seed(1)
# Sample size
full_sample <- 1000
# Parameters
rho <- .1
beta0 <- 0
beta1 <- 1
gamma0 <- .2
gamma1 <- .5
gamma2 <- .5
varcovar <- matrix(c(1,rho,rho,1), nrow = 2, ncol = 2)
# Vectors for storing values
y <- rep(0,full_sample)
s <- rep(0,full_sample)
outcome <- rep(0,full_sample)
select <- rep(0,full_sample)
#######################
# Simulate data
#######################
x <- rnorm(full_sample)
z <- rnorm(full_sample)
errors <- mvrnorm(full_sample, rep(0,2), varcovar)
# note: 1st element for selection eq; 2nd outcome
s <- gamma0 + gamma1*x + gamma2*z + errors[,1]
y <- beta0 + beta1*x + errors[,2]
for(i in 1:full_sample){
if(s[i]>=0){
select[i] <- 1
if(y[i]>=0){
outcome[i] <- 1
}else{
outcome[i] <- 0
}
}else{
outcome[i] <- NA
select[i] <- 0
}
}
#######################################
# Writing the log likelihood
##
# Note: vega1= beta0,
# vega2= beta1,
# vega3= gamma0,
# vega4= gamma1,
# vega5= gamma2,
# vega6= rho
#######################################
first.lf <- function(vega) {
# Transforming this parameter becuase
# correlation is bounded between -1 aad 1
corr <- tanh(vega[6])
# Set up vectors for writing log likelihood
y0 <- 1-outcome
for(i in 1:full_sample) {
if(is.na(y0[i])){ y0[i]<- 0}
if(is.na(outcome[i])){ outcome[i]<- 0}
}
yt0 <- t(y0)
yt1 <- t(outcome)
missing <- 1 - select
ytmiss <- t(missing)
# Terms in the selection and outcome equations
A <- vega[3]+vega[4]*x+vega[5]*z
B <- vega[1]+vega[2]*x
term1 <- pbivnorm(A,B,corr)
term0 <- pbivnorm(A,-B,corr)
term_miss <- pnorm(-A)
log_term1 <- log(term1)
log_term0 <- log(term0)
log_term_miss <- log(term_miss)
# The log likelihood
logl <- sum( yt1%*%log_term1 + yt0%*%log_term0 + ytmiss%*%log_term_miss)
return(logl)
}
startv <- c(beta0,beta1,gamma0,gamma1,gamma2,rho)
# Maxmimizing my likelihood gives
maxLik(logLik = first.lf, start = startv, method="BFGS")
# tanh(7.28604) = 0.9999991, far from the true value of .1
# Using sampleSelection package for comparison
outcomeF<-factor(outcome)
selectEq <- select ~ x + z
outcomeEq <- outcomeF ~ x
selection( selectEq, outcomeEq)
# Notice the value of -0.2162 for rho compared to my 0.9999991
It happens that there is a typo in the paper in equation (19). The terms from i = N_1 + 1 to N should have -rho rather than rho. Hence, using
term0 <- pbivnorm(A,-B,-corr)
gives
maxLik(logLik = first.lf, start = startv, method="BFGS")
# Maximum Likelihood estimation
# BFGS maximization, 40 iterations
# Return code 0: successful convergence
# Log-Likelihood: -832.5119 (6 free parameter(s))
# Estimate(s): 0.3723783 0.9307454 0.1349979 0.4693686 0.4572421 -0.219618
as needed.