What is the proper way to format a categorical predictor to use in STAN? I cannot seem to input a categorical predictor as a normal factor variable, so what is the quickest way to transform a normal categorical variable such that Stan can accept it?
For example, say I had a a continue predictor and a categorical predictor
your_dataset = data.frame(income = c(62085.59, 60806.33, 60527.27, 67112.64, 57675.92, 58128.44, 60822.47, 55805.80, 63982.99, 64555.45),
country = c("England", "England", "England", "USA", "USA", "USA", "South Africa", "South Africa", "South Africa", "Belgium"))
Which looks like this:
income country
1 62085.59 England
2 60806.33 England
3 60527.27 England
4 67112.64 USA
5 57675.92 USA
6 58128.44 USA
7 60822.47 South Africa
8 55805.80 South Africa
9 63982.99 South Africa
10 64555.45 Belgium
How would I prepare this to be entered in rstan
?
It is correct that Stan only inputs real or integeger variables. In this case, you want to convert a categorical predictor into dummy variables (perhaps excluding a reference category). In R, you can do something like
dummy_variables <- model.matrix(~ country, data = your_dataset)
Which will look like this
(Intercept) countryEngland countrySouth Africa countryUSA
1 1 1 0 0
2 1 1 0 0
3 1 1 0 0
4 1 0 0 1
5 1 0 0 1
6 1 0 0 1
7 1 0 1 0
8 1 0 1 0
9 1 0 1 0
10 1 0 0 0
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$country
[1] "contr.treatment"
However, that might not come out to the right number of observations if you have unmodeled missingness on some other variables. This approach can be taken a step farther by inputting the entire model formula like
X <- model.matrix(outcome ~ predictor1 + predictor2 ..., data = your_dataset)
Now, you have an entire design matrix of predictors that you can use in a .stan program with linear algebra, such as
data {
int<lower=1> N;
int<lower=1> K;
matrix[N,K] X;
vector[N] y;
}
parameters {
vector[K] beta;
real<lower=0> sigma;
}
model {
y ~ normal(X * beta, sigma); // likelihood
// priors
}
Utilizing a design matrix is recommended because it makes your .stan program reusable with different variations of the same model or even different datasets.