I am working on a multinomial logistic regression problem (i.e., where I want to classify some unordered, independent levels of a nominal outcome variable). My issue is that I know the levels of the outcome variable (in this example, y = c('a','b','c')), and I know the predictor variables, their levels, and their classes (here, either numeric/integer or nominal). I also know the approximate distributions that should hold between each predictor and the outcome (e.g., higher values of x appear more frequently with y = 'a', while low values of x are randomly distributed across the other levels of y).
Essentially, I want to do four things: 1) generate a dataset of these variables that approximates my specified distributions; 2) run a multinomial logistic regression on the data with nnet::multinom(y ~ ., df); 3) use the resulting model to predict() the probability of each level of y for new data; and 4) retrieve those probabilities for further processing. I am not interested in the MLR model's accuracy or p-values, so I don't need to split my data into train/test samples, do k-fold cross-validation, or anything like that.
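For steps 2-4 the mechanics are straightforward once step 1 has produced the data; roughly something like the following, assuming a data frame df with outcome y and a single predictor x1 (the new-data values here are made up):

library(nnet)
m <- multinom(y ~ ., df)                      # step 2: fit the MLR
newdat <- data.frame(x1 = c(1, 5))            # step 3: hypothetical new cases
pr <- predict(m, newdata = newdat, type = 'probs')
pr                                            # step 4: one row of class probabilities per case

Step 1 is the part I'm stuck on.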
My initial thought was that this type of reverse-engineering a dataset from user-specified distributions couldn't be too uncommon, and that there was probably an R package/function for it. I haven't found any so far. My first approach was to manually specify the distributions for each level of each predictor against each level of the outcome, like so:
rm(list=ls())
set.seed(123)
# specify vars and levels -- y=outcome var
y <- c('a','b','c')
x <- c(1:5)
p <- c(1:4)
r <- c(1:8)
q <- c('foo','bar','hello','world') # nominal var
# sample data based on user-specified distributions/probs
df1 <- data.frame(x1=sample(x,100,T,prob=c(0.1,0.1,0.2,0.25,0.35)),
y='b')
df2 <- data.frame(x1=sample(x,200,T,prob=c(0.35,0.25,0.2,0.1,0.1)),
y=sample(c('a','c'),200,T))
df <- rbind(df1,df2)
# check distribution of x1 levels v. y levels
table(df$x1,df$y)
     a  b  c
  1 38  7 30
  2 29 11 26
  3 17 22 22
  4 14 26  7
  5 12 34  5
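In principle, the per-level sampling could be driven by a lookup list instead of written out call by call; a rough sketch (the gen_pred helper and its probability values are just illustrative, not from an existing package):

# sample a predictor's values conditional on the outcome, using one
# probability vector per outcome level (gen_pred is a made-up helper)
gen_pred <- function(y, levels, probs_by_y) {
  out <- vector(class(levels), length(y))
  for (lv in names(probs_by_y)) {
    idx <- y == lv
    out[idx] <- sample(levels, sum(idx), T, prob = probs_by_y[[lv]])
  }
  out
}
yy <- sample(y, 300, T)  # draw the outcome first
df0 <- data.frame(y = yy,
                  x1 = gen_pred(yy, x, list(a = c(.35,.25,.2,.1,.1),
                                            b = c(.1,.1,.2,.25,.35),
                                            c = c(.35,.25,.2,.1,.1))))
table(df0$x1, df0$y)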
Even so, every probability vector still has to be specified by hand, which gets tedious as the number of predictors grows and they have more levels. My next approach was to generate a random sample of data, run the MLR model, and tweak the model weights.
library(plyr)
library(dplyr)
library(tidyr)
# create random sample of each variable vector
# (name the vectors explicitly -- mget(ls()) would also pick up the
# data frames created above)
df <- ldply(mget(c('p','q','r','x','y')),
            function(v) sample(v, 1000, T)) %>%
  gather(k, v, -`.id`) %>%
  spread(`.id`, v) %>%
  select(-k)
str(df)
# convert the sampled vars back to numeric
# (the gather/spread round-trip left every column as character)
df[,c('p','r','x')] <-
  apply(df[,c('p','r','x')], 2, as.numeric)
glimpse(df)
Observations: 1,000
Variables: 5
$ p <dbl> 2, 2, 3, 1, 3, 2, 2, 4, 2, 4, 4, 3, 2, 4, 1, 4, 2, 1, 4, 3, 1, 3, 4, 3, 2, 2, 3...
$ q <chr> "bar", "bar", "foo", "bar", "world", "hello", "foo", "hello", "world", "hello",...
$ r <dbl> 2, 2, 1, 6, 6, 3, 4, 8, 6, 6, 2, 2, 8, 7, 7, 6, 3, 2, 4, 5, 2, 7, 1, 6, 3, 7, 8...
$ x <dbl> 2, 5, 1, 3, 3, 5, 2, 4, 1, 3, 5, 1, 5, 5, 2, 1, 1, 4, 4, 1, 5, 1, 5, 4, 4, 3, 2...
$ y <chr> "a", "c", "b", "a", "b", "a", "b", "c", "c", "b", "c", "c", "b", "a", "c", "b",...
# graph distribution of each predictor against each outcome -- not run here
# df %>% gather(k,v,-y) %>% group_by(y,k,v) %>%
# summarise(n=n()) %>%
# mutate(prop=n/sum(n)) %>%
# ggplot(aes(y,prop,fill=v)) + geom_bar(stat='identity',position='dodge') +
# facet_wrap(~k,scales='free') + theme(legend.position = 'none')
# run MLR model
library(nnet)
m <- multinom(y~.,df)
summary(m)$coefficients
m$wts # weights of the underlying nnet (the fitted coefficients plus masked entries)
# adjust weight 16, which is x in the y=b equation
m$wts[16] <- 1
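Since predict() works from the stored weights, the effect of an edit like this can be checked directly; e.g., reusing a few rows of df as stand-in new data:

# predicted class probabilities under the altered weights
predict(m, newdata = df[1:3, ], type = 'probs')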
Again, though, this is tedious when the number of predictors and levels is large, and as I continued altering the model weights and predicting new data, I got some unexpected probabilities (clearly, I don't know enough about MLR to use this method confidently).
So, I am more or less stuck at this stage. I have considered using multiple imputation or bootstrapping to generate the desired data, but I don't think either method applies here. MI imputes values for incomplete cases, whereas I want to specify a limited number of complete cases and extrapolate from there. Similarly, bootstrapping resamples the data on the assumption that the sample distribution approximates the population distribution, and I don't see how a limited number of specified cases could validly do that (perhaps bootstrapping plus permutation/shuffling?).
Anyway, any help/suggestions are greatly appreciated. And thanks to anyone who actually reads this lengthy post!
So, my solution has been to modify the randomly generated data, then use the modified data (which better approximates the desired distributions) to fit the MLR model.
I created two functions: one to revalue numeric predictors and one to revalue nominal predictors. The numeric function lets the user specify the direction in which the predictor's values should be redistributed, and whether to apply the redistribution to a specified level of the outcome variable or to every other level. Both functions are tested below on the example data from the question.
When I then refit the MLR model and predict new data, I get different probabilities for each outcome that better match my expectations (the refit-and-predict step is shown after the tests at the end).
# redistribute values for specific predictors -----------------------------
# at specific levels of outcome var
####
# define function for numeric vars: skews the values of xvar.name toward
# higher (direction=1) or lower (direction=-1) values, either for rows
# where the outcome equals yvar.level or, with yvar.level.opposite=TRUE,
# for all other rows; note that it looks up the predictor's full level
# vector (e.g., x, p, r) from the global environment via get()
revalue.nums <- function(data, yvar.name, yvar.level, xvar.name,
                         direction=1, yvar.level.opposite=FALSE){
  vals <- get(xvar.name)
  # rows whose predictor values will be resampled
  idx <- if (yvar.level.opposite) {
    data[[yvar.name]] != yvar.level
  } else {
    data[[yvar.name]] == yvar.level
  }
  # linearly increasing (direction=1) or decreasing (direction=-1)
  # sampling weights across the predictor's levels
  probs <- if (direction==1) {
    seq(min(vals), max(vals)) / sum(vals)
  } else {
    seq(max(vals), min(vals)) / sum(vals)
  }
  data[[xvar.name]][idx] <- sample(vals, sum(idx), T, prob=probs)
  return(data)
}
####
# define function for nominal vars: boosts the frequency of xvar.level
# among rows where the outcome equals yvar.level; the target level is
# temporarily recoded to '1' so it sorts first and receives the weight
# `probs`, while the remaining levels split `probs` equally (sample()
# rescales the weights internally, so they need not sum to 1)
revalue.chars <- function(data, yvar.name, yvar.level, xvar.name,
                          xvar.level, probs=0.25){
  vals <- get(xvar.name)
  idx <- data[[yvar.name]] == yvar.level
  data[[xvar.name]][idx] <-
    sample(sort(sub(xvar.level, '1', vals)), sum(idx), T,
           prob = c(probs, rep(probs / (length(vals)-1), length(vals)-1)))
  # undo the temporary recoding
  data[[xvar.name]][data[[xvar.name]] == '1'] <- xvar.level
  return(data)
}
###
# test functions on toy data
table(df$y,df$p) # original distribution
df1 <- revalue.nums(df,'y','a','p')
table(df1$y,df1$p) # changes y=a only, skewing p toward higher values
df1 <- revalue.nums(df1,'y','a','p',yvar.level.opposite = T,direction = -1)
table(df1$y,df1$p) # changes y!=a, skewing p toward lower values
table(df$y,df$q)
df2 <- revalue.chars(df,'y','b','q','hello',probs=0.5)
table(df2$y,df2$q) # increases the number of q='hello' & y='b' occurrences
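Finally, to close the loop on the original question, the revalued data can be refit and used for prediction; a minimal run (the new-data values here are made up):

# refit the MLR on data revalued toward the desired distributions,
# then predict class probabilities for new cases
df3 <- revalue.nums(df,'y','a','p')                      # skew p upward for y=a
df3 <- revalue.chars(df3,'y','b','q','hello',probs=0.5)  # boost q='hello' for y=b
m2 <- multinom(y~.,df3)
newdat <- data.frame(p=4, q='hello', r=3, x=2)
predict(m2, newdata=newdat, type='probs')                # probabilities for a, b, c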