I have a list of actual values, Y, and a list of lists, predictions, where each inner list holds 100 predictions for the Y value at the same index.
How can I calculate the negative log likelihood of the predictions in Python? I'm guessing it will involve assuming the predictions are normally distributed and using the mean and variance.
There don't seem to be any existing packages that do this.
You can use log_loss from sklearn. That function takes two arrays of the same size, so you have to repeat each element of your list of actual Y values 100 times and flatten your list of lists into a single list. That way the two lists are aligned. Here is a mini-example of your problem with just 3 predictions per actual value instead of 100:
from sklearn.metrics import log_loss

y_true_raw = [1, 0, 0, 1, 0]
y_pred_raw = [
    [0, 1, 1],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]

# Repeat each true label once per prediction made for it.
y_true = []
for label in y_true_raw:
    for i in range(len(y_pred_raw[0])):
        y_true.append(label)

# Flatten the list of prediction lists into a single list.
y_pred = []
for label_list in y_pred_raw:
    y_pred.extend(label_list)

print(log_loss(y_true, y_pred))
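The repeat-and-flatten step above can also be written more compactly. This is just a sketch using the standard library and the same toy data; it only builds the two aligned lists, which you would then pass to log_loss as before:

```python
from itertools import chain

y_true_raw = [1, 0, 0, 1, 0]
y_pred_raw = [
    [0, 1, 1],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]

# Repeat each true label once for every prediction in its row.
y_true = [label for label, preds in zip(y_true_raw, y_pred_raw) for _ in preds]

# Flatten the nested predictions into one list, row by row.
y_pred = list(chain.from_iterable(y_pred_raw))
```

Unlike the loop version, this does not assume every row has the same number of predictions as the first one.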
By the way, I am assuming you are using a stochastic model that can give a different answer every time for a given input. Otherwise I don't see why you would have repeated predictions for a single data point.
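If your Y values are continuous rather than class labels, the approach guessed at in the question might look like the sketch below: fit a normal distribution to the 100 predictions for each item (mean and variance) and evaluate the negative log density of the actual value under it. The function name gaussian_nll, the min_var floor, and the toy numbers are my own assumptions for illustration, not part of any library, and whether a Gaussian is a good fit for your predictions is something you would need to check.

```python
import math
from statistics import fmean, pvariance


def gaussian_nll(y, samples, min_var=1e-12):
    """Mean negative log likelihood of each y[i] under a normal
    distribution fitted to samples[i] (hypothetical helper)."""
    total = 0.0
    for y_i, preds in zip(y, samples):
        mu = fmean(preds)
        # Population variance of the predictions; the floor (an assumption)
        # avoids division by zero when all predictions are identical.
        var = max(pvariance(preds, mu), min_var)
        total += 0.5 * (math.log(2 * math.pi * var) + (y_i - mu) ** 2 / var)
    return total / len(y)


# Made-up toy data: 2 actual values, 4 predictions each instead of 100.
Y = [1.0, 2.5]
predictions = [[0.8, 1.1, 1.2, 0.9], [2.0, 2.4, 2.9, 2.7]]
print(gaussian_nll(Y, predictions))
```

This returns the average over items; multiply by len(Y) if you want the summed negative log likelihood instead. With 100 predictions per item, using the sample variance instead of the population variance makes little difference.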