I am using Vowpal Wabbit's contextual bandit to rank various action given a context.
Train Data:
"1:10:0.1 | 123"
"2:9:0.1 | 123"
"3:8:0.1 | 123"
"4:7:0.1 | 123"
"5:6:0.1 | 123"
"6:5:0.1 | 123"
"7:4:0.1 | 123"
Test Data:
" | 123"
Now, the expected ranking of action should be (from least loss to most loss):
7 6 5 4 3 2 1
Using --cb
just returns the most optimal action:
7
And using --cb_explore
returns a pdf of the actions to be explored but it doesn't seem to help in ranking.
[0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.9571428298950195]
Is there any other way of using vw's contextual bandit for ranking?
Olga's response on the repo: https://github.com/VowpalWabbit/vowpal_wabbit/issues/2555
--cb does not do any exploration and just trains the model given the input so the output will be what the model (that has been trained so far) predicted
--cb_explore includes exploration using epsilon-greedy by default if nothing else is specified. You can take a look at all the available exploration methods here
cb_explore's output is the PMF given by the exploration strategy (see here for more info).
Epsilon-greedy will choose, with probability e, an action at random from a uniform distribution (exploration), and with probability 1-e epsilon-greedy will use the so-far trained model to predict the best action (exploitation).
So the output will be the pmf over the actions (prob. 1-e OR e for the chosen action) and then the remaining probability will be equally split between the remaining actions. Therefore cb_explore will not provide you with a ranking.
One option for ranking would be to use CCB. Then you get a ranking and can provide feedback on any slot, but it is more computationally expensive. CCB runs CB for each slot, but the effect is a ranking since each slot draws from the overall pool of actions.
And my follow up:
I think CCB is a good option if computational limits allow. I'd just like to add that if you do cb_explore or cb_explore_adf then the resulting PMF should be sorted by score so it is a ranking of sorts. However, it's worth verifying that the ordering is in fact sorted by scores (--audit will help here) as I don't know if there is a test covering this.